Fitting a logistic curve to time series in Python
In this notebook we are going to fit a logistic curve to time series stored in Pandas, using a simple linear regression from scikitlearn to find the coefficients of the logistic curve.
Disclaimer: although we are going to use some COVID19 data in this notebook, I want the reader to know that I have ABSOLUTELY no knowledge in epidemiology or any medicinerelated subject, and clearly state that the result of fitting logistic curve to these data is an incredibly simplistic and naive approach. The point of this post is not the COVID19 at all but only to show an application of the Python data stack.
Edit: here is an interesting post about the difficulty of time series forecasting with logistic curves: Forecasting scurves is hard by Constance Crozier.
Let's start by decribing the logistic curve.
The Logistic curve
A logistic curve is a common Sshaped curve [sigmoid curve]. It can be usefull for modelling many different phenomena, such as [from wikipedia]:
 population growth
 tumor growth
 concentration of reactants and products in autocatalytic reactions
The equation is the following:
where

$t_{0}$ is the sigmoid's midpoint, 
$L$ is the curve's maximum value, 
$k$ is the logistic growth rate.
Here is an example of a logistic curve fitted to data of AIDS cases in the US:
Source: http://www.nlreg.com/aids.htm
Let's start by importing the libraries.
Imports
import itertools
from datetime import timedelta
import numpy as np
from matplotlib import pyplot as plt
plt.style.use("seaborn")
import pandas as pd
from sklearn.linear_model import LinearRegression
FS = (16, 9) # figure size
The Parameters
We have 3 parameters in the logistic curve:
t = np.linspace(5, 15, 1000)
fig = plt.figure(figsize=(10, 18))
ax = fig.add_subplot(3, 1, 1)
t0, L = 5., 10000.
for k in [0.5, 1., 2., 4.]:
D = L / (1. + np.exp(k * (t  t0)))
_ = plt.plot(t, D, label=f'k={k}')
_ = ax.legend()
_ = ax.set_xlabel('t')
ax = fig.add_subplot(3, 1, 2)
L, k = 10000., 2
for t0 in [2 , 4, 6, 8]:
D = L / (1. + np.exp(k * (t  t0)))
_ = plt.plot(t, D, label=f't0={t0}')
_ = ax.legend()
_ = ax.set_xlabel('t')
ax = fig.add_subplot(3, 1, 3)
t0, k = 5., 2
for L in range(4, 8):
L *= 2000
D = L / (1. + np.exp(k * (t  t0)))
_ = plt.plot(t, D, label=f'L={L}')
_ = ax.legend()
_ = ax.set_xlabel('t')
So how are we going to fit these paramters? We are going to use a linear regression to find some values for
In order to get a linear equation, we need to describe the logistic differential equation.
The logistic differential equation
If we differentiate
As you can infer from this equation, the proportional growth rate
\begin{equation} \frac{dD / dt}{D} = k \left( 1  \frac{D}{L} \right) \end{equation}
So the basic idea for fitting a logistic curve is the following:
 plot the proportional growth rate as a function of
$D$  try to find a range where this curve is close to linear
If we actually find a "large" interval of data for which the proportional growth rate is a linear function of
 find the coefficients of the linear function
$y=ax+b$ using a linear regression  compute
$L$ and$k$ from these coefficient [$k=b$,$L=k/a$ ]  find a value of
$t_0$ such that the logistic curve is as close as possible to the data on the interval of data [for which the proportional growth rate is a linear function of$D$ ]
Note that this process is very subjective! When applied to real data, we rarely find a strictly linear proportional growth rate and can very different sigmoid shapes just by choosing different intervals on which we apply the linear regression. Just because we can technically fit a line to a point cloud does not mean that it is appropriate.
Source: https://xkcd.com/1725/
Data
The data comes from the 2019 Novel Coronavirus COVID19 Data Repository by Johns Hopkins CSSE on github. We are going to look at the deathsbycountry time series.
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
df = pd.read_csv(url)
total = df.drop(['Province/State', 'Lat', 'Long'], axis=1).groupby('Country/Region').sum().max(axis=1).sort_values(ascending=False)
countries = total.index.to_list()
deaths = pd.DataFrame()
for country in countries:
temp = df[df['Country/Region'] == country][df.columns[4:]].T.sum(axis=1)
temp.index = pd.to_datetime(temp.index)
temp = temp.to_frame(country)
deaths = pd.concat([deaths, temp], axis=1)
start = '202003'
ax = deaths[['Italy', 'Spain', 'France']][start:].plot(style='', figsize=FS)
markers = itertools.cycle(("o", "v", "^", "<", ">", "s", "p", "P", "*", "h", "X", "D", '.'))
for i, line in enumerate(ax.get_lines()):
marker = next(markers)
line.set_marker(marker)
_ = ax.legend()
_ = ax.set_title("Number of COVID19 deaths per country")
Functions
For each of these 3 countries we are going to try to fit a logistic curve, using the following 4 functions:
def plot_ratios(country, death_min=0, death_max=None):
""" Plot the proportional growth rate with respect to D,
on the interval (death_min, death_max).
"""
country = country[country > death_min]
if death_max is not None:
country = country[country < death_max]
slopes = 0.5 * (country.diff(1)  country.diff(1))
ratios = slopes / country
x = country.values[1:1]
y = ratios.values[1:1]
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
_ = plt.plot(x, y, 'o')
_ = ax.set_xlabel('D(t)')
_ = ax.set_ylabel('Ratios of slopes to function values')
return x, y
def linear_regression(x, y):
""" Find the coefficients of the linear function y=ax + b,
using a linear regression.
"""
X = x.reshape(1, 1)
reg = LinearRegression(fit_intercept=True, normalize=True)
_ = reg.fit(X, y)
a = reg.coef_[0]
b = reg.intercept_
y_hat = a * x + b
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
_ = plt.plot(x, y, 'o')
_ = plt.plot(x, y_hat, '', label='Linear regression')
_ = ax.set_xlabel('D(t)')
_ = ax.set_ylabel('Ratios of slopes to function values')
_ = ax.legend()
return a, b
def plot_t0(a, b, country, death_min=0, death_max=None, t0=0):
""" Find a value of t0 such that the logistic curve is as close
as possible to the data on the given interval.
"""
k = b
L = b / a
country = country[country > death_min]
if death_max is not None:
country = country[country < death_max]
logis = L / (1. + np.exp(k * (np.arange(len(country))  t0)))
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
_ = plt.plot(logis, 'o')
_ = plt.plot(country.values, 'd')
return L, k
def extended_plot(country, death_min, L, k, t0, days_before=30, days_after=30, figsize=(16, 9)):
""" Plot the logistic curve on an extended interval,
against the real data.
"""
country_original = country.copy(deep=True)
country = country[country > death_min]
country = country.to_frame('Deaths')
country_start = country.index.min()
country_end = country.index.max()
start=country_start  timedelta(days=days_before)
end=country_end + timedelta(days=days_after)
ix = pd.date_range(start=start, end=end, freq='D')
country = country.reindex(ix)
country['idx'] = np.arange(len(country))
country['idx'] = country.loc[country_start, 'idx']
country['logistic'] = L / (1. + np.exp(k * (country['idx'].values  t0)))
ax = country['logistic'].plot(figsize=figsize,logy=False)
_ = country_original[start:].plot(ax=ax, style='o')
Italy
country_name = 'Italy'
country = deaths[country_name]
x, y = plot_ratios(country)
It looks like the slope is changing several times. Let's focus on the last part of the curve.
death_min = 11000
x, y = plot_ratios(country, death_min)
a, b = linear_regression(x, y)
t0 = 1.5
L, k = plot_t0(a, b, country, death_min, t0=t0)
extended_plot(country, death_min, L, k, t0, figsize=FS)
Spain
country_name = 'Spain'
country = deaths[country_name]
x, y = plot_ratios(country)
Again, it looks like a piecewise linear curve and we are going to focus on the last part of the curve.
death_min = 11000
x, y = plot_ratios(country, death_min)
a, b = linear_regression(x, y)
t0 = 1.5
L, k = plot_t0(a, b, country, death_min, t0=t0)
extended_plot(country, death_min, L, k, t0, figsize=FS)
France
country_name = 'France'
country = deaths[country_name]
x, y = plot_ratios(country)
It does not look linear at all, but let's proceed with the regression anyway.
death_min = 200
x, y = plot_ratios(country, death_min)
a, b = linear_regression(x, y)
t0 = 17.5
L, k = plot_t0(a, b, country, death_min, t0=t0)
extended_plot(country, death_min, L, k, t0, figsize=FS)