Quick data exploration with pandas, matplotlib and seaborn



In this JupyterLab Python notebook we look at the rate of coronavirus [COVID-19] cases in French departments [administrative divisions of France]. The data source is the French government's open data.

We are going to perform a few operations, such as filtering data, pivoting tables, smoothing time series with a rolling window and plotting a heatmap.

Disclaimer: although this notebook uses COVID-19 data, I want the reader to know that I have ABSOLUTELY no knowledge of epidemiology or any medicine-related subject. The point of this post is not COVID-19 at all, but only to show an application of the Python data stack.

Imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import colorcet as cc
FS = (16, 8)  # figure size

Loading the data

We load the data from a URL straight into a pandas DataFrame:

tests = pd.read_csv(
    "https://www.data.gouv.fr/fr/datasets/r/406c6a23-e283-4300-9484-54e78c8ae675",
    sep=";",
    low_memory=False,
)
tests.head(2)
  dep        jour  P   T  cl_age90      pop
0  01  2020-05-13  0  16         9  83001.0
1  01  2020-05-13  1  17        19  84665.0
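
One thing to watch when loading this kind of file: department codes such as 01 or 2A are identifiers rather than numbers. Below is a minimal sketch, on a made-up two-row sample [not the real file], of reading `dep` as text and parsing `jour` as dates at load time. The `dtype` and `parse_dates` arguments are an assumption of this sketch and are not used in the notebook itself.

```python
import io

import pandas as pd

# Made-up sample that mimics the semicolon-separated layout; not real data.
sample = io.StringIO(
    "dep;jour;P;T;cl_age90;pop\n"
    "01;2020-05-13;0;16;9;83001\n"
    "2A;2020-05-13;1;17;19;84665\n"
)

# dtype keeps the leading zero of "01"; parse_dates converts "jour" to datetime64.
df = pd.read_csv(sample, sep=";", dtype={"dep": str}, parse_dates=["jour"])
print(df.dtypes)
```

With the codes kept as strings, a filter such as `df.dep == "01"` behaves the same way for numeric-looking and alphanumeric codes.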

We have 6 columns here:

  • dep : department's code
  • jour : date
  • P : number of positive tests per day
  • T : number of tests per day
  • cl_age90 : age group
  • pop : population corresponding to an age group and a department

There are 11 age group values; the value 0 gathers all age groups:

tests.cl_age90.unique().tolist()
[9, 19, 29, 39, 49, 59, 69, 79, 89, 90, 0]

For example, in Paris [department 75], the population per age group is:

depnum = "75"
pop_paris = (
    tests[tests.dep == depnum][["cl_age90", "pop"]]
    .drop_duplicates()
    .set_index("cl_age90")
)
ax = pop_paris.plot.bar(figsize=FS, alpha=0.6)
ax.grid()
_ = ax.set(
    title=f"Population in department {depnum} per age group", ylabel="Population",
)

# Sanity check: the age groups sum to the total stored under group 0
assert (
    pop_paris[pop_paris.index > 0].sum().values[0]
    == pop_paris[pop_paris.index == 0].values[0][0]
)
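
The same consistency check can be sketched for every department at once with a `groupby`. The frame below is made up [two hypothetical departments A and B with two age groups each], so this only illustrates the idea rather than a result on the real data.

```python
import pandas as pd

# Made-up populations: group 0 is meant to be the total of the other groups.
toy = pd.DataFrame(
    {
        "dep": ["A", "A", "A", "B", "B", "B"],
        "cl_age90": [9, 19, 0, 9, 19, 0],
        "pop": [10.0, 20.0, 30.0, 5.0, 7.0, 12.0],
    }
)

# Sum of the individual age groups, per department.
by_group = toy[toy.cl_age90 > 0].groupby("dep")["pop"].sum()
# Total stored under age group 0, per department.
total = toy[toy.cl_age90 == 0].set_index("dep")["pop"]

# Element-wise comparison, aligned on the department code.
consistent = by_group.eq(total)
print(consistent)
```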

We start by creating a DatetimeIndex:

tests.jour = pd.to_datetime(tests.jour, format="%Y-%m-%d")
tests.set_index("jour", inplace=True)
tests.index.name = "Date"
tests.head(2)
           dep  P   T  cl_age90      pop
Date
2020-05-13  01  0  16         9  83001.0
2020-05-13  01  1  17        19  84665.0
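
A DatetimeIndex is what makes the time-based operations used later [rolling windows, weekly resampling] convenient. As a small illustration on made-up values, it also allows selecting a whole month with a partial date string:

```python
import pandas as pd

# Made-up daily counts around a month boundary.
toy = pd.DataFrame(
    {"jour": ["2020-05-30", "2020-05-31", "2020-06-01"], "P": [1, 2, 3]}
)
toy["jour"] = pd.to_datetime(toy["jour"], format="%Y-%m-%d")
toy = toy.set_index("jour").rename_axis("Date")

# Partial string indexing: every row whose date falls in May 2020.
may = toy.loc["2020-05"]
print(may)
```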

COVID-19 and test rates in the Rhône department

Now we select a department [Rhône, code 69], keeping only the rows that gather all age groups:

depnum = "69"
dep_tot = tests[(tests.dep == depnum) & (tests.cl_age90 == 0)].copy(deep=True)
dep_tot.drop(["dep", "cl_age90"], axis=1, inplace=True)
dep_tot.head(2)
             P     T        pop
Date
2020-05-13  20  1468  1876051.0
2020-05-14  41  1531  1876051.0

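We can now compute and plot the daily COVID-19 rate and the daily test rate per 100000 inhabitants for all age groups in this department, both smoothed with a centered 7-day rolling mean: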

ax = (
    (100000 * dep_tot.P / dep_tot["pop"])
    .rolling(7, center=True)
    .mean()
    .plot(style="-", figsize=FS, logy=True, alpha=0.6)
)
ax = (
    (100000 * dep_tot["T"] / dep_tot["pop"])
    .rolling(7, center=True)
    .mean()
    .plot(style="-", ax=ax, logy=True, alpha=0.6)
)
ax.grid()
_ = ax.set(
    title=f"Daily COVID-19 and test rates (per 100000) in department {depnum} (log scale)",
    ylabel="Rate per 100000 (log scale)",
)
_ = ax.legend(["Daily COVID-19 rate", "Daily test rate"])
ax.autoscale(enable=True, axis="x", tight=True)
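
To see what `rolling(7, center=True).mean()` does, here is a small sketch on a made-up series of ten daily values. Each output value is the mean of the day itself, the three days before and the three days after, so the first and last three days have no complete window and come out as NaN.

```python
import pandas as pd

# Made-up daily values 1.0, 2.0, ..., 10.0.
s = pd.Series(
    range(1, 11), index=pd.date_range("2020-05-13", periods=10), dtype=float
)

# Centered 7-day window: 3 days before, the day itself, 3 days after.
smooth = s.rolling(7, center=True).mean()
print(smooth)
```

This is also why the smoothed curves stop three days before the last available date.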

We can also show the positivity rate, that is, the percentage of tests that are positive:

ax = (100 * dep_tot.P / dep_tot["T"]).rolling(7, center=True).mean().plot(figsize=FS)
ax.grid()
_ = ax.set(
    title=f"Positivity rate in department {depnum}", ylabel="Positivity rate (%)",
)
ax.autoscale(enable=True, axis="x", tight=True)
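
Note that the line above averages the daily ratios. A possible alternative, sketched here on made-up numbers only, divides the rolling sum of positive tests by the rolling sum of tests, which gives more weight to days with many tests. Which of the two is preferable depends on the question being asked.

```python
import pandas as pd

# Made-up counts for one week.
idx = pd.date_range("2020-05-13", periods=7)
toy = pd.DataFrame(
    {"P": [1, 0, 2, 1, 0, 3, 3], "T": [10, 5, 20, 10, 5, 30, 20]}, index=idx
)

# Approach used in this notebook: rolling mean of the daily positivity rate.
mean_of_ratios = (100 * toy.P / toy["T"]).rolling(7, center=True).mean()

# Alternative: positivity rate computed over the whole 7-day window.
ratio_of_sums = (
    100
    * toy.P.rolling(7, center=True).sum()
    / toy["T"].rolling(7, center=True).sum()
)
print(mean_of_ratios.iloc[3], ratio_of_sums.iloc[3])
```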

Departments with the worst COVID-19 rate

First we need to select departments with a large enough population [at least 50000 inhabitants for example] for a rate per 100000 to be meaningful. Here is the population per department:

pop = (
    tests[tests.cl_age90 == 0][["dep", "pop"]]
    .drop_duplicates()
    .sort_values(by="pop")
    .reset_index(drop=True)
)
pop.head()
   dep       pop
0  975    5997.0
1  977    9961.0
2  978   35334.0
3   48   76286.0
4   23  116270.0

We create a list of departments with a population above a threshold value:

pop_th = 50000
large_deps = pop[pop["pop"] > pop_th].dep.values.tolist()

Now we pivot the table such that each column corresponds to a department:

# .copy() avoids modifying a slice of `tests` (and the SettingWithCopyWarning)
cr_alldep = tests[tests.cl_age90 == 0][["dep", "P", "pop"]].copy()
cr_alldep["cr"] = 100000 * cr_alldep.P / cr_alldep["pop"]
cr_alldep.drop(["pop", "P"], axis=1, inplace=True)
cr_alldep = cr_alldep.pivot_table(index="Date", columns="dep", values="cr")
cr_alldep = cr_alldep[large_deps]  # keep only departments above the population threshold
cr_alldep.head(2)
dep          48        23  ...        75        59
Date
2020-05-13  0.0  0.860067  ...  1.768864  1.390505
2020-05-14  0.0  0.860067  ...  2.606747  1.622255

2 rows × 102 columns
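
To make the reshaping step more concrete, here is a small sketch on made-up values with two hypothetical departments A and B. The long table has one row per date and department; after `pivot_table`, there is one row per date and one column per department, and a list of column labels selects a subset of departments.

```python
import pandas as pd

# Made-up long-format table indexed by date.
idx = pd.to_datetime(["2020-05-13", "2020-05-13", "2020-05-14", "2020-05-14"])
long = pd.DataFrame(
    {"dep": ["A", "B", "A", "B"], "cr": [1.0, 2.0, 3.0, 4.0]}, index=idx
).rename_axis("Date")

# One column per department; "Date" refers to the index level by name.
wide = long.pivot_table(index="Date", columns="dep", values="cr")

# Selecting a subset of departments, as done with large_deps.
subset = wide[["B"]]
print(wide)
```

`pivot_table` aggregates duplicates with a mean by default; here each date and department pair appears once, so the values are unchanged.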

Let's look at the 5 departments with the highest smoothed COVID-19 rate on the most recent day for which the rolling mean is available:

n_deps = 5
deps = (
    cr_alldep.rolling(7, center=True)
    .mean()
    .dropna()
    .iloc[-1]
    .sort_values(ascending=False)[:n_deps]
    .index.values.tolist()
)
deps
['93', '95', '94', '77', '75']
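
The `sort_values(ascending=False)[:n_deps]` idiom can also be written with `Series.nlargest`. A minimal sketch on made-up rates [the numbers below are not taken from the data]:

```python
import pandas as pd

# Made-up smoothed rates for four departments on the last available day.
last_day = pd.Series({"93": 40.0, "75": 25.0, "77": 30.0, "69": 10.0})

# Labels of the two largest values, in decreasing order of rate.
top = last_day.nlargest(2).index.tolist()
same = last_day.sort_values(ascending=False)[:2].index.tolist()
print(top)
```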

We can now plot the evolution of the COVID-19 rate in these 5 most affected departments:

highest_cr = cr_alldep[deps]
ax = highest_cr.rolling(7, center=True).mean().plot(figsize=FS, alpha=0.6)
ax.grid()
_ = ax.set(
    title="Daily COVID-19 rate (per 100000) in the most affected departments",
    ylabel="COVID-19 rate",
)
ax.autoscale(enable=True, axis="x", tight=True)

Now we are going to focus on the department with the highest COVID-19 rate.

Heatmap of the COVID-19 rate by age group in the most affected department

We start by pivoting the table such that each column corresponds to an age group:

depnum = deps[0]
dep_ag = tests[(tests.dep == depnum) & (tests.cl_age90 != 0)].copy(deep=True)
dep_ag["cr"] = 100000 * dep_ag.P / dep_ag["pop"]
dep_ag.drop(["dep", "P", "T", "pop"], axis=1, inplace=True)
dep_ag = dep_ag.pivot_table(index="Date", columns="cl_age90", values="cr")
dep_ag.head(2)
cl_age90           9        19  ...         89         90
Date
2020-05-13  1.160501  1.324369  ...  19.272929  96.936797
2020-05-14  0.000000  0.000000  ...  26.500277   9.693680

Next, we compute the weekly mean of the daily rate and transpose the table, so that rows are age groups and columns are weeks:

cr_smooth = dep_ag.resample("W").mean().T
cr_smooth = cr_smooth.sort_index(ascending=False)
cr_smooth.columns = [t.date() for t in cr_smooth.columns]
cr_smooth.head(2)
          2020-05-17  2020-05-24  ...  2021-03-14  2021-03-21
cl_age90
90         23.264831   24.926605  ...   54.007644   74.318211
89         12.527404    7.915667  ...   60.916221   63.038538

2 rows × 45 columns
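
The column labels above are Sundays because `resample("W")` groups days into weeks ending on Sunday and labels each week by its last day. A small sketch on made-up daily values starting on Wednesday 2020-05-13:

```python
import pandas as pd

# Seven made-up daily values, Wednesday 2020-05-13 to Tuesday 2020-05-19.
idx = pd.date_range("2020-05-13", periods=7)
daily = pd.DataFrame(
    {9: [1.0] * 7, 19: [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]}, index=idx
)

# Weekly mean; each bin is labeled by the Sunday that ends the week.
weekly = daily.resample("W").mean()
print(weekly.T)  # rows: age groups, columns: week-ending dates
```

The first bin only contains five days [Wednesday to Sunday], so the first and last weeks of a series may be averaged over fewer than seven days.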

We can now plot the heatmap:

fig, ax = plt.subplots(figsize=(25, 8))
ax = sns.heatmap(
    cr_smooth.astype(int), ax=ax, annot=True, cbar=False, fmt="d", cmap=cc.fire[::-1]
)
_ = ax.set(
    title=f"Weekly mean of the daily COVID-19 rate (per 100000) in department {depnum}",
    xlabel="Date",
    ylabel="Age group",
)
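
One detail about the annotations: `astype(int)` truncates the decimals instead of rounding them. If rounded values are preferred, a possible variant is to call `round()` first, as in this sketch, which uses four values from the table above rounded to two decimals and hypothetical column names `w1` and `w2`:

```python
import pandas as pd

# Four values from the weekly table above, with hypothetical column names.
vals = pd.DataFrame({"w1": [23.26, 12.53], "w2": [24.93, 7.92]}, index=[90, 89])

truncated = vals.astype(int)        # drops the decimals: 12.53 -> 12
rounded = vals.round().astype(int)  # rounds to nearest:  12.53 -> 13
print(truncated)
print(rounded)
```

Both variants assume the table has no missing values, since NaN cannot be converted to an integer.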