Lunch break, ridge plots with Bokeh

Bokeh is a great visualization Python library. In this short post, we are going to use it to create a ridge plot.

closeup

For that purpose, we use the COVID-19 death data from Johns Hopkins University, and plot the daily normalized death rate (100000 * number of daily deaths / population) per EU(+UK) country.

Imports

import colorcet as cc
import numpy as np
import pandas as pd
from bokeh.io import show, output_notebook
from bokeh.models import ColumnDataSource, DatetimeTickFormatter
from bokeh.plotting import figure
output_notebook()
# Johns Hopkins University data url
URL = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
Loading BokehJS ...

Load and prepare the data

Load the COVID-19 data into a dataframe:

deaths = pd.read_csv(URL)
deaths.head(2)
Province/StateCountry/RegionLat...8/24/208/25/20
0NaNAfghanistan33.93911...13891397
1NaNAlbania41.15330...254259

2 rows × 221 columns

Also load a list of EU countries:

countries = (
    pd.read_csv(
        "https://pkgstore.datahub.io/opendatafortaxjustice/listofeucountries/listofeucountries_csv/data/5ab24e62d2ad8f06b59a0e7ffd7cb556/listofeucountries_csv.csv"
    )
    .values[:, 0]
    .tolist()
)
# Match country names
countries = [c if c != "Czech Republic" else "Czechia" for c in countries]
countries = [c if c != "Slovak Republic" else "Slovakia" for c in countries]
n_countries = len(countries)
print(countries)
['Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland', 'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands', 'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'United Kingdom']

We select EU countries in the COVID-19 data:

deaths_eu = deaths.loc[deaths["Country/Region"].isin(countries)].copy(deep=True)
# cleanup
deaths_eu.drop(["Province/State", "Lat", "Long"], axis=1, inplace=True)
deaths_eu = deaths_eu.groupby("Country/Region").sum()  # with overseas territories
deaths_eu.index.name = "Country"
assert len(deaths_eu) == n_countries
deaths_eu.head(2)
1/22/201/23/20...8/24/208/25/20
Country
Austria00...733733
Belgium00...99969996

2 rows × 217 columns

Now we load the population count by country into a dataframe. The CSV file comes from this website.

pop = pd.read_csv(
    "./data/population-figures-by-country-csv_csv.csv",
    usecols=["Country", "Country_Code", "Year_2016"],
)
pop.loc[pop.Country == "Czech Republic", "Country"] = "Czechia"
pop.loc[pop.Country == "Slovak Republic", "Country"] = "Slovakia"

And select EU countries:

pop_eu = pop[pop.Country.isin(countries)].copy(deep=True)
pop_eu.drop("Country_Code", axis=1, inplace=True)
pop_eu.set_index("Country", drop=True, inplace=True)
assert len(pop_eu) == n_countries
pop_eu.head(2)
Year_2016
Country
Austria8747358.0
Belgium11348159.0

This population data date back to 2016, but it is recent enough for this blog post...

We compute the death density as the number of deaths per 100000 inhabitants for each country:

dd_eu = deaths_eu.div(pop_eu.Year_2016, axis=0) * 100000
dd_eu.head(2)
1/22/201/23/20...8/24/208/25/20
Country
Austria0.00.0...8.3796738.379673
Belgium0.00.0...88.08477288.084772

2 rows × 217 columns

Now we pivot the dataframe, convert the index into a DatetimeIndex:

dd_eu = dd_eu.T
dd_eu.index = pd.to_datetime(dd_eu.index)
dd_eu.tail(2)
CountryAustriaBelgium...SwedenUnited Kingdom
2020-08-248.37967388.084772...58.69866163.255251
2020-08-258.37967388.084772...58.70875963.279627

2 rows × 28 columns

and compute a smoothed daily count of deaths per 100000 inhabitants:

nd = 5
rate = (
    dd_eu.diff()
    .rolling(nd, center=True)
    .median()
    .rolling(3 * nd, center=False)
    .mean()
    .dropna()
)
rate.tail(2)
CountryAustriaBelgium...SwedenUnited Kingdom
2020-08-220.0068590.066971...0.0269280.015032
2020-08-230.0068590.066971...0.0276010.014423

2 rows × 28 columns

Let's reorder the countries from lowest to highest maximum daily death rate:

order = rate.max(axis=0).sort_values().index.values.tolist()
rate = rate[order]
rate.tail(2)
CountryLatviaSlovakia...SpainBelgium
2020-08-220.03.700743e-18...0.0256940.066971
2020-08-230.03.700743e-18...0.0291390.066971

2 rows × 28 columns

Here we duplicate the last row in order to later create nice looking Bokeh Patches (with a vertical line on the right side):

rate = pd.concat([rate, rate.tail(1)], axis=0)
rate.iloc[-1] = 0.0

We choose a color palette (linear sampling):

palette = [cc.rainbow[int(i * 9)] for i in range(len(order))]

Finally we can create the ridge plot.

Plot

Most of the following code comes from Bokeh's documentation.

def ridge(category, data, scale=5):
    return list(zip([category] * len(data), scale * data))

source = ColumnDataSource(data=dict(x=rate.index.values))
p = figure(
    y_range=order,
    plot_height=900,
    plot_width=900,
    toolbar_location=None,
    title="Daily normalized rate of COVID-19 deaths per EU(+UK) country",
)
p.title.text_font_size = "15pt"
p.xaxis.major_label_text_font_size = "10pt"
p.yaxis.major_label_text_font_size = "10pt"
for i, country in enumerate(order):
    y = ridge(country, rate[country])
    source.add(y, country)
    p.patch(
        "x",
        country,
        color=palette[i],
        alpha=0.25,
        line_color="black",
        line_alpha=0.5,
        source=source,
    )
p.outline_line_color = None
p.background_fill_color = "#efefef"
p.xaxis.formatter = DatetimeTickFormatter(days="%m/%d")
p.ygrid.grid_line_color = None
p.xgrid.grid_line_color = "#dddddd"
p.xgrid.ticker = p.xaxis.ticker
p.axis.minor_tick_line_color = None
p.axis.major_tick_line_color = None
p.axis.axis_line_color = None
p.y_range.range_padding = 0.85
show(p)

Ridge plot

The highest rate in this plot was reached in Belgium:

rate["Belgium"].max()
2.4268253555488593
str(rate["Belgium"].idxmax().date())
'2020-04-21'