Python

Airflow 2.0 - How it works

Airflow 2.0 & how it works: Scheduling and beyond. In recent years, Airflow has become a major player in scheduling a wide variety of actions. From running basic Python scripts to making API calls, up to acting as an ETL tool, it can address nearly…


Loading data from CSV files into a Tableau Hyper extract

Hyper is Tableau’s in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets. In the present notebook, we are going to create a Tableau Hyper extract from CSV files in Python.…


More Heapsort in Cython

This post/notebook is the follow-up to a recent one: Heapsort with Numba and Cython, where we implemented heapsort in Python/Numba and Cython and compared the execution time with NumPy heapsort. However, heapsort in NumPy is written in C++…


Loading data from PostgreSQL to Pandas with ConnectorX

ConnectorX is a library, written in Rust, that enables fast and memory-efficient data loading from various databases to different dataframes. We refer to this interesting paper, in which the authors provide a detailed analysis of the pandas.read_sql…


Export data as fast as possible: from HANA to CSV

What is the fastest method to export HANA data (table or query result) to CSV? I use a HANA 2.0 database. I want to export a table or a SQL query from the database to an external client as fast as possible and using a command line (I’m on…


Heapsort with Numba and Cython

Updated April 20, 2022 following useful comments by @scoder. Thank you very much for your pull requests! Heapsort is a classical sorting algorithm. We are going to go into a little bit of theory about the algorithm, but refer to Cormen et al. [1] for…
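
As a rough illustration of the algorithm named in this excerpt, here is a minimal pure-Python sketch of heapsort using a binary max-heap. It is not the Numba or Cython code from the post; the names `heapsort` and `sift_down` are chosen here for illustration only.

```python
def heapsort(values):
    """Return a new sorted list, using a binary max-heap stored in a list."""
    a = list(values)  # work on a copy so the input is left untouched
    n = len(a)

    def sift_down(root, end):
        # Move a[root] down until the max-heap property holds within a[:end].
        while True:
            child = 2 * root + 1  # index of the left child
            if child >= end:
                return
            # Pick the larger of the two children.
            if child + 1 < end and a[child + 1] > a[child]:
                child += 1
            if a[root] >= a[child]:
                return
            a[root], a[child] = a[child], a[root]
            root = child

    # Build the heap, starting from the last parent node.
    for root in range(n // 2 - 1, -1, -1):
        sift_down(root, n)

    # Repeatedly move the current maximum to the end of the unsorted part.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)
    return a
```

Calling `heapsort([5, 2, 9, 1])` returns a new sorted list and leaves the argument unchanged.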


Using Tableau to detect outlying trends and ruptures

Tableau detects outlying trends using TabPy. To do that, we are going to get the outliers of a dataset, the change points and a piecewise approximation. For detailed information about the source code, you can look at this article: Outlier…


Plotting population density with datashader

In this short post, we are using the Global Human Settlement Layer from the European Commission: This spatial raster dataset depicts the distribution of population, expressed as the number of people per cell. The downloaded file has a worldwide…


Applying a row-wise function to a Pandas dataframe

More than 3 years ago, we posted a comparative study about Looping over Pandas data using a CPU. Because a lot has changed since 2018, this post is a kind of update. For example, the Pandas version tag was 0.23.3 at that time; it is now 1.4.0.…


A Parallel loop in Python with Joblib.Parallel

The goal of this post is to perform an embarrassingly parallel loop in Python, with the same code running on different platforms [Linux and Windows]. From Wikipedia, here is a definition of embarrassingly parallel: In parallel computing, an…


Python Spatial Join with GeoPandas (and GEOS)

Updated Sep 13, 2021. The purpose of this post is to perform an "efficient" spatial join in Python. What is a spatial join? Here is the definition from wiki.gis.com: A Spatial join is a GIS operation that affixes data from one feature layer’s…


Built-in Expectations in Great Expectations

Great Expectations is a Python tool for data testing, documentation, and profiling. Here is a figure from the documentation describing its purpose: Great Expectations makes it easy to include data testing in your ML pipeline, when dealing with…


Goldbach's Comet with Numba and Datashader

Updated Jul 29, 2021. This Python notebook is about computing and plotting the Goldbach function. It requires some basic mathematical knowledge, nothing fancy! The main point is to perform some computations with Numba and some efficient plotting with…
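
To make the quantity concrete, here is a small pure-Python sketch. It assumes the usual definition of the Goldbach function as the number of unordered pairs of primes (p, q), p ≤ q, with p + q = n; the post itself may use a different convention, and its Numba implementation is not reproduced here. The helper names `prime_sieve` and `goldbach_count` are hypothetical.

```python
def prime_sieve(limit):
    """Sieve of Eratosthenes: is_prime[k] is True iff k is prime (limit >= 1)."""
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return is_prime


def goldbach_count(n, is_prime):
    """Count unordered prime pairs (p, q), p <= q, such that p + q == n."""
    return sum(1 for p in range(2, n // 2 + 1) if is_prime[p] and is_prime[n - p])


# Values for the first few even numbers; plotting them against n gives the "comet".
is_prime = prime_sieve(1000)
counts = {n: goldbach_count(n, is_prime) for n in range(4, 1001, 2)}
```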


Le Tour de France history web scraping

LeTour data set. This file downloads raw data about every rider of every Tour de France (from 1903 up to 2020). This data will then be postprocessed and stored in CSV format. Executing this notebook might take a few minutes. 1) Retrieve URLs for data…


Some Pre-commit git hooks for Python

Pre-commit hooks are a great way to automatically check and clean the code. They are executed when committing changes to git. This can be useful when several people are working on the same package with different code styles, but also to help…


Quick data exploration with pandas, matplotlib and seaborn

In this JupyterLab Python notebook we are going to look at the rate of coronavirus [COVID-19] cases in French departments [administrative divisions of France]. The data source is the French government's open data. We are going to perform a few…


Optuna and XGBoost on a tabular dataset

Updated Sep 16, 2021 following a comment by @k_nzw about XGBoostPruningCallback. The purpose of this Python notebook is to give a simple example of hyperparameter optimization [HPO] using Optuna and XGBoost. We are going to perform a regression…


Saving a tf.keras model with data normalization

Training a DL model might take some time and require some special hardware, so you may want to save the model to use it later or on another computer. In this short Python notebook, we are going to create a very simple tensorflow.keras…


Benford's law and the population of French cities

In this Python notebook, we are going to look at Benford's law, which predicts the leading digit distribution, when dealing with some real-world collections of numbers. This distribution usually occurs when the numbers are rather smoothly…
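
For reference, Benford's law gives the probability of a leading digit d as log10(1 + 1/d). The sketch below computes that expected distribution and compares it with the observed leading digits of a sample. The sample used here (the first hundred powers of 2) is only a stand-in, not the city population data from the post, and the function names are chosen for illustration.

```python
import math
from collections import Counter


def benford_expected():
    """Expected leading-digit probabilities under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """First decimal digit of a non-zero integer."""
    return int(str(abs(int(n)))[0])


def leading_digit_frequencies(numbers):
    """Observed leading-digit frequencies, ignoring zeros."""
    digits = [leading_digit(n) for n in numbers if int(n) != 0]
    if not digits:
        raise ValueError("need at least one non-zero number")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Stand-in sample: powers of 2 are a classic sequence that follows the law closely.
sample = [2 ** k for k in range(1, 101)]
expected = benford_expected()
observed = leading_digit_frequencies(sample)
```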


Merge Sort with Cython and Numba

In this post, we present an implementation of the classic merge sort algorithm in Python on NumPy arrays, and make it run reasonably "fast" using Cython and Numba. We are going to compare the run time with the numpy.sort(kind='mergesort')…
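
As a reminder of the algorithm itself, here is a minimal top-down merge sort sketch. It assumes plain Python lists rather than the NumPy arrays used in the post, and it is not the Cython or Numba code discussed there.

```python
def merge_sort(values):
    """Return a new sorted list using top-down merge sort (stable)."""
    a = list(values)
    if len(a) <= 1:
        return a

    # Split in two halves and sort each half recursively.
    mid = len(a) // 2
    left = merge_sort(a[:mid])
    right = merge_sort(a[mid:])

    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # "<=" keeps equal items in their original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```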


Logistic regression with JAX

JAX is a Python package for automatic differentiation from Google Research. It is a really powerful and efficient library. JAX can automatically differentiate some Python code [it supports both reverse and forward mode]. It can also speed up the…


Lunch break, ridge plots with Bokeh

Bokeh is a great Python visualization library. In this short post, we are going to use it to create a ridge plot. For that purpose, we use the COVID-19 death data from Johns Hopkins University, and plot the daily normalized death rate (100000 *…


Lunch break, fetching AROME temperature forecast in Lyon

Since a "small" heat wave is coming, I would like to get some temperature forecast for the next hours in my neighborhood, from my JupyterLab notebook. We are going to fetch the results from the Météo-France AROME 0.01 model. Here is the AROME item…


Outlier and Change Point Detection in Data Integration Performance Metrics

Data integration involves combining data residing in different sources, and providing users with a unified view of them. In this post, we are interested in detecting performance drift of large and complex daily data integration processes performed…


Lunch break, plotting excess death in French department zones with Python

Daily death data are provided by INSEE - the national institute of statistics and economic studies. Here is the link to the page displaying these data, and here is a short description: During the Covid-19 pandemic, INSEE is reporting the number of…


A Quick study of air quality in Lyon with Python

The aim of this post is to use Python to fetch air quality data from a web service and to create a few plots. We are going to: look at some daily data from earlier this year, before and after the lockdown, in the city of Lyon, France; compare 2020…


Fitting a logistic curve to time series in Python

In this notebook we are going to fit a logistic curve to time series stored in Pandas, using a simple linear regression from scikit-learn to find the coefficients of the logistic curve. Disclaimer: although we are going to use some COVID-19 data in…


Cython and Numba applied to a simple algorithm: Insertion sort

The aim of this notebook is to show a basic example of Cython and Numba, applied to a simple algorithm: Insertion sort. As we will see, the code transformation from Python to Cython or Python to Numba can be really easy [specifically for the…
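
For context, a plain-Python version of insertion sort might look like the sketch below. This is the kind of baseline that would then be typed for Cython or decorated for Numba; it is written here on a list copy and is not taken from the post.

```python
def insertion_sort(values):
    """Return a new sorted list using insertion sort."""
    a = list(values)
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        # Shift larger items one slot to the right to make room for key.
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
```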


Lunch break: plotting traffic injuries with datashader

Well, I love the datashader Python package and I am always happy to use it on some interesting datasets. I recently came across a traffic injury database for French roads, which happens to have some geographical coordinates. This comes from the…


Fetching AROME weather forecasts and plotting temperatures

Accurate weather forecasts might be very useful for various types of models. In this post, we are going to download the latest available weather forecasts for France and plot some temperature fields, using different Python libraries: Arome is…


Some cool open-source Python packages for Machine Learning Ep 4

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following these ones: Some cool open-source Python packages…


Some cool open-source Python packages for Machine Learning Ep 3

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following these ones: Some cool open-source Python packages…


First try of auto-sklearn

Since we are big users of scikit-learn and XGBoost, we wanted to try a package that would automate the process of building a machine learning model with these tools. Here is the introduction to auto-sklearn from its github.io website:…


Some cool open-source Python packages for Machine Learning Ep 2

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following that one: Some cool open-source Python packages for…


Loading data into a Pandas DataFrame - a performance study

Because doing machine learning implies trying many options and algorithms with different parameters, from data cleaning to model validation, Python programmers will often load a full dataset into a Pandas dataframe, without actually…


Some cool open-source Python packages for Machine Learning Ep 1

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some "active", open-source packages that may be useful for ML day-to-day activities. Of course, this list is far from being exhaustive and should evolve as fast…


GPU Analytics Ep 3, Apply a function to the rows of a dataframe

The goal of this post is to compare the execution time between Pandas (CPU) and RAPIDS (GPU) dataframes, when applying a simple mathematical function to the rows of a dataframe. Since the row-wise applied function is a re-projection of geographical…


Nighttime Lights with Rasterio and Datashader

In this post, we are going to plot some satellite GeoTIFF data in Python. The data is provided by NOAA (single GeoTIFF: F16_20100111-20110731_rad_v4.avg_vis.tif): The Operational Linescan System (OLS) flown on the Defense Meteorological Satellite…


Symmetric Chaos with Datashader and Numba

Map equation and coefficient values are taken from here. Some mathematical explanations can be found here, by Mike Field and Martin Golubitsky. import numpy as np import pandas as pd import datashader as ds from datashader import transfer_functions…


Plotting Hopalong attractor with Datashader and Numba

What is an attractor? Definition from Wikipedia: In the mathematical field of dynamical systems, an attractor is a set of numerical values toward which a system tends to evolve, for a wide variety of starting conditions of the system. System values…


Looping over Pandas data

I recently stumbled on this interesting post on RealPython (excellent website by the way!): Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects. This post covers different subjects related to Pandas: creating a datetime column…


Pandas Time Series example with some historical land temperatures

Monthly averaged historical temperatures in France and over the global land surface. The aim of this notebook is just to play with time series along with a couple of statistical and plotting libraries. Imports %matplotlib inline import pandas as pd…