Python
Parquet file sorting test
Update Nov 17, 2023 - Added results using the latest DataFusion version. Some time ago, we came across an intriguing Parquet sorting test shared by Mimoune Djouallah on Twitter @mim_djo. The test involves reading a Parquet file, sorting the table,…
Installing your Python package on a Windows machine that does not have internet access
Suppose you've developed a Python package called MyPackage on Linux, with specific package requirements, and need to install it on a Windows machine that lacks internet access, and on which you may not have any specific privileges. This blog post will…
Calculating daily mean temperatures with scikit-learn
The goal of this post is to predict the daily mean air temperature TAVG from the following climate data variables: maximum and minimum daily temperatures and daily precipitation, using Python and some machine learning techniques available in…
Vector similarity search with pgvector
In the realm of vector databases, pgvector emerges as a noteworthy open-source extension tailored for Postgres databases. This extension equips Postgres with the capability to efficiently perform vector similarity searches, a powerful technique…
Using a local sentence embedding model for similarity calculation
A simple yet powerful use case of sentence embeddings is computing the similarity between different sentences. By representing sentences as numerical vectors, we can leverage mathematical operations to determine the degree of similarity. For the…
Python plot - Antarctic sea ice extent
Data source: https://ads.nipr.ac.jp/vishop/#/extent (REGION SELECTOR = Antarctic). At the bottom of the page: Download the sea ice extent (CSV file) - seasonal dataset. From the National Institute of Polar Research (Japan) website: The sea-ice extent…
Python plot - North Atlantic daily water surface temperature
Updated July 11, 2023 (data update). Data source: https://climatereanalyzer.org [NOAA Optimum Interpolation SST [OISST] dataset version 2.1]. From the NOAA website: The NOAA 1/4° Daily Optimum Interpolation Sea Surface Temperature [OISST] is a long…
Hyperpath routing in the context of transit assignment
How do transit passengers choose their routes in a complex network of lines and services? How can we estimate the distribution of passenger flows and the performance of transit systems? These are some of the questions that transit assignment models…
TPC-H benchmark of DuckDB and Hyper on native files
In this blog post, we examine the performance of two popular SQL engines for querying large files: Tableau Hyper (proprietary license) and DuckDB (MIT license). These engines have gained popularity due to their efficiency, ease of use, and Python APIs.…
TPC-H benchmark of Hyper and DuckDB on Windows and Linux OS
Update Apr 12, 2023 - It seems that Windows 11's poor performance may be due to conflicting BIOS/OS settings when dual-booting. We are investigating... Additionally, I have corrected the version of Windows 11 in the post from Home to Professional.…
TPC-H benchmark of Hyper, DuckDB and Datafusion on Parquet files
Update Apr 14, 2023 - An issue has been opened on the DataFusion GitHub repository regarding its poor reported performance compared to DuckDB and Hyper in this specific case: #5942. While there may be multiple factors contributing to this unexpected…
Dijkstra's algorithm in Cython, part 3/3
Running time of Dijkstra's algorithm on DIMACS networks with various implementations in Python. This post is the last part of a three-part series, following the first part and the second part. In the present post, we compare the in-house implementation of Dijkstra's…
Dijkstra's algorithm in Cython, part 2/3
This post is the second part of a three-part series. In the first part, we looked at the Cython implementation of Dijkstra's algorithm. In the current post, we are going to compare different priority queue implementations, using Dijkstra's…
Dijkstra's algorithm in Cython, part 1/3
In this post, we are going to present an implementation of Dijkstra's algorithm in Cython. Dijkstra's algorithm is a shortest path algorithm. It was conceived by Edsger W. Dijkstra in 1956, and published in 1959 [1].…
A Cython implementation of a priority queue
Credit: Musée de l'illusion, Lyon [picture taken by myself]. In this post, we describe a basic Cython implementation of a priority queue. A priority queue is an important data structure in…
Visualizing some polynomial roots with Datashader
Last weekend I found this interesting tweet by sara: The above figure shows all the complex roots from the various polynomials of degree 10 with coefficients in the set $\left\{ -1, 1 \right\}$. It made me think of Bohemian matrix…
Forward and reverse stars in Cython
This notebook is the follow-up to a previous one, where we looked at the forward and reverse star representations of a sparse directed graph in pure Python: Forward and reverse star representation of a digraph. The motivation is to access the…
Forward and reverse star representation of a digraph
In this Python notebook, we are going to focus on a graph representation of directed graphs : the forward star representation [and its opposite, the reverse star]. The motivation here is to access a network topology and associated data efficiently,…
Query Parquet files with DuckDB and Tableau Hyper engines
In this notebook, we are going to query some Parquet files with the following SQL engines: DuckDB : an in-process SQL OLAP database management system. We are going to use its Python Client API [MIT license]. Tableau Hyper : an in-memory data…
Download some benchmark road networks for Shortest Paths algorithms
Updated September 26, 2022 (bugfix). The goal of this Python notebook is to download and prepare a suite of benchmark networks for some shortest path algorithms. We would like to experiment with some simple directed graphs with non-negative weights. We…
Euler's number and the uniform sum distribution
Last year I stumbled upon this tweet from @fermatslibrary [1]: I find it a little bit intriguing for Euler's number $e$ to appear here! But actually, it is not uncommon to encounter $e$ in probability theory, as explained by Stefanie Reichert…
Testing DuckDB performance with Discogs data
This notebook is a small example of using DuckDB with the Python API. What is DuckDB? DuckDB is an in-process SQL OLAP Database Management System. It is a relational DBMS that supports SQL. OLAP stands for Online analytical processing,…
Dynamic TaskGroup in Airflow 2.0
TaskGroup feature in Airflow 2.0 - Dynamic creation. In this article we will uncover a way to use Airflow's new feature called TaskGroup, which allows you to manage your dependencies in a dynamic way. Many articles show you how to use them in a…
Dynamic TaskGroup Scalability in Airflow 2.0
Dynamic TaskGroup Scalability in Airflow 2.0 - Handle big DAGs. In the previous article I showed you how to instantiate TaskGroup in a dynamic way. We will now see how we face the challenge of using it at a larger scale. This is not an easy task: you…
Reading a SQL table by chunks with Pandas
In this short Python notebook, we want to load a table from a relational database and write it into a CSV file. In order to do that, we temporarily store the data in a Pandas dataframe. Pandas is used to load the data with read_sql() and later to…
Apache Airflow 2.0 : How it works
Apache Airflow 2.0 & how it works: Scheduling and beyond. In recent years, Airflow has become a major player for scheduling a wide variety of actions. From running basic Python scripts to API calls, up to acting as an ETL tool, it can address…
Loading data from CSV files into a Tableau Hyper extract
Hyper is Tableau’s in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets. In the present notebook, we are going to create a Tableau Hyper extract from CSV files in Python.…
More Heapsort in Cython
This post/notebook is the follow-up to a recent one : Heapsort with Numba and Cython, where we implemented heapsort in Python/Numba and Cython and compared the execution time with NumPy heapsort. However, heapsort in NumPy is written in C++…
Loading data from PostgreSQL to Pandas with ConnectorX
ConnectorX is a library, written in Rust, that enables fast and memory-efficient data loading from various databases to different dataframes. We refer to this interesting paper, in which the authors provide a detailed analysis of the pandas.read_sql…
Export data as fast as possible : from HANA to CSV
What is the fastest method to export from HANA to CSV? I use a HANA 2.0 database. I want to export from HANA to CSV. The source is a table or a SQL query, the target is an external client, and the export should of course be as fast as possible and use a command line (I’m on…
Heapsort with Numba and Cython
Updated April 20, 2022 following useful comments by @scoder. Thank you very much for your pull requests! Heapsort is a classical sorting algorithm. We are going into a little bit of theory about the algorithm, but refer to Cormen et al. [1] for…
Using Tableau to detect outlying trends and ruptures
Python Tableau detects outlying trends using Tabpy. To do that, we are going to get the outliers of a dataset, the change points and a piecewise approximation. For detailed information about the source code, you can look at this article : Outlier…
Plotting population density with datashader
In this short post, we are using the Global Human Settlement Layer from the European Commission: This spatial raster dataset depicts the distribution of population, expressed as the number of people per cell. The downloaded file has a worldwide…
Applying a row-wise function to a Pandas dataframe
More than 3 years ago, we posted a comparative study about Looping over Pandas data using a CPU. Because a lot has changed since 2018, this post is a kind of update. For example, the Pandas version was 0.23.3 at that time; it is now 1.4.0.…
A Parallel loop in Python with Joblib.Parallel
The goal of this post is to perform an embarrassingly parallel loop in Python, with the same code running on different platforms [Linux and Windows]. From Wikipedia, here is a definition of embarrassingly parallel: In parallel computing, an…
Python Spatial Join with GeoPandas (and GEOS)
Updated Sep 13, 2021. The purpose of this post is to perform an "efficient" spatial join in Python. What is a spatial join? Here is the definition from wiki.gis.com: A Spatial join is a GIS operation that affixes data from one feature layer’s…
Built-in Expectations in Great Expectations
Great Expectations is a Python tool for data testing, documentation, and profiling. Here is a figure from the documentation describing its purpose: Great Expectations makes it easy to include data testing in your ML pipeline, when dealing with…
Goldbach's Comet with Numba and Datashader
Updated Jul 29, 2021. This Python notebook is about computing and plotting the Goldbach function. It requires some basic mathematical knowledge, nothing fancy! The main point is to perform some computations with Numba and some efficient plotting with…
Le tour de france history web scraping
This file downloads raw data about every rider of every Tour de France (from 1903 up to 2020). This data will then be postprocessed and stored in CSV format. Executing this notebook might take some minutes. 1) Retrieve urls for data extract First we…
Some Pre-commit git hooks for Python
Pre-commit hooks are a great way to automatically check and clean the code. They are executed when committing changes to git. This can be useful when several people are working on the same package with different code styles, but also to help…
Quick data exploration with pandas, matplotlib and seaborn
In this JupyterLab Python notebook we are going to look at the rate of coronavirus [COVID-19] cases in French departments [administrative divisions of France]. The data source is the French government's open data. We are going to perform a few…
Optuna and XGBoost on a tabular dataset
Updated Sep 16, 2021 following a comment by @k_nzw about XGBoostPruningCallback. The purpose of this Python notebook is to give a simple example of hyperparameter optimization [HPO] using Optuna and XGBoost. We are going to perform a regression…
Saving a tf.keras model with data normalization
Training a DL model might take some time, and make use of some special hardware. So you may want to save the model for using it later, or on another computer. In this short Python notebook, we are going to create a very simple tensorflow.keras…
Benford's law and the population of french cities
In this Python notebook, we are going to look at Benford's law, which predicts the leading digit distribution, when dealing with some real-world collections of numbers. This distribution usually occurs when the numbers are rather smoothly…
Merge Sort with Cython and Numba
In this post, we present an implementation of the classic merge sort algorithm in Python on NumPy arrays, and make it run reasonably "fast" using Cython and Numba. We are going to compare the run time with the numpy.sort(kind='mergesort')…
Logistic regression with JAX
JAX is a Python package for automatic differentiation from Google Research. It is a really powerful and efficient library. JAX can automatically differentiate some Python code [supports the reverse- and forward-mode]. It can also speed up the…
Minimizing continuous non-convex functions with Optuna
In this post, we are going to deal with single-objective continuous optimization problems, using the open-source Optuna Python package. Here is a very short description of this library from their github repository: Optuna is an automatic…
Lunch break, ridge plots with Bokeh
Bokeh is a great visualization Python library. In this short post, we are going to use it to create a ridge plot. For that purpose, we use the COVID-19 death data from Johns Hopkins University, and plot the daily normalized death rate (100000 *…
Lunch break, fetching AROME temperature forecast in Lyon
Since a "small" heat wave is coming, I would like to get some temperature forecast for the next hours in my neighborhood, from my JupyterLab notebook. We are going to fetch the results from the Météo-France AROME 0.01 model. Here is the AROME item…
Outlier and Change Point Detection in Data Integration Performance Metrics
Data integration involves combining data residing in different sources, and providing users with a unified view of them. In this post, we are interested in detecting performance drift of large and complex daily data integration processes performed…
Lunch break, plotting excess death in french department zones with Python
Daily deaths data are provided by INSEE - the national institute of statistics and economic studies. Here is the link of the page displaying these data, and here is a short description: During the Covid-19 pandemic, INSEE is reporting the number of…
A Quick study of air quality in Lyon with Python
The aim of this post is to use Python to fetch air quality data from a web service and to create a few plots. We are going to: look at some daily data from earlier this year, before and after the lockdown, in the city of Lyon, France compare 2020…
Fitting a logistic curve to time series in Python
In this notebook we are going to fit a logistic curve to time series stored in Pandas, using a simple linear regression from scikit-learn to find the coefficients of the logistic curve. Disclaimer: although we are going to use some COVID-19 data in…
Cython and Numba applied to a simple algorithm: Insertion sort
The aim of this notebook is to show a basic example of Cython and Numba, applied to a simple algorithm: Insertion sort. As we will see, the code transformation from Python to Cython or Python to Numba can be really easy [specifically for the…
Lunch break: plotting traffic injuries with datashader
Well, I love the datashader Python package and I am always happy to use it on some interesting datasets. I recently came across a traffic injury database for French roads, which happens to have some geographical coordinates. This comes from the…
Fetching AROME weather forecasts and plotting temperatures
Accurate weather forecasts might be very useful for various types of models. In this post, we are going to download the latest available weather forecasts for France and plot some temperature fields, using different Python libraries: Arome is…
Some cool open-source Python packages for Machine Learning Ep 4
There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following these ones: Some cool open-source Python packages…
Some cool open-source Python packages for Machine Learning Ep 3
There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following these ones: Some cool open-source Python packages…
First try of auto-sklearn
Since we are big users of scikit-learn and XGBoost, we wanted to try a package that would automate the process of building a machine learning model with these tools. Here is the introduction to auto-sklearn from its github.io website:…
Some cool open-source Python packages for Machine Learning Ep 2
There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following that one: Some cool open-source Python packages for…
Loading data into a Pandas DataFrame - a performance study
Because doing machine learning implies trying many options and algorithms with different parameters, from data cleaning to model validation, the Python programmers will often load a full dataset into a Pandas dataframe, without actually…
Some cool open-source Python packages for Machine Learning Ep 1
There is a very rich ecosystem of Python libraries related to ML. Here is a list of some "active", open-source packages that may be useful for ML day-to-day activities. Of course, this list is far from being exhaustive and should evolve as fast…
GPU Analytics Ep 3, Apply a function to the rows of a dataframe
The goal of this post is to compare the execution time between Pandas (CPU) and RAPIDS (GPU) dataframes, when applying a simple mathematical function to the rows of a dataframe. Since the row-wise applied function is a re-projection of geographical…
Nighttime Lights with Rasterio and Datashader
In this post, we are going to plot some satellite GeoTIFF data in Python. The data is provided by NOAA (single GeoTIFF: F16_20100111-20110731_rad_v4.avg_vis.tif): The Operational Linescan System (OLS) flown on the Defense Meteorological Satellite…
Symmetric Chaos with Datashader and Numba
Map equation and coefficient values are taken from here. Some mathematical explanations can be found here, by Mike Field and Martin Golubitsky. import numpy as np import pandas as pd import datashader as ds from datashader import transfer_functions…
Plotting Hopalong attractor with Datashader and Numba
What is an attractor? Definition from wikipedia: In the mathematical field of dynamical systems, an attractor is a set of numerical values toward which a system tends to evolve, for a wide variety of starting conditions of the system. System values…
Looping over Pandas data
I recently stumbled on this interesting post on RealPython (excellent website by the way!): Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects This post has different subjects related to Pandas: creating a datetime column…
Pandas Time Series example with some historical land temperatures
Monthly averaged historical temperatures in France and over the global land surface The aim of this notebook is just to play with time series along with a couple of statistical and plotting libraries. Imports %matplotlib inline import pandas as pd…