Python

TPC-H benchmark of DuckDB and Hyper on native files

In this blog post, we examine the performance of two popular SQL engines for querying large files: Tableau Hyper (proprietary license) and DuckDB (MIT license). These engines have gained popularity due…


TPC-H benchmark of Hyper and DuckDB on Windows and Linux OS

Update Apr 12, 2023: It seems that Windows 11's poor performance may be due to conflicting BIOS/OS settings when dual-booting. We are investigating... Additionally, I have corrected the…


TPC-H benchmark of Hyper, DuckDB and DataFusion on Parquet files

Update Apr 14, 2023: An issue has been opened on the DataFusion GitHub repository regarding its poor reported performance compared to DuckDB and Hyper in this specific case: #5942.…


Dijkstra's algorithm in Cython, part 3/3

Running time of Dijkstra's algorithm on DIMACS networks with various implementations in Python. This post is the last part of a three-part series, following the first part and the second part. In the present post, we compare the…


Dijkstra's algorithm in Cython, part 2/3

This post is the second part of a three-part series. In the first part, we looked at the Cython implementation of Dijkstra's algorithm. In the current post, we are going to compare different priority queue…


Dijkstra's algorithm in Cython, part 1/3

In this post, we are going to present an implementation of Dijkstra's algorithm in Cython. Dijkstra's algorithm is a shortest path algorithm. It was conceived by Edsger W. Dijkstra in 1956, and published in…


A Cython implementation of a priority queue

Credit: Musée de l'illusion, Lyon [picture taken by myself]. In this post, we describe a basic Cython implementation of a priority queue. A priority…


Visualizing some polynomial roots with Datashader

Last weekend I found this interesting tweet by sara: The above figure shows all the complex roots from the various polynomials of degree 10 with coefficients in the set $\left\{ -1, 1…


Forward and reverse stars in Cython

This notebook is a follow-up to a previous one, where we looked at the forward and reverse star representations of a sparse directed graph in pure Python: Forward and reverse star representation of a digraph.…


Forward and reverse star representation of a digraph

In this Python notebook, we are going to focus on a graph representation of directed graphs: the forward star representation [and its opposite, the reverse star]. The motivation here is to…


Query Parquet files with DuckDB and Tableau Hyper engines

In this notebook, we are going to query some Parquet files with the following SQL engines: DuckDB: an in-process SQL OLAP database management system. We are going to use its Python Client…


Download some benchmark road networks for Shortest Paths algorithms

Updated September 26, 2022 (bugfix). The goal of this Python notebook is to download and prepare a suite of benchmark networks for some shortest path algorithms. We would like to…


Euler's number and the uniform sum distribution

Last year I stumbled upon this tweet from @fermatslibrary [1]: I find it a little bit intriguing for Euler's number $e$ to appear here! But actually, it is not uncommon to encounter $e$ in…


Testing DuckDB with Discogs data

This notebook is a small example of using DuckDB with the Python API. What is DuckDB? DuckDB is an in-process SQL OLAP Database Management System. It is a relational DBMS that supports SQL. OLAP stands for…


Dynamic TaskGroup instantiation in Airflow 2.0

TaskGroup feature in Airflow 2.0 - Dynamic creation. In this article we will uncover a way to use Airflow's new feature called TaskGroup, which allows you to manage your dependencies in a dynamic way. Many articles show you how to use them in a…


Dynamic TaskGroup Scalability in Airflow 2.0

Dynamic TaskGroup Scalability in Airflow 2.0 - Handle big DAGs. In the previous article I showed you how to instantiate TaskGroup in a dynamic way. We will now see how we face the challenge of using it at a larger scale. This is not an easy task: you…


Reading a SQL table by chunks with Pandas

In this short Python notebook, we want to load a table from a relational database and write it into a CSV file. In order to do that, we temporarily store the data in a Pandas dataframe. Pandas is used to load…


Airflow 2.0 - How it works

Airflow 2.0 & how it works: Scheduling and beyond. In recent years, Airflow has become a major player for scheduling a wide variety of actions. From running basic Python scripts to API calls, up to acting as an ETL tool, it can address nearly…


Loading data from CSV files into a Tableau Hyper extract

Hyper is Tableau’s in-memory data engine technology, designed for fast data ingest and analytical query processing on large or complex data sets. In the present notebook, we are going to…


More Heapsort in Cython

This post/notebook is the follow-up to a recent one: Heapsort with Numba and Cython, where we implemented heapsort in Python/Numba and Cython and compared the execution time with NumPy heapsort. However, heapsort in NumPy…


Loading data from PostgreSQL to Pandas with ConnectorX

ConnectorX is a library, written in Rust, that enables fast and memory-efficient data loading from various databases to different dataframes. We refer to this interesting paper, in which the…


Export data as fast as possible : from HANA to CSV

What is the fastest method to export HANA data (table or query result) to CSV? I use a HANA 2.0 database. I want to export a table or a SQL query result from the database to an external client as fast as possible, using a command line (I’m on…


Heapsort with Numba and Cython

Updated April 20, 2022 following useful comments by @scoder. Thank you very much for your pull requests! Heapsort is a classical sorting algorithm. We are going into a little bit of theory about the algorithm, but…


Using Tableau to detect outlying trends and ruptures

Python Tableau detects outlying trends using TabPy. To do that, we are going to get the outliers of a dataset, the change points and a piecewise approximation. For detailed information about the source code, you can look at this article: Outlier…


Plotting population density with datashader

In this short post, we are using the Global Human Settlement Layer from the European Commission: This spatial raster dataset depicts the distribution of population, expressed as the number of people per cell. The downloaded file has a worldwide…


Applying a row-wise function to a Pandas dataframe

More than 3 years ago, we posted a comparative study about Looping over Pandas data using a CPU. Because a lot has changed since 2018, this post is a kind of update. For example, the Pandas version was 0.23.3 at that time; it is now 1.4.0.…


A Parallel loop in Python with Joblib.Parallel

The goal of this post is to perform an embarrassingly parallel loop in Python, with the same code running on different platforms [Linux and Windows]. From Wikipedia, here is a definition of embarrassingly parallel: In parallel computing, an…


Python Spatial Join with GeoPandas (and GEOS)

Updated Sep 13, 2021. The purpose of this post is to perform an "efficient" spatial join in Python. What is a spatial join? Here is the definition from wiki.gis.com: A Spatial join is a GIS operation that affixes data from one feature layer’s…


Built-in Expectations in Great Expectations

Great Expectations is a Python tool for data testing, documentation, and profiling. Here is a figure from the documentation describing its purpose: Great Expectations makes it easy to include data testing in your ML pipeline, when dealing with…


Goldbach's Comet with Numba and Datashader

Updated Jul 29, 2021. This Python notebook is about computing and plotting the Goldbach function. It requires some basic mathematical knowledge, nothing fancy! The main point is to perform some computations with Numba and some efficient plotting with…


Le tour de france history web scraping

LeTour data set. This file downloads raw data about every rider of every Tour de France (from 1903 up to 2020). This data will then be postprocessed and stored in CSV format. Executing this notebook might take some minutes. 1) Retrieve URLs for data…


Some Pre-commit git hooks for Python

Pre-commit hooks are a great way to automatically check and clean the code. They are executed when committing changes to git. This can be useful when several people are working on the same package with different code styles, but also to help…


Quick data exploration with pandas, matplotlib and seaborn

In this JupyterLab Python notebook we are going to look at the rate of coronavirus [COVID-19] cases in French departments [administrative divisions of France]. The data source is the French government's open data. We are going to perform a few…


Optuna and XGBoost on a tabular dataset

Updated Sep 16, 2021 following a comment by @k_nzw about XGBoostPruningCallback. The purpose of this Python notebook is to give a simple example of hyperparameter optimization [HPO] using Optuna and XGBoost. We are going to perform a regression…


Saving a tf.keras model with data normalization

Training a DL model might take some time, and make use of some special hardware. So you may want to save the model for using it later, or on another computer. In this short Python notebook, we are going to create a very simple tensorflow.keras…


Benford's law and the population of French cities

In this Python notebook, we are going to look at Benford's law, which predicts the leading digit distribution, when dealing with some real-world collections of numbers. This distribution usually occurs when the numbers are rather smoothly…


Merge Sort with Cython and Numba

In this post, we present an implementation of the classic merge sort algorithm in Python on NumPy arrays, and make it run reasonably "fast" using Cython and Numba. We are going to compare the run time with the numpy.sort(kind='mergesort')…


Logistic regression with JAX

JAX is a Python package for automatic differentiation from Google Research. It is a really powerful and efficient library. JAX can automatically differentiate some Python code [supports the reverse- and forward-mode]. It can also speed up the…


Lunch break, ridge plots with Bokeh

Bokeh is a great Python visualization library. In this short post, we are going to use it to create a ridge plot. For that purpose, we use the COVID-19 death data from Johns Hopkins University, and plot the daily normalized death rate (100000 *…


Lunch break, fetching AROME temperature forecast in Lyon

Since a "small" heat wave is coming, I would like to get some temperature forecast for the next hours in my neighborhood, from my JupyterLab notebook. We are going to fetch the results from the Météo-France AROME 0.01 model. Here is the AROME item…


Outlier and Change Point Detection in Data Integration Performance Metrics

Data integration involves combining data residing in different sources, and providing users with a unified view of them. In this post, we are interested in detecting performance drift of large and complex daily data integration processes performed…


Lunch break, plotting excess death in French department zones with Python

Daily deaths data are provided by INSEE, the national institute of statistics and economic studies. Here is the link to the page displaying these data, and here is a short description: During the Covid-19 pandemic, INSEE is reporting the number of…


A Quick study of air quality in Lyon with Python

The aim of this post is to use Python to fetch air quality data from a web service and to create a few plots. We are going to: look at some daily data from earlier this year, before and after the lockdown, in the city of Lyon, France; compare 2020…


Fitting a logistic curve to time series in Python

In this notebook we are going to fit a logistic curve to time series stored in Pandas, using a simple linear regression from scikit-learn to find the coefficients of the logistic curve. Disclaimer: although we are going to use some COVID-19 data in…


Cython and Numba applied to a simple algorithm: Insertion sort

The aim of this notebook is to show a basic example of Cython and Numba, applied to a simple algorithm: Insertion sort. As we will see, the code transformation from Python to Cython or Python to Numba can be really easy [specifically for the…


Lunch break: plotting traffic injuries with datashader

Well, I love the datashader Python package and I am always happy to use it on some interesting datasets. I recently came across a traffic injury database for French roads, which happens to have some geographical coordinates. This comes from the…


Fetching AROME weather forecasts and plotting temperatures

Accurate weather forecasts might be very useful for various types of models. In this post, we are going to download the latest available weather forecasts for France and plot some temperature fields, using different Python libraries: Arome is…


Some cool open-source Python packages for Machine Learning Ep 4

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following these ones: Some cool open-source Python packages…


Some cool open-source Python packages for Machine Learning Ep 3

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following these ones: Some cool open-source Python packages…


First try of auto-sklearn

Since we are big users of scikit-learn and XGBoost, we wanted to try a package that would automate the process of building a machine learning model with these tools. Here is the introduction to auto-sklearn from its github.io website:…


Some cool open-source Python packages for Machine Learning Ep 2

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some “active”, open-source packages that may be useful for ML day-to-day activities. This post is following that one: Some cool open-source Python packages for…


Loading data into a Pandas DataFrame - a performance study

Because doing machine learning implies trying many options and algorithms with different parameters, from data cleaning to model validation, Python programmers will often load a full dataset into a Pandas dataframe, without actually…


Some cool open-source Python packages for Machine Learning Ep 1

There is a very rich ecosystem of Python libraries related to ML. Here is a list of some "active", open-source packages that may be useful for ML day-to-day activities. Of course, this list is far from being exhaustive and should evolve as fast…


GPU Analytics Ep 3, Apply a function to the rows of a dataframe

The goal of this post is to compare the execution time between Pandas (CPU) and RAPIDS (GPU) dataframes, when applying a simple mathematical function to the rows of a dataframe. Since the row-wise applied function is a re-projection of geographical…


Nighttime Lights with Rasterio and Datashader

In this post, we are going to plot some satellite GeoTIFF data in Python. The data is provided by NOAA (single GeoTIFF: F16_20100111-20110731_rad_v4.avg_vis.tif): The Operational Linescan System (OLS) flown on the Defense Meteorological Satellite…


Symmetric Chaos with Datashader and Numba

Map equation and coefficient values are taken from here. Some mathematical explanations can be found here, by Mike Field and Martin Golubitsky. import numpy as np import pandas as pd import datashader as ds from datashader import transfer_functions…


Plotting Hopalong attractor with Datashader and Numba

What is an attractor? Definition from Wikipedia: In the mathematical field of dynamical systems, an attractor is a set of numerical values toward which a system tends to evolve, for a wide variety of starting conditions of the system. System values…


Looping over Pandas data

I recently stumbled on this interesting post on RealPython (excellent website by the way!): Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects. This post has different subjects related to Pandas: creating a datetime column…


Pandas Time Series example with some historical land temperatures

Monthly averaged historical temperatures in France and over the global land surface. The aim of this notebook is just to play with time series along with a couple of statistical and plotting libraries. Imports %matplotlib inline import pandas as pd…