SQL

Parquet file sorting test

Update Nov 17, 2023 - Added results using the latest DataFusion version. Some time ago, we came across an intriguing Parquet sorting test shared by Mimoune Djouallah on Twitter @mim_djo. The test involves reading a Parquet file, sorting the table,…


Vector similarity search with pgvector

In the realm of vector databases, pgvector emerges as a noteworthy open-source extension tailored for Postgres databases. This extension equips Postgres with the capability to efficiently perform vector similarity searches, a powerful technique…


SQL Server vector search

Some time ago, I read a blog article by Davide Mauri, a senior Microsoft Azure programmer. The article interested me because the current "semantic search" natively available in SQL Server since SQL Server 2016 is not widely used and honestly…


TPC-H benchmark of DuckDB and Hyper on native files

In this blog post, we examine the performance of two popular SQL engines for querying large files: Tableau Hyper (proprietary license) and DuckDB (MIT license). These engines have gained popularity due to their efficiency, ease of use, and Python APIs.…


TPC-H benchmark of Hyper and DuckDB on Windows and Linux OS

Update Apr 12, 2023 - It seems that Windows 11's poor performance may be due to conflicting BIOS/OS settings when dual-booting. We are investigating... Additionally, I have corrected the version of Windows 11 in the post from Home to Professional.…


TPC-H benchmark of Hyper, DuckDB and DataFusion on Parquet files

Update Apr 14, 2023 - An issue has been opened on the DataFusion GitHub repository regarding its poor reported performance compared to DuckDB and Hyper in this specific case: #5942. While there may be multiple factors contributing to this unexpected…


TPCH SF10 : Query 13 and SQL Server Collations Performance Impact

After benchmarking several cloud databases (Snowflake, BigQuery, SingleStore, Databricks) using TPCH SF10 data, and after benchmarking DuckDB and Tableau Hyper on my own machine, I asked myself: "Hey, why not test using the official SMP databases…


Query Parquet files with DuckDB and Tableau Hyper engines

In this notebook, we are going to query some Parquet files with the following SQL engines: DuckDB, an in-process SQL OLAP database management system, used here through its Python client API [MIT license]; Tableau Hyper, an in-memory data…


Testing DuckDB performance with Discogs data

This notebook is a small example of using DuckDB with the Python API. What is DuckDB? DuckDB is an in-process SQL OLAP database management system. It is a relational DBMS that supports SQL. OLAP stands for online analytical processing,…


Reading a SQL table by chunks with Pandas

In this short Python notebook, we want to load a table from a relational database and write it into a CSV file. In order to do that, we temporarily store the data in a Pandas dataframe. Pandas is used to load the data with read_sql() and later to…


150 T-SQL Bad Practices

Many times I have been asked for T-SQL best practices, but I don't like best-practice advice. It can become outdated faster than you think, and it can be good advice in one case and very bad advice in another. T-SQL Bad…


SQL Server CLR Functions vs SQL 2019 Function Inlining

Boost your SQL Server performance by using compiled C# CLR functions right inside your favorite RDBMS instead of classic SQL functions! This tutorial shows all the steps to: write a…


GPU Analytics Ep 2, Load some data from OmniSci into a GPU dataframe

Although the post title is about loading some data from a GPU database into a GPU dataframe, most of it is about running JupyterLab on a GPU AWS instance, which is a little bit cumbersome to set up. Finally, once JupyterLab is running on our…


GPU Analytics Ep 1, GPU installation of OmniSci on AWS

In this post, we are going to install the OmniSci 4.6 GPU database on an Ubuntu 18.04 AWS instance. These are the actual command lines I entered when performing the installation. But let's start by introducing the motivation behind GPU databases:…