I was very interested in 2 articles of Mimoune Djouallah (aka mim) that compare Snowflake, BigQuery, Databricks, SingleStore, PowerBI Datamart and DuckDB using tpch benckmark with different scale factor (TPCH-SF10 then SF100)

TPCH benchmark is a world reference test that use 8 tables around a « normalized » star schema. The Scale Factor (1,10,100, 1000…) determine the size of the database mostly determined by the 2 fact tables : orders and lineitem.

The TPCH-SF10 will give you :

  • lineitem : 60M rows
  • orders : 15M rows
  • part : 2M rows
  • customer : 1.5MRows
  • partsupp : 8M rows
  • supplier : 100K rows
  • nation + region : 30 rows

Mimoune Djouallah have saved a link to source data in parquet files and tpch queries in a github repo, so it is easy to reproduce the bench. Great thanks !

I decided to test in first Duckdb on my laptop and after on a more recent hardware.

My laptop hardware and software

Hardware

  • CPU : i7-9750H
  • RAM : 32GB DDR4
  • Storage : M2-Nvme Samsung 970 pro (PCI3.0)

Software :

  • OS : Windows 10
  • DuckDB CLI 0.7.1

Data loading

I made 2 tests :

  1. I load the data into a duckdb database (as opposed to mimoune that use directly parquet files in his test through sql views) and use sort
  2. I use the mim parquet files as source

 

create schema tpch;
create table tpch.region as (select * from 'D:\OpenData\TPCH\data\10\region.parquet');
create table tpch.nation as (select * from 'D:\OpenData\TPCH\data\10\nation.parquet');
create table tpch.part as (select * from 'D:\OpenData\TPCH\data\10\part.parquet');
create table tpch.supplier as (select * from 'D:\OpenData\TPCH\data\10\supplier.parquet');
create table tpch.partsupp as (select * from 'D:\OpenData\TPCH\data\10\partsupp.parquet');
create table tpch.orders as (select * from 'D:\OpenData\TPCH\data\10\orders.parquet' order by 1,2);
create table tpch.lineitem as (select * from 'D:\OpenData\TPCH\data\10\lineitem.parquet' order by 1);
CREATE SCHEMA parquetviews;
CREATE VIEW parquetviews.partsupp AS SELECT * FROM 'D:\OpenData\TPCH\data\10\partsupp.parquet';
CREATE VIEW parquetviews.part AS SELECT * FROM 'D:\OpenData\TPCH\data\10\part.parquet';
CREATE VIEW parquetviews.supplier AS SELECT * FROM 'D:\OpenData\TPCH\data\10\supplier.parquet';
CREATE VIEW parquetviews.nation AS SELECT * FROM 'D:\OpenData\TPCH\data\10\nation.parquet';
CREATE VIEW parquetviews.region AS SELECT * FROM 'D:\OpenData\TPCH\data\10\region.parquet';
CREATE VIEW parquetviews.lineitem AS SELECT * FROM 'D:\OpenData\TPCH\data\10\lineitem.parquet';
CREATE VIEW parquetviews.orders AS SELECT * FROM 'D:\OpenData\TPCH\data\10\orders.parquet';
CREATE VIEW parquetviews.customer AS SELECT * FROM 'D:\OpenData\TPCH\data\10\customer.parquet';

The test with my laptop

For my test I use the CLI (mimoune use the python interface of DuckDB).

During the test :

  • My laptop was plugged
  • No thermal throttling occurs.

It is important because in other cases the cpu speed can be divided by 2.

The results

I used the same queries than Mimoune and use the profiling option of duckdb to return each query runtime in a json file.

I use duckdb to retrieve the results 🙂

SELECT 
    'DuckDB' tested_database, 
    'TPCH-SF10' test,
    regexp_extract("extra-info",'(Query[0-9]+)') query, 
    sum(result) result
FROM read_json('.\*.json',auto_detect=TRUE) 
GROUP BY GROUPING SETS ((query),())
ORDER BY query;

Results with a duckdb storage

┌─────────────────┬───────────┬─────────┬───────────┐
│ tested_database │   test    │  query  │  result   │
│     varchar     │  varchar  │ varchar │  double   │
├─────────────────┼───────────┼─────────┼───────────┤
│ DuckDB          │ TPCH-SF10 │         │ 40.718909 │
│ DuckDB          │ TPCH-SF10 │ Query01 │  1.677107 │
│ DuckDB          │ TPCH-SF10 │ Query02 │   0.25911 │
│ DuckDB          │ TPCH-SF10 │ Query03 │  0.784943 │
│ DuckDB          │ TPCH-SF10 │ Query04 │  0.706406 │
│ DuckDB          │ TPCH-SF10 │ Query05 │  0.832699 │
│ DuckDB          │ TPCH-SF10 │ Query06 │  0.466226 │
│ DuckDB          │ TPCH-SF10 │ Query07 │  2.733678 │
│ DuckDB          │ TPCH-SF10 │ Query08 │  0.717187 │
│ DuckDB          │ TPCH-SF10 │ Query09 │ 10.821196 │
│ DuckDB          │ TPCH-SF10 │ Query10 │  1.764814 │
│ DuckDB          │ TPCH-SF10 │ Query11 │  0.108359 │
│ DuckDB          │ TPCH-SF10 │ Query12 │  0.577429 │
│ DuckDB          │ TPCH-SF10 │ Query13 │  3.714737 │
│ DuckDB          │ TPCH-SF10 │ Query14 │   0.55995 │
│ DuckDB          │ TPCH-SF10 │ Query15 │  1.508817 │
│ DuckDB          │ TPCH-SF10 │ Query16 │  0.511634 │
│ DuckDB          │ TPCH-SF10 │ Query17 │  0.777818 │
│ DuckDB          │ TPCH-SF10 │ Query18 │  6.415835 │
│ DuckDB          │ TPCH-SF10 │ Query19 │    1.3254 │
│ DuckDB          │ TPCH-SF10 │ Query20 │  0.427272 │
│ DuckDB          │ TPCH-SF10 │ Query21 │  3.471633 │
│ DuckDB          │ TPCH-SF10 │ Query22 │  0.556659 │
├─────────────────┴───────────┴─────────┴───────────┤
│ 23 rows                                 4 columns │
└───────────────────────────────────────────────────┘

Results with mim parquet files as source

┌─────────────────┬───────────┬─────────┬────────────────────┐
│ tested_database │   test    │  query  │       result       │
│     varchar     │  varchar  │ varchar │       double       │
├─────────────────┼───────────┼─────────┼────────────────────┤
│ DuckDB          │ TPCH-SF10 │         │ 57.636711000000005 │
│ DuckDB          │ TPCH-SF10 │ Query01 │           2.900287 │
│ DuckDB          │ TPCH-SF10 │ Query02 │           0.543852 │
│ DuckDB          │ TPCH-SF10 │ Query03 │           1.775366 │
│ DuckDB          │ TPCH-SF10 │ Query04 │           1.564374 │
│ DuckDB          │ TPCH-SF10 │ Query05 │           1.759468 │
│ DuckDB          │ TPCH-SF10 │ Query06 │           1.415111 │
│ DuckDB          │ TPCH-SF10 │ Query07 │           3.884999 │
│ DuckDB          │ TPCH-SF10 │ Query08 │           2.180971 │
│ DuckDB          │ TPCH-SF10 │ Query09 │          10.586766 │
│ DuckDB          │ TPCH-SF10 │ Query10 │           2.291321 │
│ DuckDB          │ TPCH-SF10 │ Query11 │           0.324976 │
│ DuckDB          │ TPCH-SF10 │ Query12 │           2.139106 │
│ DuckDB          │ TPCH-SF10 │ Query13 │           2.170702 │
│ DuckDB          │ TPCH-SF10 │ Query14 │           1.584083 │
│ DuckDB          │ TPCH-SF10 │ Query15 │           3.157587 │
│ DuckDB          │ TPCH-SF10 │ Query16 │            0.53972 │
│ DuckDB          │ TPCH-SF10 │ Query17 │           2.051124 │
│ DuckDB          │ TPCH-SF10 │ Query18 │           5.567497 │
│ DuckDB          │ TPCH-SF10 │ Query19 │           2.740431 │
│ DuckDB          │ TPCH-SF10 │ Query20 │           1.814404 │
│ DuckDB          │ TPCH-SF10 │ Query21 │           6.074643 │
│ DuckDB          │ TPCH-SF10 │ Query22 │           0.569923 │
├─────────────────┴───────────┴─────────┴────────────────────┤
│ 23 rows                                          4 columns │
└────────────────────────────────────────────────────────────┘

A more modern laptop hardware (2022)

Hardware

  • CPU : AMD Ryzen 7 5700U
  • RAM : 16GB DDR4
  • Storage : M2-Nvme (PCI4.0)

Software :

  • OS : Windows 10
  • DuckDB CLI 0.7.1

Results with a duckdb storage

┌─────────────────┬───────────┬─────────┬───────────┐
│ tested_database │   test    │  query  │  result   │
│     varchar     │  varchar  │ varchar │  double   │
├─────────────────┼───────────┼─────────┼───────────┤
│ DuckDB          │ TPCH-SF10 │         │ 27.987798 │
│ DuckDB          │ TPCH-SF10 │ Query01 │  2.389645 │
│ DuckDB          │ TPCH-SF10 │ Query02 │   0.36502 │
│ DuckDB          │ TPCH-SF10 │ Query03 │  0.611162 │
│ DuckDB          │ TPCH-SF10 │ Query04 │  1.455928 │
│ DuckDB          │ TPCH-SF10 │ Query05 │  0.581546 │
│ DuckDB          │ TPCH-SF10 │ Query06 │  0.258996 │
│ DuckDB          │ TPCH-SF10 │ Query07 │  1.602176 │
│ DuckDB          │ TPCH-SF10 │ Query08 │  0.527786 │
│ DuckDB          │ TPCH-SF10 │ Query09 │  6.819389 │
│ DuckDB          │ TPCH-SF10 │ Query10 │  1.086569 │
│ DuckDB          │ TPCH-SF10 │ Query11 │  0.143364 │
│ DuckDB          │ TPCH-SF10 │ Query12 │  0.476493 │
│ DuckDB          │ TPCH-SF10 │ Query13 │  1.714431 │
│ DuckDB          │ TPCH-SF10 │ Query14 │  0.292762 │
│ DuckDB          │ TPCH-SF10 │ Query15 │  0.766771 │
│ DuckDB          │ TPCH-SF10 │ Query16 │  0.403847 │
│ DuckDB          │ TPCH-SF10 │ Query17 │  0.354821 │
│ DuckDB          │ TPCH-SF10 │ Query18 │  4.045392 │
│ DuckDB          │ TPCH-SF10 │ Query19 │  0.814581 │
│ DuckDB          │ TPCH-SF10 │ Query20 │  0.364441 │
│ DuckDB          │ TPCH-SF10 │ Query21 │  2.532056 │
│ DuckDB          │ TPCH-SF10 │ Query22 │  0.380622 │
├─────────────────┴───────────┴─────────┴───────────┤
│ 23 rows                                 4 columns │
└───────────────────────────────────────────────────┘

Results with mim parquet files as source

┌─────────────────┬───────────┬─────────┬────────────────────┐
│ tested_database │   test    │  query  │       result       │
│     varchar     │  varchar  │ varchar │       double       │
├─────────────────┼───────────┼─────────┼────────────────────┤
│ DuckDB          │ TPCH-SF10 │         │ 50.005154000000005 │
│ DuckDB          │ TPCH-SF10 │ Query01 │           2.805503 │
│ DuckDB          │ TPCH-SF10 │ Query02 │           0.514466 │
│ DuckDB          │ TPCH-SF10 │ Query03 │           1.803213 │
│ DuckDB          │ TPCH-SF10 │ Query04 │           1.466071 │
│ DuckDB          │ TPCH-SF10 │ Query05 │           1.676453 │
│ DuckDB          │ TPCH-SF10 │ Query06 │           1.178159 │
│ DuckDB          │ TPCH-SF10 │ Query07 │           3.305139 │
│ DuckDB          │ TPCH-SF10 │ Query08 │           1.880564 │
│ DuckDB          │ TPCH-SF10 │ Query09 │           9.190401 │
│ DuckDB          │ TPCH-SF10 │ Query10 │           1.708687 │
│ DuckDB          │ TPCH-SF10 │ Query11 │           0.240704 │
│ DuckDB          │ TPCH-SF10 │ Query12 │           1.687026 │
│ DuckDB          │ TPCH-SF10 │ Query13 │            1.97579 │
│ DuckDB          │ TPCH-SF10 │ Query14 │           1.197712 │
│ DuckDB          │ TPCH-SF10 │ Query15 │           2.432435 │
│ DuckDB          │ TPCH-SF10 │ Query16 │           0.416182 │
│ DuckDB          │ TPCH-SF10 │ Query17 │            1.56092 │
│ DuckDB          │ TPCH-SF10 │ Query18 │           4.862814 │
│ DuckDB          │ TPCH-SF10 │ Query19 │           2.142431 │
│ DuckDB          │ TPCH-SF10 │ Query20 │           1.373759 │
│ DuckDB          │ TPCH-SF10 │ Query21 │           6.064724 │
│ DuckDB          │ TPCH-SF10 │ Query22 │           0.522001 │
├─────────────────┴───────────┴─────────┴────────────────────┤
│ 23 rows                                          4 columns │
└────────────────────────────────────────────────────────────┘

 

 

.

The Mimoune Djouallah results with my complement :

TPCH-SF10-DuckDB_vs_Cloud_Databases

Conclusion

I don’t really know if the quite important performance improvement comes from :

  • hardware (surely)
  • storage (duckdb database vs parquet)
  • test conditions (thermal throttling or not…)
  • data sorting (probably)

but at the end, using DuckDB, my laptop was able to surpass Databrick ! and a more recent PC (less than 1000€) is faster than BigQuery, SingleStore and Databrick.

Of course this is not so simple :

  • DuckDB storage is not in a stable version : it should become stable in V1.0 and he current version is v0.7.1 : some months to wait.
  • DuckDB is designed for local usage and Databricks/BigQuery/SingleStore provide a multi-users experience and much greater ingestion capabilities. But sometimes, no need to use a cluster to deal with 5GB (or even 50GB) of data.
  • May be I biased the test by sorting data but this is a common practice in real life to improve performance. So why not ?

Next step : ask François to test DuckDB with a more recent hardware (i9/DDR5/PCI4) on a linux box 🙂