I was very interested in two articles by Mimoune Djouallah (aka mim) that compare Snowflake, BigQuery, Databricks, SingleStore, Power BI Datamart and DuckDB using the TPCH benchmark at different scale factors (TPCH-SF10, then SF100).
The TPCH benchmark is an industry-standard test that uses 8 tables arranged in a "normalized" star schema. The scale factor (1, 10, 100, 1000…) determines the size of the database, which is mostly driven by the two fact tables: orders and lineitem.
The TPCH-SF10 will give you :
- lineitem : 60M rows
- orders : 15M rows
- part : 2M rows
- customer : 1.5M rows
- partsupp : 8M rows
- supplier : 100K rows
- nation + region : 30 rows
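As a side note, if you prefer not to download the source files, DuckDB ships a tpch extension that can generate the same schema and data locally. A minimal sketch (generating SF10 takes a while and several GB of disk/RAM):

```sql
-- Generate the TPC-H tables directly inside DuckDB
INSTALL tpch;
LOAD tpch;
CALL dbgen(sf = 10);  -- creates the 8 tables (lineitem, orders, part, ...)

-- Sanity check: lineitem should hold ~60M rows at SF10
SELECT count(*) AS lineitem_rows FROM lineitem;
```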
Mimoune Djouallah has shared a link to the source data (Parquet files) and the TPCH queries in a GitHub repo, so the benchmark is easy to reproduce. Many thanks!
I decided to test DuckDB first on my laptop, then on more recent hardware.
My laptop hardware and software
Hardware
- CPU : i7-9750H
- RAM : 32GB DDR4
- Storage : M.2 NVMe Samsung 970 Pro (PCIe 3.0)
Software :
- OS : Windows 10
- DuckDB CLI 0.7.1
Data loading
I ran two tests:
- loading the data into a DuckDB database (unlike Mimoune, who queried the Parquet files directly through SQL views), with the fact tables sorted on load
- using mim's Parquet files directly as the source
CREATE SCHEMA tpch;
CREATE TABLE tpch.region AS (SELECT * FROM 'D:\OpenData\TPCH\data\10\region.parquet');
CREATE TABLE tpch.nation AS (SELECT * FROM 'D:\OpenData\TPCH\data\10\nation.parquet');
CREATE TABLE tpch.part AS (SELECT * FROM 'D:\OpenData\TPCH\data\10\part.parquet');
CREATE TABLE tpch.supplier AS (SELECT * FROM 'D:\OpenData\TPCH\data\10\supplier.parquet');
CREATE TABLE tpch.customer AS (SELECT * FROM 'D:\OpenData\TPCH\data\10\customer.parquet');
CREATE TABLE tpch.partsupp AS (SELECT * FROM 'D:\OpenData\TPCH\data\10\partsupp.parquet');
-- the two fact tables are sorted on load
CREATE TABLE tpch.orders AS (SELECT * FROM 'D:\OpenData\TPCH\data\10\orders.parquet' ORDER BY 1, 2);
CREATE TABLE tpch.lineitem AS (SELECT * FROM 'D:\OpenData\TPCH\data\10\lineitem.parquet' ORDER BY 1);

CREATE SCHEMA parquetviews;
CREATE VIEW parquetviews.region AS SELECT * FROM 'D:\OpenData\TPCH\data\10\region.parquet';
CREATE VIEW parquetviews.nation AS SELECT * FROM 'D:\OpenData\TPCH\data\10\nation.parquet';
CREATE VIEW parquetviews.part AS SELECT * FROM 'D:\OpenData\TPCH\data\10\part.parquet';
CREATE VIEW parquetviews.supplier AS SELECT * FROM 'D:\OpenData\TPCH\data\10\supplier.parquet';
CREATE VIEW parquetviews.customer AS SELECT * FROM 'D:\OpenData\TPCH\data\10\customer.parquet';
CREATE VIEW parquetviews.partsupp AS SELECT * FROM 'D:\OpenData\TPCH\data\10\partsupp.parquet';
CREATE VIEW parquetviews.orders AS SELECT * FROM 'D:\OpenData\TPCH\data\10\orders.parquet';
CREATE VIEW parquetviews.lineitem AS SELECT * FROM 'D:\OpenData\TPCH\data\10\lineitem.parquet';
The test with my laptop
For my tests I used the CLI (Mimoune used DuckDB's Python interface).
During the test :
- my laptop was plugged in
- no thermal throttling occurred
This matters because under thermal throttling the CPU clock speed can be halved.
The results
I used the same queries as Mimoune, together with DuckDB's profiling option to write each query's runtime to a JSON file.
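Concretely, profiling is enabled per session before running the queries. A sketch using the pragma names documented for DuckDB 0.7.x (the output file name here is just an example):

```sql
-- Write a JSON profile of each subsequent query to the given file
PRAGMA enable_profiling = 'json';
PRAGMA profiling_output = 'Query01.json';

-- Run the benchmark query here; its runtime lands in Query01.json.
-- Change profiling_output before each query to keep one file per query.
```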
I then used DuckDB itself to aggregate the results 🙂
SELECT
    'DuckDB' AS tested_database,
    'TPCH-SF10' AS test,
    regexp_extract("extra-info", '(Query[0-9]+)') AS query,
    sum(result) AS result
FROM read_json('.\*.json', auto_detect = TRUE)
GROUP BY GROUPING SETS ((query), ())
ORDER BY query;
Results with a duckdb storage
┌─────────────────┬───────────┬─────────┬───────────┐
│ tested_database │   test    │  query  │  result   │
│     varchar     │  varchar  │ varchar │  double   │
├─────────────────┼───────────┼─────────┼───────────┤
│ DuckDB          │ TPCH-SF10 │         │ 40.718909 │
│ DuckDB          │ TPCH-SF10 │ Query01 │  1.677107 │
│ DuckDB          │ TPCH-SF10 │ Query02 │   0.25911 │
│ DuckDB          │ TPCH-SF10 │ Query03 │  0.784943 │
│ DuckDB          │ TPCH-SF10 │ Query04 │  0.706406 │
│ DuckDB          │ TPCH-SF10 │ Query05 │  0.832699 │
│ DuckDB          │ TPCH-SF10 │ Query06 │  0.466226 │
│ DuckDB          │ TPCH-SF10 │ Query07 │  2.733678 │
│ DuckDB          │ TPCH-SF10 │ Query08 │  0.717187 │
│ DuckDB          │ TPCH-SF10 │ Query09 │ 10.821196 │
│ DuckDB          │ TPCH-SF10 │ Query10 │  1.764814 │
│ DuckDB          │ TPCH-SF10 │ Query11 │  0.108359 │
│ DuckDB          │ TPCH-SF10 │ Query12 │  0.577429 │
│ DuckDB          │ TPCH-SF10 │ Query13 │  3.714737 │
│ DuckDB          │ TPCH-SF10 │ Query14 │   0.55995 │
│ DuckDB          │ TPCH-SF10 │ Query15 │  1.508817 │
│ DuckDB          │ TPCH-SF10 │ Query16 │  0.511634 │
│ DuckDB          │ TPCH-SF10 │ Query17 │  0.777818 │
│ DuckDB          │ TPCH-SF10 │ Query18 │  6.415835 │
│ DuckDB          │ TPCH-SF10 │ Query19 │    1.3254 │
│ DuckDB          │ TPCH-SF10 │ Query20 │  0.427272 │
│ DuckDB          │ TPCH-SF10 │ Query21 │  3.471633 │
│ DuckDB          │ TPCH-SF10 │ Query22 │  0.556659 │
├─────────────────┴───────────┴─────────┴───────────┤
│ 23 rows                                 4 columns │
└───────────────────────────────────────────────────┘
Results with mim parquet files as source
┌─────────────────┬───────────┬─────────┬────────────────────┐
│ tested_database │   test    │  query  │       result       │
│     varchar     │  varchar  │ varchar │       double       │
├─────────────────┼───────────┼─────────┼────────────────────┤
│ DuckDB          │ TPCH-SF10 │         │ 57.636711000000005 │
│ DuckDB          │ TPCH-SF10 │ Query01 │           2.900287 │
│ DuckDB          │ TPCH-SF10 │ Query02 │           0.543852 │
│ DuckDB          │ TPCH-SF10 │ Query03 │           1.775366 │
│ DuckDB          │ TPCH-SF10 │ Query04 │           1.564374 │
│ DuckDB          │ TPCH-SF10 │ Query05 │           1.759468 │
│ DuckDB          │ TPCH-SF10 │ Query06 │           1.415111 │
│ DuckDB          │ TPCH-SF10 │ Query07 │           3.884999 │
│ DuckDB          │ TPCH-SF10 │ Query08 │           2.180971 │
│ DuckDB          │ TPCH-SF10 │ Query09 │          10.586766 │
│ DuckDB          │ TPCH-SF10 │ Query10 │           2.291321 │
│ DuckDB          │ TPCH-SF10 │ Query11 │           0.324976 │
│ DuckDB          │ TPCH-SF10 │ Query12 │           2.139106 │
│ DuckDB          │ TPCH-SF10 │ Query13 │           2.170702 │
│ DuckDB          │ TPCH-SF10 │ Query14 │           1.584083 │
│ DuckDB          │ TPCH-SF10 │ Query15 │           3.157587 │
│ DuckDB          │ TPCH-SF10 │ Query16 │            0.53972 │
│ DuckDB          │ TPCH-SF10 │ Query17 │           2.051124 │
│ DuckDB          │ TPCH-SF10 │ Query18 │           5.567497 │
│ DuckDB          │ TPCH-SF10 │ Query19 │           2.740431 │
│ DuckDB          │ TPCH-SF10 │ Query20 │           1.814404 │
│ DuckDB          │ TPCH-SF10 │ Query21 │           6.074643 │
│ DuckDB          │ TPCH-SF10 │ Query22 │           0.569923 │
├─────────────────┴───────────┴─────────┴────────────────────┤
│ 23 rows                                          4 columns │
└────────────────────────────────────────────────────────────┘
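Incidentally, the same read_json trick can compare the two runs side by side. A sketch, assuming each run's profiling JSON files were written to separate folders (the db\ and parquet\ paths here are hypothetical):

```sql
-- Per-query comparison of the DuckDB-storage run vs the Parquet run
WITH db AS (
    SELECT regexp_extract("extra-info", '(Query[0-9]+)') AS query,
           sum(result) AS db_time
    FROM read_json('.\db\*.json', auto_detect = TRUE)
    GROUP BY 1
), pq AS (
    SELECT regexp_extract("extra-info", '(Query[0-9]+)') AS query,
           sum(result) AS parquet_time
    FROM read_json('.\parquet\*.json', auto_detect = TRUE)
    GROUP BY 1
)
SELECT query, db_time, parquet_time,
       round(parquet_time / db_time, 2) AS speedup
FROM db JOIN pq USING (query)
ORDER BY query;
```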
A more modern laptop (2022)
Hardware
- CPU : AMD Ryzen 7 5700U
- RAM : 16GB DDR4
- Storage : M.2 NVMe (PCIe 4.0)
Software :
- OS : Windows 10
- DuckDB CLI 0.7.1
Results with a duckdb storage
┌─────────────────┬───────────┬─────────┬───────────┐
│ tested_database │   test    │  query  │  result   │
│     varchar     │  varchar  │ varchar │  double   │
├─────────────────┼───────────┼─────────┼───────────┤
│ DuckDB          │ TPCH-SF10 │         │ 27.987798 │
│ DuckDB          │ TPCH-SF10 │ Query01 │  2.389645 │
│ DuckDB          │ TPCH-SF10 │ Query02 │   0.36502 │
│ DuckDB          │ TPCH-SF10 │ Query03 │  0.611162 │
│ DuckDB          │ TPCH-SF10 │ Query04 │  1.455928 │
│ DuckDB          │ TPCH-SF10 │ Query05 │  0.581546 │
│ DuckDB          │ TPCH-SF10 │ Query06 │  0.258996 │
│ DuckDB          │ TPCH-SF10 │ Query07 │  1.602176 │
│ DuckDB          │ TPCH-SF10 │ Query08 │  0.527786 │
│ DuckDB          │ TPCH-SF10 │ Query09 │  6.819389 │
│ DuckDB          │ TPCH-SF10 │ Query10 │  1.086569 │
│ DuckDB          │ TPCH-SF10 │ Query11 │  0.143364 │
│ DuckDB          │ TPCH-SF10 │ Query12 │  0.476493 │
│ DuckDB          │ TPCH-SF10 │ Query13 │  1.714431 │
│ DuckDB          │ TPCH-SF10 │ Query14 │  0.292762 │
│ DuckDB          │ TPCH-SF10 │ Query15 │  0.766771 │
│ DuckDB          │ TPCH-SF10 │ Query16 │  0.403847 │
│ DuckDB          │ TPCH-SF10 │ Query17 │  0.354821 │
│ DuckDB          │ TPCH-SF10 │ Query18 │  4.045392 │
│ DuckDB          │ TPCH-SF10 │ Query19 │  0.814581 │
│ DuckDB          │ TPCH-SF10 │ Query20 │  0.364441 │
│ DuckDB          │ TPCH-SF10 │ Query21 │  2.532056 │
│ DuckDB          │ TPCH-SF10 │ Query22 │  0.380622 │
├─────────────────┴───────────┴─────────┴───────────┤
│ 23 rows                                 4 columns │
└───────────────────────────────────────────────────┘
Results with mim parquet files as source
┌─────────────────┬───────────┬─────────┬────────────────────┐
│ tested_database │   test    │  query  │       result       │
│     varchar     │  varchar  │ varchar │       double       │
├─────────────────┼───────────┼─────────┼────────────────────┤
│ DuckDB          │ TPCH-SF10 │         │ 50.005154000000005 │
│ DuckDB          │ TPCH-SF10 │ Query01 │           2.805503 │
│ DuckDB          │ TPCH-SF10 │ Query02 │           0.514466 │
│ DuckDB          │ TPCH-SF10 │ Query03 │           1.803213 │
│ DuckDB          │ TPCH-SF10 │ Query04 │           1.466071 │
│ DuckDB          │ TPCH-SF10 │ Query05 │           1.676453 │
│ DuckDB          │ TPCH-SF10 │ Query06 │           1.178159 │
│ DuckDB          │ TPCH-SF10 │ Query07 │           3.305139 │
│ DuckDB          │ TPCH-SF10 │ Query08 │           1.880564 │
│ DuckDB          │ TPCH-SF10 │ Query09 │           9.190401 │
│ DuckDB          │ TPCH-SF10 │ Query10 │           1.708687 │
│ DuckDB          │ TPCH-SF10 │ Query11 │           0.240704 │
│ DuckDB          │ TPCH-SF10 │ Query12 │           1.687026 │
│ DuckDB          │ TPCH-SF10 │ Query13 │            1.97579 │
│ DuckDB          │ TPCH-SF10 │ Query14 │           1.197712 │
│ DuckDB          │ TPCH-SF10 │ Query15 │           2.432435 │
│ DuckDB          │ TPCH-SF10 │ Query16 │           0.416182 │
│ DuckDB          │ TPCH-SF10 │ Query17 │            1.56092 │
│ DuckDB          │ TPCH-SF10 │ Query18 │           4.862814 │
│ DuckDB          │ TPCH-SF10 │ Query19 │           2.142431 │
│ DuckDB          │ TPCH-SF10 │ Query20 │           1.373759 │
│ DuckDB          │ TPCH-SF10 │ Query21 │           6.064724 │
│ DuckDB          │ TPCH-SF10 │ Query22 │           0.522001 │
├─────────────────┴───────────┴─────────┴────────────────────┤
│ 23 rows                                          4 columns │
└────────────────────────────────────────────────────────────┘
Mimoune Djouallah's results, with my additions:
Conclusion
I don't really know how much of this quite significant performance improvement comes from:
- hardware (surely)
- storage (DuckDB database vs Parquet)
- test conditions (thermal throttling or not…)
- data sorting (probably)
but in the end, using DuckDB, my laptop was able to surpass Databricks! And a more recent PC (under €1000) is faster than BigQuery, SingleStore and Databricks.
Of course, it is not that simple:
- DuckDB's storage format is not yet stable: it should stabilize with v1.0, and the current version is v0.7.1, so there are a few months to wait.
- DuckDB is designed for local use, while Databricks/BigQuery/SingleStore offer a multi-user experience and much greater ingestion capabilities. But sometimes there is no need for a cluster to deal with 5GB (or even 50GB) of data.
- Maybe I biased the test by sorting the data, but this is a common real-life practice to improve performance. So why not?
Next step: ask François to test DuckDB on more recent hardware (i9/DDR5/PCIe 4.0) on a Linux box 🙂