After benchmarking several cloud databases (Snowflake, BigQuery, SingleStore, Databricks) using TPCH SF10 data, after benchmarking DuckDB and Tableau Hyper on my own machine, I ask to myselft : « hey, why not testing using the official SMP Databases champion of TPCH for years : SQL Server ! »

So I start to bulk insert the data into SQL Server 2022 (CU1), on my good laptop where I already test DuckDB and Hyper.

 

After a small period of test I found that SQL Server 2022 was a great competitor BUT one query, the Query 13, have a special behavior compare to other competitors.

In this article I will explain my findings and may be so performance tricks for others (I hope)

The TPCH Query 13

TPCH Query 13 on SQL Server 2022 without any tuning

For sure Query 13 of TPCH is horrible but the same query, on the same computer take between 2 and 3s using DuckDB or Tableau Hyper whereas with SQL Server 2022 it take between 18s à 20s and consume a tons of CPU (184s)

WHY such a difference !

TPCH Q13 MSSQL no tuning : execution plan

The actual execution plan show that the time is spent in the columnstore index scan.

Strange because, this columnstore index is quite small and used to go very fast !

Collations

Collation refers to the rules that define how character data is sorted and compared in SQL Server. Collations can be set at the instance, database, column, or expression level, and they can have a significant impact on query performance and the behavior of string comparison operations.

As a French guy, my instance, database and by the way columns of my TPCH database are in FRENCH_CI_AS collation.

The collation FRENCH_CI_AS is a linguistic collation that takes into account the linguistic and cultural rules of the French language. It is case-insensitive, accent-sensitive, and width-insensitive, which means that it ignores differences in uppercase and lowercase letters, but distinguishes between accented and unaccented characters.

The collation FRENCH_100_BIN, on the other hand, is a binary collation, which means that it sorts and compares characters based on their underlying binary values. In other words, it performs a direct byte-by-byte comparison of characters without taking into account any linguistic or cultural considerations. This collation is case-sensitive, accent-sensitive, and width-sensitive, which means that it distinguishes between uppercase and lowercase letters, accented and unaccented characters, and characters of different widths (such as half-width and full-width).

In general, when comparing and sorting text data in SQL Server, it is recommended to use a linguistic collation like FRENCH_CI_AS, as it takes into account the linguistic and cultural rules of the language and can provide more accurate and natural results. However, in some cases, a binary collation like FRENCH_100_BIN may be useful for specific performance or compatibility reasons.

COLLATE FRENCH_100_BIN

TPCH Query 13 MSSQL using a COLLATE with Binary collation

EXCELLENT !
The COLLATE FRENCH_100_BIN divided the elapsed time and the CPU time by 7 :
From 19.7s to 2.8s

The execution plan of the TPCH Query 13 MSSQL tuned

The execution plan have also change, introducing a filter node where the comparison is done in a binary way (much faster)

What if the column is already stored with a BINARY COLLATION

Binary Collation on column vs collate in query vs linguistic collation

On a simple query that emulate the hard part of TPCH Query 13, we can see that performance is even better when the column have a binary collation directly defined and stored compare to a collate predicate in the query.

Let's Bench !

I use my laptop as bench platform. Easier to compare with others databases I have already benchmarked on the same laptop :-)

Conclusion

I have now a better understanding of the influence of the collation on the performances. For my TPCH test I had two choices :

  • Rebuild all my database using a Binary Collation and keep queries untouched
  • Touch some queries to collate columns when needed

I supposed that others databases do a binary choice for theirs collations for performances reason. SQL Server let you the choice, it’s up to you now !