After benchmarking several cloud databases (Snowflake, BigQuery, SingleStore, Databricks) using TPCH SF10 data, after benchmarking DuckDB and Tableau Hyper on my own machine, I ask to myselft : « hey, why not testing using the official SMP Databases champion of TPCH for years : SQL Server ! »
So I start to bulk insert the data into SQL Server 2022 (CU1), on my good laptop where I already test DuckDB and Hyper.
After a small period of test I found that SQL Server 2022 was a great competitor BUT one query, the Query 13, have a special behavior compare to other competitors.
In this article I will explain my findings and may be so performance tricks for others (I hope)
The TPCH Query 13
Collations
Collation refers to the rules that define how character data is sorted and compared in SQL Server. Collations can be set at the instance, database, column, or expression level, and they can have a significant impact on query performance and the behavior of string comparison operations.
As a French guy, my instance, database and by the way columns of my TPCH database are in FRENCH_CI_AS collation.
The collation FRENCH_CI_AS is a linguistic collation that takes into account the linguistic and cultural rules of the French language. It is case-insensitive, accent-sensitive, and width-insensitive, which means that it ignores differences in uppercase and lowercase letters, but distinguishes between accented and unaccented characters.
The collation FRENCH_100_BIN, on the other hand, is a binary collation, which means that it sorts and compares characters based on their underlying binary values. In other words, it performs a direct byte-by-byte comparison of characters without taking into account any linguistic or cultural considerations. This collation is case-sensitive, accent-sensitive, and width-sensitive, which means that it distinguishes between uppercase and lowercase letters, accented and unaccented characters, and characters of different widths (such as half-width and full-width).
In general, when comparing and sorting text data in SQL Server, it is recommended to use a linguistic collation like FRENCH_CI_AS, as it takes into account the linguistic and cultural rules of the language and can provide more accurate and natural results. However, in some cases, a binary collation like FRENCH_100_BIN may be useful for specific performance or compatibility reasons.
COLLATE FRENCH_100_BIN
What if the column is already stored with a BINARY COLLATION
Let's Bench !
I use my laptop as bench platform. Easier to compare with others databases I have already benchmarked on the same laptop 🙂



Conclusion
I have now a better understanding of the influence of the collation on the performances. For my TPCH test I had two choices :
- Rebuild all my database using a Binary Collation and keep queries untouched
- Touch some queries to collate columns when needed
I supposed that others databases do a binary choice for theirs collations for performances reason. SQL Server let you the choice, it’s up to you now !