Performance
Semantic Vector Search with SQL Server and the help of KMeans indexing
In the previous article, Semantic Vector Search with SQL Server, we discussed the possibility of performing semantic search with vectors that come from the OpenAI text embedding model « text-embedding-ada-002 ». We also saw how the dimensional…
SQL Server vector search
Some time ago I read a blog article by Davide Mauri, a senior Microsoft Azure programmer. The article interested me because the « semantic search » natively available in SQL Server since SQL Server 2016 is not widely used and honestly…
TPCH SF10 : Query 13 and SQL Server Collations Performance Impact
After benchmarking several cloud databases (Snowflake, BigQuery, SingleStore, Databricks) using TPCH SF10 data, and after benchmarking DuckDB and Tableau Hyper on my own machine, I asked myself : « hey, why not test using the official SMP Databases…
SQL Server Extended Events Dashboard
SQL Server Extended Events is a powerful feature in Microsoft SQL Server that enables database administrators to capture and analyze events that occur within the server. Extended Events provides a lightweight and customizable infrastructure that can…
Cloud Comparator : CPU, RAM, Price of VMs
Cloud-Mercato.com is a mine of information about pricing in the cloud world. The following dashboard is based on a small extract of their data on the CPU, RAM and prices of VMs from dozens of cloud providers. The IaaS cloud comparator :…
Intel Processors Comparator
Intel processors have several characteristics beyond frequency and number of cores. The Intel Processors Comparator is a dashboard designed to help you compare processors and filter them by number of cores, processor…
Talend Quick Tips : Increase BigQuery Talend timeout in components
Increase BigQuery Talend timeout within components. In this article I will show you how to change a critical setting in Talend : the timeout. We will see how to change it for BigQuery components, since it’s not available in the Studio…
More Heapsort in Cython
This post/notebook is the follow-up to a recent one : Heapsort with Numba and Cython, where we implemented heapsort in Python/Numba and Cython and compared the execution time with NumPy heapsort. However, heapsort in NumPy is written in C++…
Loading data from PostgreSQL to Pandas with ConnectorX
ConnectorX is a library, written in Rust, that enables fast and memory-efficient data loading from various databases to different dataframes. We refer to this interesting paper, in which the authors provide a detailed analysis of the pandas.read_sql…
Export data as fast as possible : from HANA to CSV
What is the fastest method to export HANA to CSV ? I use a HANA 2.0 database and want to export from HANA to CSV. The source is a table or a SQL query, the target is an external client, and the export should of course be as fast as possible and run from a command line (I’m on…
Heapsort with Numba and Cython
Updated April 20, 2022 following useful comments by @scoder. Thank you very much for your pull requests! Heapsort is a classical sorting algorithm. We are going into a little bit of theory about the algorithm, but refer to Cormen et al. [1] for…
Tableau 2022 Optimize Workbook Feature : first test
« Optimize Workbook » is a new feature of Tableau 2022.1. The idea behind this feature is to check your workbooks against good and bad practices before publication. BI tools can be complex to optimize when many calculated fields,…
Tableau Performance Tips #8 : Avoid using Tableau Groups
Tableau Custom Groups : a really cool and useful feature ! Tableau allows users to build their own custom groups very easily. This is a convenient way to group data on elements that users want to see together. When you have small data…
SQL Server Trace Flags Classification
Trace flags are used to set specific server characteristics or to alter a particular behavior. For example, trace flag 3226 is a commonly used startup trace flag which suppresses successful backup messages in the error log. Trace flags…
Tableau Performance Tips #7 : Prefer to use a one row datasource to display static metadata informations
A problem of choice. Tableau does not have truly customisable buttons: you can put an image on your dashboard, but you cannot display the text you want with the formatting you want. One easy way to put static…
Tableau Performance Tips #6 : Avoid using NOW() for filtering or selecting against a fact datasource
Have you heard about the database result cache ? Some databases implement a result cache (or query cache, depending on the name): a cache for the results of some queries. Oracle, Exasol and HANA implement a result/query cache. MySQL &…
Tableau Performance Tips #5 : Sort only with element(s) in the view
You want to sort data using a field ? Avoid using a sort field that is not in the view ! A little-known performance issue is that Tableau sorting is not done by the datasource but by Tableau itself. And this even if you ask a…
XML data compression in SQL Server
Alexandre Blois. Problem : We store XML data in a table that is starting to grow seriously, and it is becoming important to think about solutions to save some storage space and prevent the table…
Tableau Performance Tip #4 : Avoid using a big datasource to display semi-constant informations
Tableau performance and constant calculated field. More often than not you will want to display semi-constant information like the current timestamp, the Tableau username, a chosen currency or a single piece of information. For that the easy way…
Tableau Performance Tips #3 : Avoid small list of values to be in the context
The issue of using lists of values that are linked to the context. Let’s begin with a definition of what a list of values is. A list of values is linked to a filter. The filter can be in the context or not. Hum… May I…
Tableau Performance Tips #2 : Avoid total and sub-totals when using a count distinct metric aggregate
The problem : Computing-Totals is done sequentially. If you use a total (or worse, sub-totals) with one or more count distinct metrics, this leads to a second sequential pass (or several), one for each subtotal level, to retrieve data.…
Applying a row-wise function to a Pandas dataframe
More than 3 years ago, we posted a comparative study about Looping over Pandas data using a CPU. Because a lot has changed since 2018, this post is a kind of update. For example, the Pandas version was 0.23.3 at that time; it is now 1.4.0.…
Tableau Performance Tips #1 : Tableau Performance Recording
In this article we will discover how to diagnose performance with Tableau performance recording. Both Tableau Desktop and Tableau Server can use the performance recorder. We will learn : what performance information can be…
A Parallel loop in Python with Joblib.Parallel
The goal of this post is to perform an embarrassingly parallel loop in Python, with the same code running on different platforms [Linux and Windows]. From Wikipedia, here is a definition of embarrassingly parallel: In parallel computing, an…
Tableau Server performance impacted by version history depth of datasources and workbooks
Tableau Server performance impacted by the revision/version history depth of objects. After several tests on a real-world Tableau production environment (1000+ workbooks, 100+ shared datasources), we discovered that object version history has an…
Optuna and XGBoost on a tabular dataset
Updated Sep 16, 2021 following a comment by @k_nzw about XGBoostPruningCallback. The purpose of this Python notebook is to give a simple example of hyperparameter optimization [HPO] using Optuna and XGBoost. We are going to perform a regression…
SQL Server CLR Functions vs SQL 2019 Function Inlining
Boost your SQL Server performance by using compiled CLR C# functions right inside your favorite RDBMS instead of classic SQL functions ! This tutorial shows all the steps to : write a…
Merge Sort with Cython and Numba
In this post, we present an implementation of the classic merge sort algorithm in Python on NumPy arrays, and make it run reasonably "fast" using Cython and Numba. We are going to compare the run time with the numpy.sort(kind='mergesort')…
Cython and Numba applied to a simple algorithm: Insertion sort
The aim of this notebook is to show a basic example of Cython and Numba, applied to a simple algorithm: Insertion sort. As we will see, the code transformation from Python to Cython or Python to Numba can be really easy [specifically for the…
Loading data into a Pandas DataFrame - a performance study
Because doing machine learning implies trying many options and algorithms with different parameters, from data cleaning to model validation, Python programmers will often load a full dataset into a Pandas dataframe, without actually…
GPU Analytics Ep 3, Apply a function to the rows of a dataframe
The goal of this post is to compare the execution time between Pandas (CPU) and RAPIDS (GPU) dataframes, when applying a simple mathematical function to the rows of a dataframe. Since the row-wise applied function is a re-projection of geographical…
GPU Analytics Ep 2, Load some data from OmniSci into a GPU dataframe
Although the post title is about loading some data from a GPU database into a GPU dataframe, most of it is about running JupyterLab on a GPU AWS instance, which is a little bit cumbersome to set up. Finally, once JupyterLab is running on our…
GPU Analytics Ep 1, GPU installation of OmniSci on AWS
In this post, we are going to install the OmniSci 4.6 GPU database on an Ubuntu 18.04 AWS instance. These are the actual command lines I entered when performing the installation. But let's start by introducing the motivation behind GPU databases:…
Looping over Pandas data
I recently stumbled on this interesting post on RealPython (excellent website by the way!): Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects. This post covers different subjects related to Pandas: creating a datetime column…