Performance

Semantic Vector Search with SQL Server and the help of KMeans indexing

In the previous article, Semantic Vector Search with SQL Server, we discussed the possibility of doing semantic search with vectors that come from the OpenAI text embedding model "text-embedding-ada-002". We also saw how the dimensional…


SQL Server vector search

Some time ago I read a blog article by Davide Mauri, a senior Microsoft Azure programmer. The article interested me because the "semantic search" natively available in SQL Server since SQL Server 2016 is not widely used and honestly…


TPCH SF10 : Query 13 and SQL Server Collations Performance Impact

After benchmarking several cloud databases (Snowflake, BigQuery, SingleStore, Databricks) using TPCH SF10 data, and after benchmarking DuckDB and Tableau Hyper on my own machine, I asked myself: "Hey, why not test using the official SMP databases…


SQL Server Extended Events Dashboard

SQL Server Extended Events is a powerful feature in Microsoft SQL Server that enables database administrators to capture and analyze events that occur within the server. Extended Events provides a lightweight and customizable infrastructure that can…


Cloud Comparator : CPU, RAM, Price of VMs

Cloud-Mercato.com is a mine of information about pricing in the cloud world. The following dashboard is based on a small extract of their data on the CPU, RAM, and prices of VMs from dozens of cloud providers. The IaaS cloud comparator:…


Intel Processors Comparator

Intel processors have several characteristics beyond frequency and number of cores. The Intel Processors Comparator is a dashboard designed to help you compare processors and filter them by number of cores, processor…


Talend Quick Tips : Increase BigQuery Talend timeout in components

Increase BigQuery Talend timeout within components. In this article I will show you how to change a critical setting in Talend: the timeout. We will see how to change it for BigQuery components, since it's not available in the Studio…


More Heapsort in Cython

This post/notebook is the follow-up to a recent one : Heapsort with Numba and Cython, where we implemented heapsort in Python/Numba and Cython and compared the execution time with NumPy heapsort. However, heapsort in NumPy is written in C++…


Loading data from PostgreSQL to Pandas with ConnectorX

ConnectorX is a library, written in Rust, that enables fast and memory-efficient data loading from various databases to different dataframes. We refer to this interesting paper, in which the authors provide a detailed analysis of the pandas.read_sql…


Export data as fast as possible : from HANA to CSV

What is the fastest method to export HANA to CSV? I use a HANA 2.0 database and I want to export from HANA to CSV. The source is a table or a SQL query, the target is an external client, and of course it should be as fast as possible and use a command line (I'm on…


Heapsort with Numba and Cython

Updated April 20, 2022 following useful comments by @scoder. Thank you very much for your pull requests! Heapsort is a classical sorting algorithm. We are going to cover a little bit of theory about the algorithm, but refer to Cormen et al. [1] for…


Tableau 2022 Optimize Workbook Feature : first test

"Optimize Workbook" is a new feature of Tableau 2022.1. The idea behind this feature is to check your workbooks against good and bad practices before publication. BI tools can be complex to optimize when many calculated fields,…


Tableau Performance Tips #8 : Avoid using Tableau Groups

Tableau Custom Groups: a really cool and useful feature! Tableau allows users to build their own custom groups very easily. This is a convenient way to group data on elements that users want to see together. When you have small data…


SQL Server Trace Flags Classification

Trace flags are used to set specific server characteristics or to alter a particular behavior. For example, trace flag 3226 is a commonly used startup trace flag which suppresses successful backup messages in the error log. Trace flags…


Tableau Performance Tips #7 : Prefer to use a one row datasource to display static metadata informations

A problem of choice: Tableau does not have truly customisable buttons. You can put an image on your dashboard, but you cannot display the text you want with the formatting you want that way. One easy way to put static…


Tableau Performance Tips #6 : Avoid using NOW() for filtering or selecting against a fact datasource

Have you heard about the database result cache? Some databases implement a result cache (or query cache, depending on the vendor), which is a cache for the results of some queries. Oracle, Exasol, and HANA implement a result/query cache. MySQL &…


Tableau Performance Tips #5 : Sort only with element(s) in the view

Do you want to sort data using a field? Avoid using a sort field that is not in the view! A little-known performance issue is that Tableau sorts are not done by the datasource but by Tableau itself. And this even if you ask a…


XML Data Compression in SQL Server

Alexandre Blois. Problem: we store XML data in a table that is starting to grow seriously, and it is becoming important to think about solutions to save some storage space, to prevent the table…


Tableau Performance Tip #4 : Avoid using a big datasource to display semi-constant informations

Tableau performance and constant calculated fields. Quite often you will want to display semi-constant information like the current timestamp, the Tableau username, a chosen currency, or a single simple value. For that the easy way…


Tableau Performance Tips #3 : Avoid small list of values to be in the context

The issue of using lists of values that are linked to the context. Let's begin with a definition of what a list of values is. A list of values is linked to a filter. The filter can be in the context or not. Hum… May I…


Tableau Performance Tips #2 : Avoid total and sub-totals when using a count distinct metric aggregate

The problem: computing totals is done sequentially. If you use a total (or worse, sub-totals) when you have one or more count distinct metrics, this will lead to a second (or several) sequential pass for each subtotal level to retrieve data.…


Applying a row-wise function to a Pandas dataframe

More than 3 years ago, we posted a comparative study about Looping over Pandas data using a CPU. Because a lot has changed since 2018, this post is a kind of update. For example, the Pandas version was 0.23.3 at that time; it is now 1.4.0.…


Tableau Performance Tips #1 : Tableau Performance Recording

In this article we will discover how to diagnose performance problems with the Tableau performance recording. Both Tableau Desktop and Tableau Server can use the performance recorder. We will learn: what performance information can be…


A Parallel loop in Python with Joblib.Parallel

The goal of this post is to perform an embarrassingly parallel loop in Python, with the same code running on different platforms [Linux and Windows]. From Wikipedia, here is a definition of embarrassingly parallel: In parallel computing, an…


Tableau Server performance impacted by version history depth of datasources and workbooks

Tableau Server performance is impacted by the revision/version history depth of objects. After several tests on real-world Tableau production environments (1000+ workbooks, 100+ shared datasources), we discovered that object version history has an…


Optuna and XGBoost on a tabular dataset

Updated Sep 16, 2021 following a comment by @k_nzw about XGBoostPruningCallback. The purpose of this Python notebook is to give a simple example of hyperparameter optimization [HPO] using Optuna and XGBoost. We are going to perform a regression…


SQL Server CLR Functions vs SQL 2019 Function Inlining

Boost your SQL Server performance by using compiled CLR C# functions right inside your favorite RDBMS instead of classic SQL functions! This tutorial shows all the steps to: write a…


Merge Sort with Cython and Numba

In this post, we present an implementation of the classic merge sort algorithm in Python on NumPy arrays, and make it run reasonably "fast" using Cython and Numba. We are going to compare the run time with the numpy.sort(kind='mergesort')…


Cython and Numba applied to a simple algorithm: Insertion sort

The aim of this notebook is to show a basic example of Cython and Numba, applied to a simple algorithm: Insertion sort. As we will see, the code transformation from Python to Cython or Python to Numba can be really easy [specifically for the…


Loading data into a Pandas DataFrame - a performance study

Because doing machine learning implies trying many options and algorithms with different parameters, from data cleaning to model validation, the Python programmers will often load a full dataset into a Pandas dataframe, without actually…


GPU Analytics Ep 3, Apply a function to the rows of a dataframe

The goal of this post is to compare the execution time between Pandas (CPU) and RAPIDS (GPU) dataframes, when applying a simple mathematical function to the rows of a dataframe. Since the row-wise applied function is a re-projection of geographical…


GPU Analytics Ep 2, Load some data from OmniSci into a GPU dataframe

Although the post title is about loading some data from a GPU database into a GPU dataframe, most of it is about running JupyterLab on a GPU AWS instance, which is a little bit cumbersome to set up. Finally, once JupyterLab is running on our…


GPU Analytics Ep 1, GPU installation of OmniSci on AWS

In this post, we are going to install the OmniSci 4.6 GPU database on an Ubuntu 18.04 AWS instance. These are the actual command lines I entered when performing the installation. But let's start by introducing the motivation behind GPU databases:…


Looping over Pandas data

I recently stumbled on this interesting post on RealPython (excellent website by the way!): Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects. This post has different subjects related to Pandas: creating a datetime column…