In the previous article **Semantic Vector Search with SQL Server **we discuss about the possibility to make semantic search with vectors that comes from OpenAI text embedding model « text-embedding-ada-002 ». We also have seen how the** dimensional reduction** has made it possible to improve performance and above all to reduce cpu consumption compared to research on large vectors (1536 in this case).

In this blog post, I would like to discuss another approach** to improve vector search performance with SQL Server by using KMeans to reduce search space.**

## Using KMeans to segment space

As i already said, SQL Server currently does not have Vector datatype nor index able to index arrays like we have. The only solution for us to improve performance of the search is to build tables and programmatic stuff to make the search faster.

We want to improve performance of the search in our 25000 wikipedia articles store in our database table. For that, one technic is to reduce the search space. **Kmeans** could help us by computing x **centroids** and also giving us the **relationship** between all articles we have in the database and the centroids.

In the image (an exemple coming from an article wrote by **François Pacull**) we can see centroids in blue and the green points could be articles. This is a simplification of the problem because we have 1536 dimensions for the text embedding vector ada-002 but the idea is here.

## How to compute KMeans ? Python to the rescue !

There is no easy way to compute KMeans in SQL. Despite my research on the subject I have not found an implementation of KMeans in SQL. Maybe someone can enlighten me but after several hours of research **I resigned myself to using Python and SciKit Learn**. This language is, let’s face it, much more suitable for this type of calculation.

Our objective is to build structures (tables in our case) :

- One with centroids (id and vectors components).
- Another with the mapping centroids ==> articles

We use several modules that you will have to install beforehand:

pip install scikit-learn pip install joblib pip install sqlalchemy pip install pymssql pip install pandas

import numpy as np import pandas as pd from sklearn.cluster import KMeans from sqlalchemy import create_engine,text from pandas import DataFrame from joblib import Parallel, delayed import json k=1 # Number of SubVectors (only one for simplicity, several in memory pressure is to high with Kmeans) nb_clusters=32 # Number of Clusters

%%time server="localhost" database="ExternalData" connect_string = f"mssql+pymssql://{server}/{database}" engine = create_engine(connect_string) sql='''SELECT id article_id, content_vector FROM wikipedia_articles_embeddings''' df = pd.read_sql_query(sql=text(sql), con=engine.connect())

def parse_vector(x): return np.fromstring(x[1:-1], sep=",") def prepare_data(df): df["content_vector"] = Parallel(n_jobs=-1)(delayed(parse_vector)(x) for x in df["content_vector"]) return df

%%time # Convert string list content_vector to numpy ndarrays prepared_data = prepare_data(df)

def compute_assign_centroids(df, k, nb_clusters): all_centroids = [] mapping = [] vector_length = len(df.loc[0, "content_vector"]) subvector_length = vector_length // k # For each subvector of each content_vector, compute centroids for i in range(k): start = i * subvector_length end = start + subvector_length subvector = np.stack(df["content_vector"].apply(lambda x: x[start:end])) kmeans = KMeans(n_clusters=nb_clusters, random_state=0, n_init="auto") y = kmeans.fit_predict(subvector) # Map articles to theirs centroids df["centroid_id_sv_"+str(i)] = y # For each centroid, add its subvector_id and vector_value to the DataFrame for centroid_id, centroid in enumerate(kmeans.cluster_centers_): for subvector_id, vector_value in enumerate(centroid): all_centroids.append([centroid_id, i, subvector_id, vector_value]) return DataFrame(all_centroids, columns=["centroid_id", "subvector_id","vector_value_id","vector_value"])

%%time centroids = compute_assign_centroids(prepared_data, k,nb_clusters)

#Write back results to SQL centroids_table_schema={"centroid_id":Integer(),"subvector_id":Integer(),"vector_value_id":Integer(), "vector_value":Float()} mapping_articles_centroids_schema={"article_id":Integer(),"centroid_id_sv_0":Integer()} centroids.to_sql(f'wikipedia_articles_centroids_{nb_clusters}_{k}', engine, if_exists='replace', index=False, chunksize=1000, method="multi",dtype=centroids_table_schema) prepared_data_rel = prepared_data.drop(columns=['content_vector']) prepared_data_rel.to_sql(f'wikipedia_mapping_articles_centroids_{nb_clusters}_{k}', engine, if_exists='replace', index=False, chunksize=1000, method="multi",dtype=mapping_articles_centroids_schema)

**Performance Profiling of the Python program**

With the profiling of each steps we can see that **KMeans fit_predict is the most CPU intentive** **step (68s)** but quite efficient (10s).

Pandas is not known to be the fastest module to read or write data to SQL, may be other modules like polars or connectorX can do better job or may be pandas tuning could be done using arrow…. But i hadn’t enough time for that at the time of writing.

## Return to SQL

We have launch the program twice : once with** nb_clusters=32 and another with nb_clusters=64. So we have 4 tables : 2 centroids tables and 2 maps(centroid_id,article_id) tables**

## SQL Search function using Kmeans tables

The idea is

- to normalize our question vector (aka Xq).
- Search against centroids. It means we will compute distance of Xq with all the centroids (only 64 here)
- Limit the list of returned centroids to top 4 ones (like an nprobe for pythonist)
- Search against all vectors that belong to the top 4 centroids (this will limit the search to 1500-2000 vectors instead of 25000

CREATE function [dbo].[SimilarContentArticlesKmeans64](@vector nvarchar(max)) returns table as return with cteVector as ( select cast([key] as int) as [vector_value_id], cast([value] as float) as [vector_value] from openjson(@vector) ), cteSimilarKmeans64 as ( select top 4 centroids.centroid_id, sum(centroids.[vector_value] * v.[vector_value]) / (sqrt(sum(centroids.[vector_value] *centroids.[vector_value]))*sqrt(sum(v.[vector_value] *v.[vector_value]))) as cosine_distance from [dbo].[wikipedia_articles_centroids_64_1] centroids inner join cteVector v on v.vector_value_id = centroids.vector_value_id group by centroids.centroid_id order by cosine_distance desc ), cteSimilar as ( select top (50) v2.article_id, sum(v1.[vector_value] * v2.[vector_value]) as cosine_distance from cteVector v1 inner join dbo.wikipedia_articles_embeddings_contents_vector v2 on v1.vector_value_id = v2.vector_value_id inner join [dbo].[wikipedia_mapping_articles_centroids_64_1] map on (map.article_id=v2.article_id) inner join cteSimilarKmeans64 topcentroids on (topcentroids.centroid_id=map.centroid_id_sv_0) group by v2.article_id order by cosine_distance desc ) select a.id, a.title, a.url, r.cosine_distance from cteSimilar r inner join dbo.wikipedia_articles_embeddings a on r.article_id = a.id GO

## Calling the search function

Use ExternalData GO drop table if exists #SimilarityUsing_All_1536Length_Vectors; drop table if exists #SimilarityUsing_Top_Centroids64k_Vectors; drop table if exists #SimilarityUsing_Top_Centroids32k_Vectors; DECLARE @OpenAIKey NVARCHAR(128)=N'OpenAI_API_KEY'; DECLARE @Xq VARCHAR(MAX); DECLARE @IJSON AS NVARCHAR(MAX); DECLARE @TopListFinal INT= 10; EXEC [dbo].[usp_OpenAI_Embedding] @Model = N'text-embedding-ada-002', @OpenAIKey = @OpenAIKey , @InputText = N'James Bond 007', @ResponseBody = @Xq OUTPUT; SELECT @Xq as N'@ResponseBody'; select top(@TopListFinal) * into #SimilarityUsing_All_1536Length_Vectors from [dbo].SimilarContentArticles(json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc; select top(@TopListFinal) * into #SimilarityUsing_Top_Centroids64k_Vectors from [dbo].[SimilarContentArticlesKmeans64](json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc option(maxdop 1); select top(@TopListFinal) * into #SimilarityUsing_Top_Centroids32k_Vectors from [dbo].[SimilarContentArticlesKmeans32](json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc option(maxdop 1); select * from #SimilarityUsing_All_1536Length_Vectors order by cosine_distance desc; select * from #SimilarityUsing_Top_Centroids64k_Vectors order by cosine_distance desc; select * from #SimilarityUsing_Top_Centroids32k_Vectors order by cosine_distance desc; SELECT 1.0*count(*)/@TopListFinal v64_in_v1536 from #SimilarityUsing_Top_Centroids64k_Vectors v64 inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v64.id=v1536.id SELECT 1.0*count(*)/@TopListFinal v32_in_v1536 from #SimilarityUsing_Top_Centroids32k_Vectors v32 inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v32.id=v1536.id go

## Results for search "The foundation series by Isaac Asimov"

## Multiple searches results

## How to keep KMeans Tables up to date with living data ?

A question will quickly arise: how to maintain the data of the KMeans tables which serve as the basis for our in-house indexing ?

As a first approach, I saw things in 2 stages :

- At first only the mapping table would be modified via SQL only. Take the cs of a new article, we would use the embedding calculated by OpenAI to determine its vector. We would then calculate its attachment centroid (calculation of distance and the table of centroids) and we would add a record in the centroids-articles mapping table. The centroids table would not be affected.
- In a second step, in batch mode, we would rebuild the 2 tables via Python (Sklearn.Cluster.KMeans)

## Conclusion

As a conclusion, we can say that using KMeans and centroids as an indexing layer works fine. The quality of searches is quite good and depends on the number of centroids and the number of top centroids used (nprobe).

However, we are far from the performance of SQL Server’s native indexes. In fact, the native b-tree or columnstore indexes use optimized structures and the search operators that use them are compiled in C++, thus offering performances 10 to 20x higher than those obtained with a solution based on an SQL procedure.

However, one avenue remains possible to try to improve performance: natively compiled stored procedures and in-memory tables…

To be continued in the next episode !