In the previous article, Semantic Vector Search with SQL Server, we discussed the possibility of performing semantic search with vectors coming from the OpenAI text embedding model « text-embedding-ada-002 ». We also saw how dimensionality reduction made it possible to improve performance and, above all, to reduce CPU consumption compared to searching over the full-size vectors (1536 dimensions in this case).

In this blog post, I would like to discuss another approach to improving vector search performance with SQL Server: using KMeans to reduce the search space.

Using KMeans to segment space

As I already said, SQL Server currently has neither a vector datatype nor an index capable of indexing arrays like the ones we store. The only way for us to improve search performance is to build tables and some programmatic machinery to make the search faster.

We want to improve the performance of the search over the 25000 Wikipedia articles stored in our database table. One technique for that is to reduce the search space. KMeans can help us by computing x centroids and also giving us the relationship between every article in the database and those centroids.

In the image (an example coming from an article written by François Pacull) we can see the centroids in blue, and the green points could be articles. This is a simplification of the problem, because the ada-002 text embedding vectors have 1536 dimensions, but the idea is there.
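To make this concrete, here is a minimal sketch (toy 2-D data instead of the real 1536-dimensional embeddings, not the script used in this article) of what KMeans gives us: the centroid coordinates on one side, and the centroid assigned to each article on the other, which are exactly the two tables we are going to build:

import numpy as np
from sklearn.cluster import KMeans

# Toy example: 1000 "articles" in 2 dimensions instead of 1536
rng = np.random.default_rng(0)
article_vectors = rng.normal(size=(1000, 2))

kmeans = KMeans(n_clusters=8, random_state=0, n_init="auto")
labels = kmeans.fit_predict(article_vectors)

print(kmeans.cluster_centers_)   # 8 centroids        -> future centroids table
print(labels[:10])               # article -> centroid -> future mapping table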

How to compute KMeans? Python to the rescue!

There is no easy way to compute KMeans in SQL. Despite my research on the subject, I have not found an implementation of KMeans in SQL. Maybe someone can enlighten me, but after several hours of research I resigned myself to using Python and scikit-learn. This language is, let's face it, much more suitable for this type of calculation.

Our objective is to build two structures (tables in our case):

  • One with the centroids (id and vector components).
  • Another with the mapping centroids ==> articles.

We use several modules that you will have to install beforehand:

pip install scikit-learn
pip install joblib
pip install sqlalchemy
pip install pymssql
pip install pandas

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sqlalchemy import create_engine, text, Integer, Float  # Integer and Float are used for the to_sql dtype mappings below
from pandas import DataFrame
from joblib import Parallel, delayed
import json

k=1 # Number of subvectors (only one here for simplicity; use several if memory pressure is too high with KMeans)
nb_clusters=32 # Number of Clusters

%%time
server="localhost"
database="ExternalData"

connect_string = f"mssql+pymssql://{server}/{database}"
engine = create_engine(connect_string)
sql='''SELECT id article_id, content_vector FROM wikipedia_articles_embeddings'''
df = pd.read_sql_query(sql=text(sql), con=engine.connect())
def parse_vector(x):
    # content_vector is stored as a string such as "[0.0123, -0.0456, ...]":
    # strip the surrounding brackets and parse the comma-separated values
    return np.fromstring(x[1:-1], sep=",")

def prepare_data(df):
    df["content_vector"] = Parallel(n_jobs=-1)(delayed(parse_vector)(x) for x in df["content_vector"])
    return df
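Since content_vector is stored as a JSON-style array string, an equivalent variant (a sketch, not the version benchmarked here) is to parse it with json.loads, which also validates the format:

def parse_vector_json(x):
    # Parse the full "[0.0123, -0.0456, ...]" string as a JSON array
    return np.array(json.loads(x), dtype=np.float64)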
%%time
# Convert string list content_vector to numpy ndarrays
prepared_data = prepare_data(df)
def compute_assign_centroids(df, k, nb_clusters):
    all_centroids = []
    mapping = []
    vector_length = len(df.loc[0, "content_vector"])
    subvector_length = vector_length // k
    
    # For each subvector of each content_vector, compute centroids
    for i in range(k):
        start = i * subvector_length
        end = start + subvector_length
        subvector = np.stack(df["content_vector"].apply(lambda x: x[start:end]))
        kmeans = KMeans(n_clusters=nb_clusters, random_state=0, n_init="auto")
        
        y = kmeans.fit_predict(subvector) # Map each article to its centroid
        df["centroid_id_sv_"+str(i)] = y
        
        # Flatten each centroid into (centroid_id, subvector_id, vector_value_id, vector_value) rows
        for centroid_id, centroid in enumerate(kmeans.cluster_centers_):
            for vector_value_id, vector_value in enumerate(centroid):
                all_centroids.append([centroid_id, i, vector_value_id, vector_value])
            
    return DataFrame(all_centroids, columns=["centroid_id", "subvector_id","vector_value_id","vector_value"])
%%time
centroids = compute_assign_centroids(prepared_data, k,nb_clusters)
# Write the results back to SQL Server
centroids_table_schema = {"centroid_id": Integer(), "subvector_id": Integer(), "vector_value_id": Integer(), "vector_value": Float()}
mapping_articles_centroids_schema = {"article_id": Integer(), "centroid_id_sv_0": Integer()}

centroids.to_sql(f'wikipedia_articles_centroids_{nb_clusters}_{k}', engine, if_exists='replace', index=False, chunksize=1000, method="multi",dtype=centroids_table_schema)
prepared_data_rel = prepared_data.drop(columns=['content_vector'])
prepared_data_rel.to_sql(f'wikipedia_mapping_articles_centroids_{nb_clusters}_{k}', engine, if_exists='replace', index=False, chunksize=1000, method="multi",dtype=mapping_articles_centroids_schema)
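Before going back to SQL, a quick sanity check (a small sketch, not part of the original program) is to look at how the articles are distributed across the centroids; very unbalanced clusters would make the reduced search space less predictable:

# Distribution of the articles across the nb_clusters centroids
# (the centroid_id_sv_0 column was added by compute_assign_centroids)
print(prepared_data["centroid_id_sv_0"].value_counts().sort_index())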

Performance Profiling of the Python program

Profiling each step shows that the KMeans fit_predict is the most CPU-intensive step (68 s of CPU time) while remaining quite efficient in elapsed time (10 s).
Pandas is not known to be the fastest module for reading or writing data to SQL; maybe other modules like Polars or ConnectorX could do a better job, or maybe pandas could be tuned using Arrow… but I did not have enough time for that at the time of writing.
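For reference, reading the same table through ConnectorX would look roughly like this (an untested sketch; the connection string, user and password are placeholders to adapt to your environment):

import connectorx as cx

# Placeholder connection string: adjust user, password, host and port
conn = "mssql://user:password@localhost:1433/ExternalData"
query = "SELECT id AS article_id, content_vector FROM wikipedia_articles_embeddings"
df = cx.read_sql(conn, query)   # returns a pandas DataFrame by default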

Return to SQL

We launched the program twice: once with nb_clusters=32 and once with nb_clusters=64. So we have 4 new tables: 2 centroids tables and 2 mapping (centroid_id, article_id) tables.

The 4 tables in blue in the diagram are the newly generated tables.

Let's take a look at the table sizes (in blue). Remember that we used columnstore compression for some tables (in red). That explains why, with many more rows, their storage footprint remains so low.

SQL search function using the KMeans tables

The idea is:

  1. Normalize our question vector (aka Xq).
  2. Search against the centroids, i.e. compute the distance between Xq and all the centroids (only 64 here).
  3. Keep only the top 4 centroids returned (like an nprobe, for Pythonistas).
  4. Search against all the vectors that belong to those top 4 centroids (this limits the search to roughly 1500-2000 vectors instead of 25000).
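Expressed in Python/NumPy (a sketch of the same logic, assuming the centroids, the article vectors and their centroid assignments are already loaded in memory; the function name and signature are hypothetical), the two-stage search looks like this:

import numpy as np

def two_stage_search(xq, centroids, article_vectors, article_centroid_ids, nprobe=4, top=50):
    # xq: question vector (1536,), centroids: (nb_clusters, 1536)
    # article_vectors: (n_articles, 1536) normalized embeddings
    # article_centroid_ids: (n_articles,) centroid assigned to each article
    xq = xq / np.linalg.norm(xq)                                    # step 1: normalize Xq

    # steps 2-3: cosine similarity of Xq against every centroid, keep the top nprobe
    centroid_scores = centroids @ xq / np.linalg.norm(centroids, axis=1)
    top_centroids = np.argsort(centroid_scores)[::-1][:nprobe]

    # step 4: score only the articles attached to those centroids
    candidates = np.where(np.isin(article_centroid_ids, top_centroids))[0]
    scores = article_vectors[candidates] @ xq                       # dot product, vectors already normalized
    order = np.argsort(scores)[::-1][:top]
    return candidates[order], scores[order]

The SQL function below implements the same steps against the tables we just created.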
CREATE function [dbo].[SimilarContentArticlesKmeans64](@vector nvarchar(max))
returns table
as
return with cteVector as
(
    select 
        cast([key] as int) as [vector_value_id],
        cast([value] as float) as [vector_value]
    from 
        openjson(@vector)
),
cteSimilarKmeans64 as
(
select top 4
    centroids.centroid_id, 
    sum(centroids.[vector_value] * v.[vector_value]) /
    (sqrt(sum(centroids.[vector_value] *centroids.[vector_value]))*sqrt(sum(v.[vector_value] *v.[vector_value])))
    as cosine_distance
from 
    [dbo].[wikipedia_articles_centroids_64_1] centroids
inner join 
    cteVector v on v.vector_value_id = centroids.vector_value_id
group by
    centroids.centroid_id
order by
    cosine_distance desc
),
cteSimilar as
(
select top (50)
    v2.article_id, 
    sum(v1.[vector_value] * v2.[vector_value]) as cosine_distance
from 
    cteVector v1 
    inner join dbo.wikipedia_articles_embeddings_contents_vector v2 on v1.vector_value_id = v2.vector_value_id
    inner join [dbo].[wikipedia_mapping_articles_centroids_64_1] map on (map.article_id=v2.article_id)
    inner join cteSimilarKmeans64 topcentroids on (topcentroids.centroid_id=map.centroid_id_sv_0)

group by
    v2.article_id
order by
    cosine_distance desc
)
select 
    a.id,
    a.title,
    a.url,
    r.cosine_distance
from 
    cteSimilar r
inner join 
    dbo.wikipedia_articles_embeddings a on r.article_id = a.id
GO


Calling the search function

Use ExternalData
GO

drop table if exists #SimilarityUsing_All_1536Length_Vectors;
drop table if exists #SimilarityUsing_Top_Centroids64k_Vectors;
drop table if exists #SimilarityUsing_Top_Centroids32k_Vectors;



DECLARE @OpenAIKey NVARCHAR(128)=N'OpenAI_API_KEY';

DECLARE @Xq VARCHAR(MAX);
DECLARE @IJSON AS NVARCHAR(MAX);
DECLARE @TopListFinal INT= 10;


EXEC	[dbo].[usp_OpenAI_Embedding]
        @Model = N'text-embedding-ada-002',
        @OpenAIKey = @OpenAIKey ,
        @InputText = N'James Bond 007',
        @ResponseBody = @Xq OUTPUT;

SELECT	@Xq as N'@ResponseBody';


select top(@TopListFinal) * into #SimilarityUsing_All_1536Length_Vectors   from [dbo].SimilarContentArticles(json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc;
select top(@TopListFinal) * into #SimilarityUsing_Top_Centroids64k_Vectors from [dbo].[SimilarContentArticlesKmeans64](json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc option(maxdop 1);
select top(@TopListFinal) * into #SimilarityUsing_Top_Centroids32k_Vectors from [dbo].[SimilarContentArticlesKmeans32](json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc option(maxdop 1);

select * from #SimilarityUsing_All_1536Length_Vectors	order by cosine_distance desc;
select * from #SimilarityUsing_Top_Centroids64k_Vectors order by cosine_distance desc;
select * from #SimilarityUsing_Top_Centroids32k_Vectors order by cosine_distance desc;

SELECT 1.0*count(*)/@TopListFinal v64_in_v1536 from #SimilarityUsing_Top_Centroids64k_Vectors v64 inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v64.id=v1536.id

SELECT 1.0*count(*)/@TopListFinal v32_in_v1536 from #SimilarityUsing_Top_Centroids32k_Vectors v32 inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v32.id=v1536.id

go

Results for the search "The foundation series by Isaac Asimov"

Multiple search results

How to keep the KMeans tables up to date with living data?

A question quickly arises: how do we maintain the data in the KMeans tables that serve as the basis for our home-made indexing?

As a first approach, I see things in 2 stages:

  • At first, only the mapping table would be modified, via SQL only. Take the case of a new article: we would use the embedding calculated by OpenAI as its vector, compute its nearest centroid (a distance calculation against the centroids table) and add a record to the centroids-articles mapping table, as sketched after this list. The centroids table itself would not be affected.
  • In a second step, in batch mode, we would rebuild the 2 tables via Python (sklearn.cluster.KMeans).
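For that first stage, a minimal sketch (a hypothetical helper, not code from this article) of assigning a freshly embedded article to its nearest existing centroid could look like this:

import numpy as np

def assign_new_article(new_embedding, centroids):
    # new_embedding: (1536,) vector returned by text-embedding-ada-002 for the new article
    # centroids: (nb_clusters, 1536) array loaded from the centroids table
    new_embedding = np.asarray(new_embedding, dtype=float)
    sims = centroids @ new_embedding / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(new_embedding))
    return int(np.argmax(sims))   # centroid_id to insert into the mapping table

The returned centroid_id would then be inserted, together with the new article_id, into the wikipedia_mapping_articles_centroids table, leaving the centroids table untouched until the next batch rebuild.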

Conclusion

As a conclusion, we can say that using KMeans centroids as an indexing layer works fine. The quality of the searches is quite good and depends on the number of centroids and on the number of top centroids probed (nprobe).
However, we are far from the performance of SQL Server's native indexes. The native b-tree and columnstore indexes use optimized structures, and the search operators that use them are compiled C++ code, offering performance 10 to 20x higher than what we obtain with a solution based on a SQL procedure.
However, one avenue remains open to try to improve performance: natively compiled stored procedures and in-memory tables…
To be continued in the next episode!