Some time ago I read a blog article from Davide Mauri, a senior Microsoft Azure programmer. The article interested me because the « semantic search » natively available in SQL Server since SQL Server 2016 is not widely used and, honestly, does not bring much value compared to a full-text index.
Davide Mauri's article was much more interesting: it explains how to use OpenAI's 1536-component vectors with SQL Server to find the similarity between a search term and the documents in your database.
You must first encode your documents (or parts of them). For this article the document corpus was 25,000 Wikipedia articles, each with a title, a URL and a content text.
The encoder model was text-embedding-ada-002 from OpenAI: it returns a vector of 1536 real numbers for each article.
SQL Server has neither a vector nor an array datatype, so you need to normalize each vector into a key-value table if you want to exploit the vector data. The final key-value table stores the vectors (1536 components each) of the 25,000 Wikipedia articles in a 38.4M-row (25,000 x 1536) table.
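For reference, here is a minimal sketch of the two tables involved. Only the table and column names come from the queries below; the exact datatypes are my assumption.

-- Sketch (assumed datatypes): one row per article...
CREATE TABLE dbo.wikipedia_articles_embeddings
(
    id      int            NOT NULL PRIMARY KEY,
    title   nvarchar(1000) NOT NULL,
    url     nvarchar(1000) NOT NULL,
    content nvarchar(max)  NOT NULL
);

-- ...plus one row per (article, vector component): 25000 x 1536 = 38.4M rows
CREATE TABLE dbo.wikipedia_articles_embeddings_contents_vector
(
    article_id      int   NOT NULL, -- references wikipedia_articles_embeddings.id
    vector_value_id int   NOT NULL, -- component position, 0 to 1535
    vector_value    float NOT NULL  -- component value
);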
Davide uses Azure SQL Database with the sp_invoke_external_rest_endpoint stored procedure to call the OpenAI text embedding encoder on a search question. Personally, for my tests I preferred to stay on my own machine, so I used the OpenAI Python module to get the vectors of my questions.
Performance of Similarity Search with SQL Server: SIMD + ColumnStore
As Davide Mauri says, SQL Server has supported columnstore indexes since 2012 and even SIMD AVX-512 since the 2022 version (if your CPU supports it). I would add that batch mode helps a lot with this type of search query.
I decided to measure performance on my laptop, using a simple Python notebook to retrieve the embedding vector of the question Davide also used: « The Foundation Series of Isaac Asimov ». I am a huge fan of this writer 🙂
CREATE function [dbo].[SimilarContentArticles](@vector nvarchar(max))
returns table
as return
with cteVector as
(
    select
        cast([key] as int) as [vector_value_id],
        cast([value] as float) as [vector_value]
    from openjson(@vector)
),
cteSimilar as
(
    select top (50)
        v2.article_id,
        sum(v1.[vector_value] * v2.[vector_value])
            / (sqrt(sum(v1.[vector_value] * v1.[vector_value]))
             * sqrt(sum(v2.[vector_value] * v2.[vector_value]))) as cosine_distance
    from cteVector v1
    inner join dbo.wikipedia_articles_embeddings_contents_vector v2
        on v1.vector_value_id = v2.vector_value_id
    group by v2.article_id
    order by cosine_distance desc
)
select a.id, a.title, a.url, r.cosine_distance
from cteSimilar r
inner join dbo.wikipedia_articles_embeddings a
    on r.article_id = a.id
Davide builds a smart search function that computes the cosine distance between the question vector and ALL the vectors stored in the pivoted corpus table.
When used, the search function is relatively fast on my laptop (0.4 s elapsed) but uses quite a lot of CPU: more than 4.2 s of CPU time. It will be difficult to scale: more documents with many users could be quite challenging and would need a lot of CPU resources.
Simplification of the norm computation
Note: as OpenAI states, the ada-002 embeddings are normalized, which means that sqrt(sum(v1.[vector_value] * v1.[vector_value])) and sqrt(sum(v2.[vector_value] * v2.[vector_value])) are both equal to 1.
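You can check this directly against the stored vectors. A quick sanity-check query of my own (not from Davide's article):

-- the L2 norm of each stored ada-002 vector should be ~1 (up to floating-point rounding)
select top (10)
    article_id,
    sqrt(sum([vector_value] * [vector_value])) as l2_norm
from dbo.wikipedia_articles_embeddings_contents_vector
group by article_id;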
Ask a question and retrieve an OpenAI text-embedding-ada-002 vector from SQL Server (On-Prem)
Davide uses an Azure SQL Database-specific stored procedure, sp_invoke_external_rest_endpoint, to retrieve the OpenAI vector. It is a great stored procedure but it is unfortunately unavailable on my on-prem SQL Server.
One alternative that works anywhere is to run Python inside SQL Server with sp_execute_external_script.
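Note that this requires Machine Learning Services (Python) to be installed, and external scripts to be enabled on the instance; a one-time setup sketch:

-- enable sp_execute_external_script (Machine Learning Services must be installed first)
exec sp_configure 'external scripts enabled', 1;
reconfigure;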
Here is a stored procedure that calls the OpenAI API directly, without having to go through the Azure OpenAI pass-through interface.
CREATE PROCEDURE [dbo].[usp_OpenAI_Embedding]
    @Model VARCHAR(400) = 'text-embedding-ada-002',
    @OpenAIKey NVARCHAR(128),
    @InputText NVARCHAR(MAX) = NULL,
    @ResponseBody NVARCHAR(MAX) = NULL OUTPUT
AS
EXECUTE sp_execute_external_script
    @language = N'Python',
    @script = N'
import openai
openai.api_key = OpenAIKey
text = InputText.replace("\n", " ")
ResponseBody = str(openai.Embedding.create(input = [text], model = Model))
',
    @params = N'@Model VARCHAR(400), @OpenAIKey NVARCHAR(128), @InputText NVARCHAR(MAX), @ResponseBody NVARCHAR(MAX) OUTPUT',
    @Model = @Model,
    @OpenAIKey = @OpenAIKey,
    @InputText = @InputText,
    @ResponseBody = @ResponseBody OUTPUT
The call to the search function
DECLARE @OpenAIKey NVARCHAR(128) = N'Your-OpenAI-Key-Here';
DECLARE @Xq VARCHAR(MAX);
DECLARE @IJSON AS NVARCHAR(MAX);
DECLARE @TopListFinal INT = 10;

EXEC [dbo].[usp_OpenAI_Embedding]
    @Model = N'text-embedding-ada-002',
    @OpenAIKey = @OpenAIKey,
    @InputText = N'The Foundation Serie by Isaac Asimov',
    @ResponseBody = @Xq OUTPUT;

SELECT @Xq as N'@ResponseBody';

select top(@TopListFinal) *
into #SimilarityUsing_All_1536Length_Vectors
from dbo.SimilarContentArticles(json_query(@Xq, '$.data[0].embedding')) as r
order by cosine_distance desc;
No Vector Indexes in SQL Server (at least in the 2022 version): but we can build aggregate/index tables to reduce the dimensionality and make the query lighter
You cannot build your own index types in SQL Server the way you can with PostgreSQL, but we can try to reduce the dimensionality of the vector (and therefore of the table) and see the effect on performance, and also on the quality of the answers.
I decided to create 3 tables that reduce the dimensionality of the vector from 1536 to 64, 32 and finally 16 components. The reduction simply averages the components whose position has the same remainder modulo the target size: with vector_value_id % 64, each of the 64 buckets averages 24 of the original 1536 components.
1536 ==> 64 components :
DROP TABLE IF EXISTS wikipedia_articles_embeddings_contents_vector_index_avg64;

-- create an empty table with the right structure (WHERE 1=0 returns no rows)
SELECT [article_id]
      ,[vector_value_id] % 64 vector_value_id
      ,avg([vector_value]) vector_value
INTO wikipedia_articles_embeddings_contents_vector_index_avg64
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
WHERE 1 = 0
GROUP BY [article_id], [vector_value_id] % 64
ORDER BY article_id, vector_value_id;

CREATE CLUSTERED COLUMNSTORE INDEX i0
    ON wikipedia_articles_embeddings_contents_vector_index_avg64;

INSERT INTO wikipedia_articles_embeddings_contents_vector_index_avg64 WITH (TABLOCK)
SELECT [article_id]
      ,[vector_value_id] % 64 vector_value_id
      ,avg([vector_value]) vector_value
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
GROUP BY [article_id], [vector_value_id] % 64
ORDER BY article_id, vector_value_id
OPTION (MAXDOP 1); -- to preserve order in the columnstore and optimize rowgroup elimination
1536 ==> 32 components
DROP TABLE IF EXISTS wikipedia_articles_embeddings_contents_vector_index_avg32;

SELECT [article_id]
      ,[vector_value_id] % 32 vector_value_id
      ,avg([vector_value]) vector_value
INTO wikipedia_articles_embeddings_contents_vector_index_avg32
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
WHERE 1 = 0
GROUP BY [article_id], [vector_value_id] % 32
ORDER BY article_id, vector_value_id;

CREATE CLUSTERED COLUMNSTORE INDEX i0
    ON wikipedia_articles_embeddings_contents_vector_index_avg32;

INSERT INTO wikipedia_articles_embeddings_contents_vector_index_avg32 WITH (TABLOCK)
SELECT [article_id]
      ,[vector_value_id] % 32 vector_value_id
      ,avg([vector_value]) vector_value
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
GROUP BY [article_id], [vector_value_id] % 32
ORDER BY article_id, vector_value_id
OPTION (MAXDOP 1);
1536 ==> 16 components
DROP TABLE IF EXISTS wikipedia_articles_embeddings_contents_vector_index_avg16;

SELECT [article_id]
      ,[vector_value_id] % 16 vector_value_id
      ,avg([vector_value]) vector_value
INTO wikipedia_articles_embeddings_contents_vector_index_avg16
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
WHERE 1 = 0
GROUP BY [article_id], [vector_value_id] % 16
ORDER BY article_id, vector_value_id;

CREATE CLUSTERED COLUMNSTORE INDEX i0
    ON wikipedia_articles_embeddings_contents_vector_index_avg16;

INSERT INTO wikipedia_articles_embeddings_contents_vector_index_avg16 WITH (TABLOCK)
SELECT [article_id]
      ,[vector_value_id] % 16 vector_value_id
      ,avg([vector_value]) vector_value
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
GROUP BY [article_id], [vector_value_id] % 16
ORDER BY article_id, vector_value_id
OPTION (MAXDOP 1);
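A quick sanity check of my own on the three aggregate tables: each one should contain 25,000 articles times its number of components.

-- expected row counts: 25000 x 64 = 1,600,000 / 25000 x 32 = 800,000 / 25000 x 16 = 400,000
select 'avg64' as tbl, count(*) as nb_rows, count(distinct article_id) as nb_articles
from wikipedia_articles_embeddings_contents_vector_index_avg64
union all
select 'avg32', count(*), count(distinct article_id)
from wikipedia_articles_embeddings_contents_vector_index_avg32
union all
select 'avg16', count(*), count(distinct article_id)
from wikipedia_articles_embeddings_contents_vector_index_avg16;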
After that I need to create 3 functions that use these aggregate tables. To limit the impact on quality, each function first searches the aggregate table to find the top x percent of articles that are « candidates », using the « reduced » cosine distance, and then runs the full semantic search (1536 components) on those candidates only.
To reduce the quality degradation of pre-filtering with a reduced-size vector (64, 32 or 16 components instead of 1536), I use a variable size for the candidate population:
- For the 64-component vector: top 3 percent of the initial population (3% of 25,000)
- For the 32-component vector: top 5 percent
- For the 16-component vector: top 10 percent
As we said earlier, OpenAI's text-embedding-ada-002 vectors are normalized: their norms equal 1, so we can simplify the cosine distance computation by removing the denominator of the division.
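In formula form, the standard cosine similarity is

\[
\cos(u,v) = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2}\,\sqrt{\sum_i v_i^2}}
\]

and when \(\lVert u \rVert = \lVert v \rVert = 1\) it reduces to the dot product alone:

\[
\cos(u,v) = \sum_i u_i v_i
\]

Note that this shortcut only holds for the full 1536-component vectors: the averaged 64/32/16 vectors are no longer normalized, so the pre-filter step of the function below has to keep the full formula.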
Let's see the code of one of the functions; the others are the same except for the TOP x PERCENT value and the aggregate table they use:
CREATE function [dbo].[SimilarContentArticlesAvg64](@vector nvarchar(max))
returns table
as return
with cteVector as
(
    select
        cast([key] as int) as [vector_value_id],
        cast([value] as float) as [vector_value]
    from openjson(@vector)
),
cteVectorAvg64 as
(
    -- reduce the question vector the same way as the aggregate table (1536 -> 64)
    SELECT [vector_value_id] % 64 [vector_value_id], avg([vector_value]) [vector_value]
    FROM cteVector
    GROUP BY [vector_value_id] % 64
),
cteSimilarAvg64 as
(
    -- pre-filter: top 3 percent candidates using the reduced vectors
    -- (full cosine formula here, the averaged vectors are not normalized)
    select top 3 percent
        v2.article_id,
        sum(v1.[vector_value] * v2.[vector_value])
            / (sqrt(sum(v1.[vector_value] * v1.[vector_value]))
             * sqrt(sum(v2.[vector_value] * v2.[vector_value]))) as cosine_distance
    from cteVectorAvg64 v1
    inner join dbo.wikipedia_articles_embeddings_contents_vector_index_avg64 v2
        on v1.vector_value_id = v2.vector_value_id
    group by v2.article_id
    order by cosine_distance desc
),
cteSimilar as
(
    select top (50)
        v2.article_id,
        -- simplified to the numerator only: the denominator = 1 with full-length ada-002 OpenAI vectors
        sum(v1.[vector_value] * v2.[vector_value]) as cosine_distance
    from cteVector v1
    inner join dbo.wikipedia_articles_embeddings_contents_vector v2
        on v1.vector_value_id = v2.vector_value_id
    inner join cteSimilarAvg64 vindex
        on vindex.article_id = v2.article_id
    group by v2.article_id
    order by cosine_distance desc
)
select a.id, a.title, a.url, r.cosine_distance
from cteSimilar r
inner join dbo.wikipedia_articles_embeddings a
    on r.article_id = a.id
Functions Usage
The all-in-one call to the different pieces:
drop table if exists #SimilarityUsing_All_1536Length_Vectors;
drop table if exists #SimilarityUsing_Top02percent_64Length_Vectors;
drop table if exists #SimilarityUsing_Top05percent_32Length_Vectors;
drop table if exists #SimilarityUsing_Top10percent_16Length_Vectors;

DECLARE @OpenAIKey NVARCHAR(128) = N'OpenAI_API_KEY';
DECLARE @Xq VARCHAR(MAX);
DECLARE @IJSON AS NVARCHAR(MAX);
DECLARE @TopListFinal INT = 10;

EXEC [dbo].[usp_OpenAI_Embedding]
    @Model = N'text-embedding-ada-002',
    @OpenAIKey = @OpenAIKey,
    @InputText = N'The foundation series by Isaac Asimov',
    @ResponseBody = @Xq OUTPUT;

SELECT @Xq as N'@ResponseBody';

select top(@TopListFinal) *
into #SimilarityUsing_All_1536Length_Vectors
from dbo.SimilarContentArticles(json_query(@Xq, '$.data[0].embedding')) as r
order by cosine_distance desc;

select top(@TopListFinal) *
into #SimilarityUsing_Top02percent_64Length_Vectors
from [dbo].[SimilarContentArticlesAvg64](json_query(@Xq, '$.data[0].embedding')) as r
order by cosine_distance desc;

-- for the last two calls I force MAXDOP 1 (no parallelism, to reduce CPU usage)
select top(@TopListFinal) *
into #SimilarityUsing_Top05percent_32Length_Vectors
from [dbo].[SimilarContentArticlesAvg32](json_query(@Xq, '$.data[0].embedding')) as r
order by cosine_distance desc
option (maxdop 1);

select top(@TopListFinal) *
into #SimilarityUsing_Top10percent_16Length_Vectors
from [dbo].[SimilarContentArticlesAvg16](json_query(@Xq, '$.data[0].embedding')) as r
order by cosine_distance desc
option (maxdop 1);

select * from #SimilarityUsing_All_1536Length_Vectors order by cosine_distance desc;
select * from #SimilarityUsing_Top02percent_64Length_Vectors order by cosine_distance desc;
select * from #SimilarityUsing_Top05percent_32Length_Vectors order by cosine_distance desc;
select * from #SimilarityUsing_Top10percent_16Length_Vectors order by cosine_distance desc;

-- quality metric: fraction of the full-vector top 10 also found by each pre-filtered search
SELECT 1.0 * count(*) / @TopListFinal v64_in_v1536
from #SimilarityUsing_Top02percent_64Length_Vectors v64
inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v64.id = v1536.id;

SELECT 1.0 * count(*) / @TopListFinal v32_in_v1536
from #SimilarityUsing_Top05percent_32Length_Vectors v32
inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v32.id = v1536.id;

SELECT 1.0 * count(*) / @TopListFinal v16_in_v1536
from #SimilarityUsing_Top10percent_16Length_Vectors v16
inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v16.id = v1536.id;
go
Performance and Quality Results

The « Quality » column in the tables below is the fraction of the full 1536-vector top 10 that each pre-filtered search also returns (computed by the last three SELECTs of the script above).

Performance results without the norms=1 optimization
| Scenario | CPU time (ms) | Elapsed time (ms) | Quality |
|----------|---------------|-------------------|---------|
| Original (1536 components) | 4704 | 492 | 100% |
| Pre-filter with 64-component vectors, top 3% candidates | 984 | 174 | 80% |
| Pre-filter with 32-component vectors, top 5% candidates | 735 | 154 | 60% |
| Pre-filter with 16-component vectors, top 10% candidates | 751 | 179 | 30% |
Performance results with the norms=1 optimization
| Scenario (norm computation of the 1536 vectors removed from the cosine distance: norms=1) | CPU time (ms) | Elapsed time (ms) | Quality |
|--------------------------------------------------------------------------------------------|---------------|-------------------|---------|
| Original (1536 components) | 2188 | 249 | 100% |
| Pre-filter with 64-component vectors, top 3% candidates | 860 | 157 | 80% |
| Pre-filter with 32-component vectors, top 5% candidates | 187 | 193 | 60% |
| Pre-filter with 16-component vectors, top 10% candidates | 235 | 230 | 30% |
Data Results
Conclusion
This exercise was very interesting. I must admit that it took me some time to get Machine Learning Services with Python working on my SQL Server 2022: even if the documentation is better now, you still need to grant Windows permissions manually on 2 files (C:\Program Files\Python310\Lib\site-packages\six.py and C:\Program Files\Python310\Lib\site-packages\typing_extensions.py) that remain out of scope of the install scripts.
On the other hand, semantic/vectorized search is very powerful. You can even search in a language other than English and it keeps working.
Performance can be improved by skipping the norm computation in the cosine distance when the vector norms equal 1. You can also improve performance at the expense of quality by using dimensionality reduction.
This is only a first shot, but I think other, better methods are available to reduce the pressure on the database for each question by « indexing » the data. The indexing would work better with real vector indexes, but in the meantime we can use smart stored procedures and aggregate tables to do the job.
What is also excellent is the ability of the solution (OpenAI's ada-002 embeddings) to handle typos and, even better, multiple languages!