Some time ago I read a blog article by Davide Mauri, a senior Microsoft Azure engineer. The article interested me because the "semantic search" natively available since SQL Server 2016 is not widely used and, honestly, does not bring much value compared to a full-text index.

Davide Mauri's article was much more interesting. It explains how to use OpenAI's 1536-component vectors with SQL Server to find similarities between a search term and the documents in your database.

You must first encode your documents (or parts of your documents). For this article the corpus was 25,000 Wikipedia articles, each with a title, a URL and a content text.
The encoder model was text-embedding-ada-002 from OpenAI: it returns a vector of 1536 floating-point values for each article.

SQL Server has neither a vector nor an array datatype, so you need to normalize each vector into a key-value table if you want to exploit vector data. The final key-value table stores the 1536-component vectors of the 25,000 Wikipedia articles in a 38.4M-row (25,000 × 1536) table.
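As a sketch of that normalization, each embedding is pivoted into one (article_id, vector_value_id, vector_value) row per component; the article id and the short 4-component vector below are made-up illustration data:

```python
# Sketch: pivot an embedding vector into key-value rows, mirroring the layout
# of the wikipedia_articles_embeddings_contents_vector table.
def pivot_vector(article_id, vector):
    """Yield one (article_id, vector_value_id, vector_value) row per component."""
    return [(article_id, i, v) for i, v in enumerate(vector)]

# Toy data: a 4-component vector produces 4 rows;
# a real ada-002 vector produces 1536 rows per article.
rows = pivot_vector(42, [0.1, -0.2, 0.3, 0.05])
```

This is exactly what OPENJSON does on the SQL side: the JSON array index becomes the key and the component becomes the value.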

Davide uses Azure SQL Database with the sp_invoke_external_rest_endpoint stored procedure to call the OpenAI text embedding encoder on a search question. Personally, for my tests I preferred to stay on my machine, so I used the OpenAI Python module to get the vectors of my questions.

Initial data: 25,000 Wikipedia articles and their vectors of 1536 components, stored in a 25,000-row table (with an nvarchar(max) column for the vectors)

Normalized data (pivoted vector components) as a key-value table of 38.4M rows

Performance of Similarity Search with SQL Server: SIMD + ColumnStore

As Davide Mauri notes, SQL Server has supported ColumnStore indexes since 2012 and even SIMD AVX-512 since the 2022 version (if your CPU allows it). I would add that batch mode helps this type of search query a lot.
I decided to measure performance on my laptop, using a simple Python notebook to retrieve the embedding vector of the question Davide also used: "The Foundation Series of Isaac Asimov". I am a huge fan of this writer 🙂

CREATE function [dbo].[SimilarContentArticles](@vector nvarchar(max))
returns table
as
return with cteVector as
(
    select 
        cast([key] as int) as [vector_value_id],
        cast([value] as float) as [vector_value]
    from 
        openjson(@vector)
),
cteSimilar as
(
select top (50)
    v2.article_id, 
    sum(v1.[vector_value] * v2.[vector_value]) / 
        (
            sqrt(sum(v1.[vector_value] * v1.[vector_value])) 
            * 
            sqrt(sum(v2.[vector_value] * v2.[vector_value]))
        ) as cosine_distance
from 
    cteVector v1
inner join 
    dbo.wikipedia_articles_embeddings_contents_vector v2 on v1.vector_value_id = v2.vector_value_id
group by
    v2.article_id
order by
    cosine_distance desc
)
select 
    a.id,
    a.title,
    a.url,
    r.cosine_distance
from 
    cteSimilar r
inner join 
    dbo.wikipedia_articles_embeddings a on r.article_id = a.id

Davide built a smart search function that computes the cosine distance between the question vector and ALL the vectors stored in the pivoted corpus table.

When used, the search function is relatively fast on my laptop (0.4 s elapsed) but uses quite a lot of CPU: 4.2 s. It will be difficult to scale: more documents and many concurrent users would be quite challenging and would require lots of CPU resources.

Simplification of the norms computation

Note: as OpenAI states, the ada-002 embeddings are normalized, which means that sqrt(sum(v1.[vector_value] * v1.[vector_value])) and sqrt(sum(v2.[vector_value] * v2.[vector_value])) are both equal to 1.
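A quick check of why this matters, in plain Python: when both vectors have norm 1, the denominator of the cosine formula is 1, so the whole expression collapses to a plain dot product (toy 2-component unit vectors for illustration):

```python
import math

def cosine_similarity(a, b):
    """Full cosine formula, as in the original SQL function."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy unit-norm vectors: 0.6² + 0.8² = 1
a = [0.6, 0.8]
b = [0.8, 0.6]

# For unit vectors the cosine similarity equals the plain dot product,
# so the two sqrt(...) factors can simply be dropped from the SQL.
dot = sum(x * y for x, y in zip(a, b))
```

Here `cosine_similarity(a, b)` and `dot` are identical, which is exactly the simplification applied later to the SQL query.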

Ask a question and retrieve an OpenAI text-embedding-ada-002 vector from SQL Server (On-Prem)

Davide uses an Azure SQL Database-specific stored procedure, sp_invoke_external_rest_endpoint, to retrieve the OpenAI vector. It is a great stored procedure but unfortunately unavailable on my on-prem SQL Server.

One alternative that works anywhere is to run Python inside SQL Server with sp_execute_external_script.

Here is a procedure that calls the OpenAI API directly, without having to go through the Azure OpenAI pass-through interface.

CREATE PROCEDURE [dbo].[usp_OpenAI_Embedding]
    @Model VARCHAR(400) = 'text-embedding-ada-002',
    @OpenAIKey NVARCHAR(128),
    @InputText NVARCHAR(MAX) = NULL,
    @ResponseBody NVARCHAR(MAX) = NULL OUTPUT
AS
    EXECUTE sp_execute_external_script @language = N'Python',
    @script = N'
import openai
openai.api_key = OpenAIKey
text = InputText.replace("\n", " ")
ResponseBody = str(openai.Embedding.create(input = [text], model=Model))
    ',
        @params = N'@Model VARCHAR(400),
                    @OpenAIKey NVARCHAR(128),
                    @InputText NVARCHAR(MAX),
                    @ResponseBody NVARCHAR(MAX) OUTPUT',
        
        @Model = @Model,
        @OpenAIKey = @OpenAIKey,
        @InputText = @InputText,
        @ResponseBody = @ResponseBody OUTPUT
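The @ResponseBody output parameter holds the raw OpenAI JSON response, from which json_query(@Xq, '$.data[0].embedding') later extracts the vector. The same extraction, sketched client-side in Python against a mock response (the embedding is truncated to 3 values for illustration; a real ada-002 response carries 1536):

```python
import json

# Mock of the OpenAI embeddings response shape (values truncated for illustration)
response_body = '''
{
  "object": "list",
  "data": [{"object": "embedding", "index": 0,
            "embedding": [0.0023, -0.0094, 0.0157]}],
  "model": "text-embedding-ada-002"
}
'''

# Python equivalent of json_query(@Xq, '$.data[0].embedding')
embedding = json.loads(response_body)["data"][0]["embedding"]
```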

 

The call to the search function

DECLARE @OpenAIKey NVARCHAR(128)=N'Your-OpenAI-Key-Here';

DECLARE @Xq VARCHAR(MAX);
DECLARE @IJSON AS NVARCHAR(MAX);
DECLARE @TopListFinal INT= 10;


EXEC	[dbo].[usp_OpenAI_Embedding]
        @Model = N'text-embedding-ada-002',
        @OpenAIKey = @OpenAIKey ,
        @InputText = N'The Foundation Serie by Isaac Asimov',
        @ResponseBody = @Xq OUTPUT;

SELECT	@Xq as N'@ResponseBody';


select top(@TopListFinal) * into #SimilarityUsing_All_1536Length_Vectors from dbo.SimilarContentArticles(json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc;

A portion of the execution plan shows that the 38M rows are quickly joined with the 1536 rows of the "search question", but the aggregation (the cosine computation) is not free.

No vector indexes in SQL Server (at least in the 2022 version): but we can build aggregate/index tables to reduce the dimensionality and make the query lighter

You cannot build your own index types in SQL Server the way you can with PostgreSQL, but we can try to reduce the dimensionality of the vectors (and therefore of the table) and see the effect on performance and on the quality of the answers.

I decided to create 3 tables that reduce the dimensionality of the vectors from 1536 to 64, 32 and finally 16 components.

1536 ==> 64 components :

 

DROP TABLE IF EXISTS wikipedia_articles_embeddings_contents_vector_index_avg64


SELECT [article_id]
      ,[vector_value_id]%64 vector_value_id
      ,avg([vector_value]) vector_value
INTO wikipedia_articles_embeddings_contents_vector_index_avg64
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
WHERE 1=0
GROUP BY
  [article_id]
  ,[vector_value_id]%64
ORDER BY
article_id,vector_value_id;

CREATE CLUSTERED COLUMNSTORE INDEX i0 on wikipedia_articles_embeddings_contents_vector_index_avg64;

INSERT INTO wikipedia_articles_embeddings_contents_vector_index_avg64 WITH(TABLOCK)
SELECT [article_id]
      ,[vector_value_id]%64 vector_value_id
      ,avg([vector_value]) vector_value
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
GROUP BY
  [article_id]
  ,[vector_value_id]%64
ORDER BY
article_id,vector_value_id
OPTION (MAXDOP 1); -- to preserve order in the columnstore and optimize row group elimination

1536 ==> 32 components

 

DROP TABLE IF EXISTS wikipedia_articles_embeddings_contents_vector_index_avg32


SELECT [article_id]
      ,[vector_value_id]%32 vector_value_id
      ,avg([vector_value]) vector_value
INTO wikipedia_articles_embeddings_contents_vector_index_avg32
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
WHERE 1=0
GROUP BY
  [article_id]
  ,[vector_value_id]%32
ORDER BY
article_id,vector_value_id;

CREATE CLUSTERED COLUMNSTORE INDEX i0 on wikipedia_articles_embeddings_contents_vector_index_avg32;

INSERT INTO wikipedia_articles_embeddings_contents_vector_index_avg32 WITH(TABLOCK)
SELECT [article_id]
      ,[vector_value_id]%32 vector_value_id
      ,avg([vector_value]) vector_value
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
GROUP BY
  [article_id]
  ,[vector_value_id]%32
ORDER BY
article_id,vector_value_id
OPTION (MAXDOP 1);

1536 ==> 16 components

 

DROP TABLE IF EXISTS wikipedia_articles_embeddings_contents_vector_index_avg16


SELECT [article_id]
      ,[vector_value_id]%16 vector_value_id
      ,avg([vector_value]) vector_value
INTO wikipedia_articles_embeddings_contents_vector_index_avg16
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
WHERE 1=0
GROUP BY
  [article_id]
  ,[vector_value_id]%16
ORDER BY
article_id,vector_value_id;

CREATE CLUSTERED COLUMNSTORE INDEX i0 on wikipedia_articles_embeddings_contents_vector_index_avg16;

INSERT INTO wikipedia_articles_embeddings_contents_vector_index_avg16 WITH(TABLOCK)
SELECT [article_id]
      ,[vector_value_id]%16 vector_value_id
      ,avg([vector_value]) vector_value
FROM [dbo].[wikipedia_articles_embeddings_contents_vector]
GROUP BY
  [article_id]
  ,[vector_value_id]%16
ORDER BY
article_id,vector_value_id
OPTION (MAXDOP 1);

After that I needed to create 3 functions that use the aggregate tables. To limit the cost, each function first searches the aggregate table to find the top x percent of articles that are "candidates", using the "reduced" cosine distance, and then runs the full semantic search (1536 components) on those candidates only.

To limit the quality degradation of pre-filtering with a reduced-size vector (64, 32 or 16 components instead of 1536), I use a variable size for the candidate population:

  • For the 64-component vector: top 3 percent of the initial population (3% of 25,000)
  • For the 32-component vector: top 5 percent
  • For the 16-component vector: top 10 percent
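The reduction itself is the same modulo-bucket average as the `vector_value_id % 64` GROUP BY in the SQL above; a Python sketch with a toy 8-component vector reduced to 4 buckets:

```python
def reduce_vector(vector, n_buckets):
    """Average the components whose index falls in the same bucket (index % n_buckets),
    mirroring the GROUP BY vector_value_id % 64 aggregate tables."""
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for i, v in enumerate(vector):
        sums[i % n_buckets] += v
        counts[i % n_buckets] += 1
    return [s / c for s, c in zip(sums, counts)]

# Toy example: 8 components -> 4 buckets.
# Bucket 0 averages indexes 0 and 4: (1.0 + 5.0) / 2 = 3.0, and so on.
reduced = reduce_vector([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 4)
```

The same function applied to a 1536-component vector with 64 buckets gives the 64-component "index" vector; each bucket then averages 24 of the original components.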

As said earlier, OpenAI's text-embedding-ada-002 vectors are normalized: their norms equal 1, so we can simplify the cosine distance computation by removing the denominator of the division.

Let's look at the code of one function; the others are the same (except for the top x percent).

CREATE   function [dbo].[SimilarContentArticlesAvg64](@vector nvarchar(max))
returns table
as
return with cteVector as
(
    select 
        cast([key] as int) as [vector_value_id],
        cast([value] as float) as [vector_value]
    from 
        openjson(@vector)
),
cteVectorAvg64 as
(
SELECT [vector_value_id]%64 [vector_value_id],
avg([vector_value]) [vector_value]
FROM cteVector
GROUP BY [vector_value_id]%64
),
cteSimilarAvg64 as
(
select top 3 percent
    v2.article_id, 
    sum(v1.[vector_value] * v2.[vector_value]) / (sqrt(sum(v1.[vector_value] *v1.[vector_value]))*sqrt(sum(v2.[vector_value] *v2.[vector_value]))) as cosine_distance
from 
    cteVectorAvg64 v1
inner join 
    dbo.wikipedia_articles_embeddings_contents_vector_index_avg64 v2 on v1.vector_value_id = v2.vector_value_id
group by
    v2.article_id
order by
    cosine_distance desc
),
cteSimilar as
(
select top (50)
    v2.article_id, 
    sum(v1.[vector_value] * v2.[vector_value]) as cosine_distance -- simplification possible for the denominator (=1 with full length ada-002 openAI vector)
from 
    cteVector v1 
    inner join dbo.wikipedia_articles_embeddings_contents_vector v2 on v1.vector_value_id = v2.vector_value_id
    inner join cteSimilarAvg64 vindex on (vindex.article_id=v2.article_id)

group by
    v2.article_id
order by
    cosine_distance desc
)
select 
    a.id,
    a.title,
    a.url,
    r.cosine_distance
from 
    cteSimilar r
inner join 
    dbo.wikipedia_articles_embeddings a on r.article_id = a.id

Functions Usage

The all-in-one call to the different pieces:

drop table if exists #SimilarityUsing_All_1536Length_Vectors;
drop table if exists #SimilarityUsing_Top02percent_64Length_Vectors;
drop table if exists #SimilarityUsing_Top05percent_32Length_Vectors;
drop table if exists #SimilarityUsing_Top10percent_16Length_Vectors;


DECLARE @OpenAIKey NVARCHAR(128)=N'OpenAI_API_KEY';

DECLARE @Xq VARCHAR(MAX);
DECLARE @IJSON AS NVARCHAR(MAX);
DECLARE @TopListFinal INT= 10;


EXEC	[dbo].[usp_OpenAI_Embedding]
        @Model = N'text-embedding-ada-002',
        @OpenAIKey = @OpenAIKey ,
        @InputText = N'The foundation series by Isaac Asimov',
        @ResponseBody = @Xq OUTPUT;

SELECT	@Xq as N'@ResponseBody';


select top(@TopListFinal) * into #SimilarityUsing_All_1536Length_Vectors from dbo.SimilarContentArticles(json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc;
select top(@TopListFinal) * into #SimilarityUsing_Top02percent_64Length_Vectors from [dbo].[SimilarContentArticlesAvg64](json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc;

-- For the last two calls I force maxdop=1 (no parallelism, to reduce CPU usage)
select top(@TopListFinal) * into #SimilarityUsing_Top05percent_32Length_Vectors from [dbo].[SimilarContentArticlesAvg32](json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc option (maxdop 1);
select top(@TopListFinal) * into #SimilarityUsing_Top10percent_16Length_Vectors from [dbo].[SimilarContentArticlesAvg16](json_query(@Xq, '$.data[0].embedding')) as r order by cosine_distance desc option (maxdop 1);

select * from #SimilarityUsing_All_1536Length_Vectors        order by cosine_distance desc;
select * from #SimilarityUsing_Top02percent_64Length_Vectors order by cosine_distance desc;
select * from #SimilarityUsing_Top05percent_32Length_Vectors order by cosine_distance desc;
select * from #SimilarityUsing_Top10percent_16Length_Vectors order by cosine_distance desc;

SELECT 1.0*count(*)/@TopListFinal v64_in_v1536 from #SimilarityUsing_Top02percent_64Length_Vectors v64 inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v64.id=v1536.id
SELECT 1.0*count(*)/@TopListFinal v32_in_v1536 from #SimilarityUsing_Top05percent_32Length_Vectors v32 inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v32.id=v1536.id
SELECT 1.0*count(*)/@TopListFinal v16_in_v1536 from #SimilarityUsing_Top10percent_16Length_Vectors v16 inner join #SimilarityUsing_All_1536Length_Vectors v1536 on v16.id=v1536.id
go
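The three final SELECTs compute the quality metric: the fraction of the top-10 ids from the full 1536-component search that the pre-filtered search also returns. In Python terms (toy id lists for illustration):

```python
def quality(approx_ids, exact_ids):
    """Fraction of the exact top-N ids that the approximate search also found,
    as computed by the count(*) / @TopListFinal queries above."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy data: top 10 ids from the full 1536-component search...
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# ...and top 10 ids from a pre-filtered search that missed ids 9 and 10
approx = [1, 2, 3, 4, 5, 6, 7, 8, 13, 21]

score = quality(approx, exact)  # 8 shared ids out of 10 -> 0.8
```

A score of 0.8 corresponds to the 80% quality reported below for the 64-component prefilter.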

Performance and Quality Results

Performance results without the norms=1 optimization

Scenario                                                     CPU time (ms)   Elapsed time (ms)   Quality
Original (1536 components)                                   4704            492                 100%
Prefilter with 64-component vectors, 3 percent candidates    984             174                 80%
Prefilter with 32-component vectors, 5 percent candidates    735             154                 60%
Prefilter with 16-component vectors, 10 percent candidates   751             179                 30%

Performance results with the norms=1 optimization (norm computation of the 1536-component vectors removed from the cosine distance)

Scenario                                                     CPU time (ms)   Elapsed time (ms)   Quality
Original (1536 components)                                   2188            249                 100%
Prefilter with 64-component vectors, 3 percent candidates    860             157                 80%
Prefilter with 32-component vectors, 5 percent candidates    187             193                 60%
Prefilter with 16-component vectors, 10 percent candidates   235             230                 30%

Data results

Pre-search with 64-component vectors, then search with 1536-component vectors on the 3 percent of pre-selected candidates
Pre-search with 32-component vectors, then search with 1536-component vectors on the 5 percent of pre-selected candidates
Pre-search with 16-component vectors, then search with 1536-component vectors on the 10 percent of pre-selected candidates

Conclusion

This exercise was very interesting. I must admit that it took me some time to get Machine Learning Services with Python working with my SQL Server 2022: even though the documentation has improved, you still need to grant Windows permissions manually on 2 files (C:\Program Files\Python310\Lib\site-packages\six.py and C:\Program Files\Python310\Lib\site-packages\typing_extensions.py) that remain out of scope of the install scripts.

On the other hand, semantic/vectorized search is very powerful. You can even search in a language other than English and it still works.
Performance can be improved by skipping the norm computations in the cosine distance when the vector norms equal 1. You can also improve performance at the expense of quality by using dimensionality reduction.

This is only a first shot, but I think other and better methods exist to reduce the pressure on the database for each question by "indexing" the data. The technique would be better with real vector indexes, but meanwhile we can use smart stored procedures and aggregate tables to do the job.

What is also excellent is the ability of the solution (the OpenAI ada-002 embedding) to handle typos and, even better, multiple languages!
