Introduction

You can find all the C# code samples here: Embeddings GitHub Repository

Before we jump into vector quantization in Azure AI Search, I suggest checking out these earlier posts. They’ll give you the foundation to fully understand today’s topic:

The Challenge of High-Dimensional Vectors

There’s an old saying:

If you don’t know what it’s about, it’s about the money.

In the world of high-dimensional vectors, that couldn’t be more accurate! And if we adjust the saying for vector databases, we might put it this way: “It’s about money, search speed, and search accuracy.”

Let’s move from the saying to the numbers that really matter.

Storage footprint

You'll very often be working with embeddings of 1536 dimensions, so let's use this number in our example. As you already know from the previous posts, an embedding is simply an array of floating-point numbers – a vector in mathematical terms. With that in mind, we can calculate the size of a single vector as: 1536 × 4 bytes = 6144 bytes, which is approximately 6 KB.

The size of a 1536-dimensional vector.

So… for storing 100K vectors we need 600 MB and for 10M we need almost 60 GB. In practice, these numbers are higher because this calculation only accounts for the vectors themselves (the Vector Index).
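If you'd like to see that back-of-the-envelope math as code, here is a tiny C# sketch (nothing Azure-specific, just the arithmetic from above):

// Rough storage estimate for raw (non-quantized) float vectors.
const int dimensions = 1536;
const int bytesPerFloat = sizeof(float);                                          // 4 bytes

long bytesPerVector = dimensions * bytesPerFloat;                                 // 6144 bytes ≈ 6 KB

Console.WriteLine($"1 vector:     {bytesPerVector} bytes");
Console.WriteLine($"100K vectors: {bytesPerVector * 100_000L / 1e6:F0} MB");      // ≈ 600 MB
Console.WriteLine($"10M vectors:  {bytesPerVector * 10_000_000L / 1e9:F0} GB");   // ≈ 60 GB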

In reality, raw vector data resides on high-speed SSDs (Disk Storage), while the vector index needs to stay in RAM for fast lookups. Generally, the smaller the vectors, the better (especially when working with large data sets).

Additional RAM is also needed to store data about the HNSW graph (assuming this is the algorithm being used).

Computation overhead

We have already discussed various vector distance algorithms here, but the key point now is the speed of those calculations. If we use more primitive data types such as int8 or bool instead of float, we can improve the search speed (especially for an array of bool values!).

You’re probably thinking: the faster, the better. But there’s another factor to consider which is search accuracy (we will delve into that topic in a second).

What is Vector Quantization?

Put simply:

Vector quantization is a technique for compressing high-dimensional vectors by mapping them into a smaller set of representative values.

Based on what we’ve already discussed, you can think of it as reducing the size of embeddings while still keeping enough information to perform accurate similarity searches.

Quantization techniques

There are 3 common quantization techniques, each presenting specific trade-offs in terms of storage efficiency, search speed, and search accuracy.

Scalar quantization

Scalar quantization technique shown with a 75% reduction in vector size.

The idea behind scalar quantization is as follows. First, we determine the range of float values in our vector. In this example, I’ve deliberately chosen slightly artificial numbers to make the explanation easier. Once we know the minimum (-1.00) and maximum (1.00) values, we divide the entire range into 256 “buckets” since an int8 can represent 256 (2^8) distinct values. Each bucket corresponds to a specific interval within that range. For instance (values rounded for simplicity!):

  • “bucket” -128: from -1.000 to -0.992
  • “bucket” -127: from -0.992 to -0.984
  • “bucket” 0: from -0.004 to 0.004
  • “bucket” 126: from 0.984 to 0.992
  • “bucket” 127: from 0.992 to 1.000

ℹ️ These buckets can be represented using either the -128/+127 range or the 0/+255 range.

In terms of the range of values, there’s one important detail worth remembering. In practice, many vector databases don’t simply take the raw MIN and MAX values. Instead, they often rely on percentiles to avoid distortion from extreme outliers. For example, Azure AI Search uses the 99th percentile (not configurable) to trim away extreme values. Other systems make this behavior configurable (like the quantile property in Qdrant).
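To make the bucket idea more tangible, here is a minimal C# sketch of scalar quantization. It assumes a plain min/max range with no percentile trimming, so treat it as an illustration of the mechanics rather than what Azure AI Search does internally:

// Minimal scalar quantization sketch: map each float into one of 256 buckets (sbyte).
// Real systems often trim outliers first (e.g. the 99th percentile mentioned above).
static sbyte[] QuantizeScalar(float[] vector, float min, float max)
{
    var quantized = new sbyte[vector.Length];
    float bucketWidth = (max - min) / 256f;                          // 256 buckets for an 8-bit integer

    for (int i = 0; i < vector.Length; i++)
    {
        int bucket = (int)((vector[i] - min) / bucketWidth) - 128;   // shift into the -128..127 range
        quantized[i] = (sbyte)Math.Clamp(bucket, -128, 127);
    }
    return quantized;
}

// Floats in [-1, 1]: 4 bytes per value become 1 byte per value, hence the ~75% reduction.
sbyte[] quantizedVector = QuantizeScalar(new[] { -1.0f, -0.55f, 0.0f, 0.97f }, min: -1.0f, max: 1.0f);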

Binary quantization

Binary quantization technique shown with a 97% reduction in vector size.

In the previous example, we divided the range into 256 distinct “buckets” to compress the original data. With binary quantization, however, we only have 2. A typical way to implement binary quantization (assuming a range from –1.0 to +1.0) is to split the values into two buckets:

  • “bucket” 0: from –1.000 to -0.001
  • “bucket” 1: from 0.000 to 1.000

ℹ️ Binary quantization works especially well when used with embedding models whose values are naturally centered around 0 (for instance OpenAI models).
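A minimal C# sketch of the same idea – every value at or above 0 goes into “bucket” 1, everything below into “bucket” 0 (again, an illustration of the technique, not Azure’s internal implementation):

// Minimal binary quantization sketch: one bit per dimension, threshold at 0.
static byte[] QuantizeBinary(float[] vector)
{
    var bits = new byte[(vector.Length + 7) / 8];        // 1536 dimensions -> 192 bytes
    for (int i = 0; i < vector.Length; i++)
    {
        if (vector[i] >= 0f)
        {
            bits[i / 8] |= (byte)(1 << (i % 8));         // "bucket" 1: set the bit
        }
    }
    return bits;
}

// 1536 floats (6144 bytes) collapse into 1536 bits (192 bytes) – roughly a 97% reduction.
byte[] binaryVector = QuantizeBinary(new[] { -0.7f, 0.1f, 0.0f, -0.2f });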

Product quantization

The 3rd common technique is product quantization. We won’t be discussing it in detail here because this method is not available for vector quantization in Azure AI Search, but in short: product quantization splits a high-dimensional vector into smaller sub-vectors and quantizes each one independently.

How Oversampling Improves Search Accuracy

Let’s clarify one concept that is, in my opinion, crucial when deciding whether to apply vector quantization in your project. We’ve already talked about many benefits of different compression techniques: lower storage requirements, more data fitting into RAM, reduced costs, better performance… it’s so good it almost doesn’t feel real, huh?

…but there’s one thing that takes a hit which is search accuracy.

The idea of oversampling in Azure AI Search vector search shown on the diagram.

Let’s look at a simple example. Imagine our vectors live in a 2D space (instead of a 1536‑dimensional one). For simplicity, we’ll use Euclidean distance (the straight-line distance between two points) to measure similarity in this example. In the diagram, we can see seven vectors:

  • qV is our query vector.
  • V1–V6 are vectors in our index, where V1 is the closest to the query vector and V6 is the furthest.

You applied compression expecting faster query speed… and indeed, the query was fast (…and sure, with six vectors you could compute the distances by hand with a ruler 📏 but hey… it’s just an example 🙂 ). But then something unexpected happened. You queried the index with KNearestNeighborsCount = 3:

new VectorizedQuery(queryVector)
{
    KNearestNeighborsCount = 3,
    Fields = { nameof(AiSearchVectorSearchDocumentModel.Vector) }
}

You naturally expected to get the records linked to vectors V1, V2, and V3. But the search returned V4 instead of V3.

How could that happen?

When you compress vectors, you have to accept that some information is lost. That loss can distort distances (or other similarity metrics) and lead to less accurate results compared to a non-compressed vector search.

So… is there anything we can actually do about it? Fortunately, yes!

The idea is the following. Instead of asking the search engine to return only the TOP K – KNearestNeighborsCount – results (for example, 3), we can ask it something like this (as if it were our good friend):

I know we lost some precision due to compression, but I still care about accuracy… could you please consider more vectors than the value I set in KNearestNeighborsCount when running that fast (initial) search? If I specify 3, please consider 6 – and THEN use the original, non‑compressed vectors to order just those 6 accurately.

Turning the story into a technical step-by-step explanation:

  1. The vector query runs on the compressed vectors: This fast pass may slightly distort distances (e.g., V4 appearing closer than V3).
  2. AI Search returns the top‑K oversampled candidates: With KNearestNeighborsCount = 3 and defaultOversampling = 2, the engine considers 6 vectors instead of 3.
  3. These 6 candidates are rescored using the original, non‑compressed vectors: This second pass restores the true distances between qV, V3, V4, and the rest.
  4. After rescoring, the results are reordered so the most relevant vectors appear first: V3 gets back into the correct TOP 3, even though compression initially pushed it out.

You’ll see in a moment how to configure all of this in Azure AI Search.

Vector Quantization in Azure AI Search

We already know what quantization is and we’ve covered the most common techniques used in practice as well as the idea of oversampling. Now we can shift our attention to how to configure vector quantization in Azure AI Search.

I’m going to use this index definition (used in previous posts too) below as a reference point. It doesn’t include any compression settings yet (notice the "vectorSearch.compressions": [] section at the bottom).

{
  "@odata.etag": "\"0x8DE4CEF0FC970A1\"",
  "name": "vector-search-index",
  "purviewEnabled": false,
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": true,
      "synonymMaps": []
    },
    {
      "name": "Phrase",
      "type": "Edm.String",
      "searchable": false,
      "filterable": false,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "Tags",
      "type": "Collection(Edm.String)",
      "searchable": false,
      "filterable": true,
      "retrievable": true,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "synonymMaps": []
    },
    {
      "name": "Vector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "filterable": false,
      "retrievable": false,
      "stored": true,
      "sortable": false,
      "facetable": false,
      "key": false,
      "dimensions": 1536,
      "vectorSearchProfile": "vector-profile-01",
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "normalizers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "vectorSearch": {
    "algorithms": [
      {
        "name": "hnsw-algorithm",
        "kind": "hnsw",
        "hnswParameters": {
          "metric": "cosine",
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500
        }
      }
    ],
    "profiles": [
      {
        "name": "vector-profile-01",
        "algorithm": "hnsw-algorithm"
      }
    ],
    "vectorizers": [],
    "compressions": []
  }
}

We first need to define a compression method, give it a name, and then associate it with a vector profile.

❗❗❗ Important: You cannot add compression to an existing vector field. Compression requires either a new index or a new vector field.

Let’s start with scalar quantization.

Scalar quantization in Azure AI Search

{
  "compressions": [
    {
      "name": "scalar-quantization-compression",
      "kind": "scalarQuantization",
      "scalarQuantizationParameters": {
        "quantizedDataType": "int8"
      },
      "rescoringOptions": {
        "enableRescoring": true,
        "defaultOversampling": 2,
        "rescoreStorageMethod": "preserveOriginals"
      }
    }
  ]
}

Let’s break this definition down piece by piece:

  • name: scalar-quantization-compression – the identifier we’ll use when linking this compression to a new vector profile.
  • kind: scalarQuantization – the selected compression type, in this case the Scalar Quantization technique.
  • quantizedDataType: int8 – tells the search service to apply scalar quantization using the int8 data type (currently the only supported value).
  • enableRescoring: true – turns on the second pass – the step where Azure AI Search re-evaluates candidates using the original, non‑compressed vectors to restore accuracy.
  • defaultOversampling: 2 – if your query asks for K = 3, oversampling=2 means the engine will actually consider 3 × 2 = 6 candidates in the fast (compressed) search before rescoring them.
  • rescoreStorageMethod: preserveOriginals – Azure AI Search keeps the original vectors around specifically for the rescoring step. That’s how it restores the correct ordering after compression.

Once we have the scalar-quantization-compression defined we can create a new vector profile:

"profiles": [
  {
    "name": "hnsw-with-scalar-compression-profile",
    "algorithm": "hnsw-algorithm",
    "compression": "scalar-quantization-compression"
  }
]

A few additional notes:

Controlling oversampling from the C# code:

var searchOptions = new SearchOptions
{
    VectorSearch = new VectorSearchOptions
    {
        Queries =
        {
            new VectorizedQuery(queryVector)
            {
                KNearestNeighborsCount = topK,
                Fields = { nameof(AiSearchVectorSearchDocumentModel.Vector) },
                Oversampling = 5.2
            },
        }
    }
};

You can also specify oversampling directly in your C# code using VectorSearchOptions. When you do, it overrides the defaultOversampling value defined in the index configuration.

Of course, you can use the Oversampling parameter only when compression is defined for that vector field. If the field isn’t compressed, Azure AI Search will return a 400 – BadRequest error:

400 returned by Azure AI Search when the Oversampling property is set but compression is not configured for the given vector search field.

HNSW vs Exhaustive KNN

Rescoring works only when you use HNSW as your vector search algorithm. Oversampling simply doesn’t apply to Exhaustive KNN, because by definition that algorithm already evaluates every vector in the index. There’s nothing to “oversample”…
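For reference, switching a query to Exhaustive KNN from the C# SDK comes down to setting the Exhaustive flag on the vector query – note that there is no Oversampling to configure here:

// Exhaustive KNN compares the query against every vector in the index,
// so there is no candidate pre-selection to oversample.
var exhaustiveQuery = new VectorizedQuery(queryVector)
{
    KNearestNeighborsCount = 3,
    Exhaustive = true,
    Fields = { nameof(AiSearchVectorSearchDocumentModel.Vector) }
};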

Binary quantization in Azure AI Search

{
  "compressions": [
    {
      "name": "binary-quantization-compression",
      "kind": "binaryQuantization",
      "truncationDimension": 1024,
      "rescoringOptions": {
        "enableRescoring": true,
        "defaultOversampling": 2,
        "rescoreStorageMethod": "discardOriginals"
      }
    }
  ]
}

Some of the fields are already familiar to you, so let’s focus on the remaining fields, such as truncationDimension, along with the discardOriginals option.

  • truncationDimension: 1024 – reduces the vector size from 1536 to 1024 dimensions (it takes the first 1024 values). This setting is supported for the text-embedding-3 “family” of embedding models (and any other embedding models trained using the Matryoshka Representation Learning (MRL) technique). It is applicable to the HNSW algorithm only.
  • rescoreStorageMethod: discardOriginals – this mode skips the original vectors during rescoring and instead computes the dot product of the binary embeddings. The resulting search quality remains high, slightly lower than when using the original vectors, but still strong and reliable.

I owe you an additional explanation about the discardOriginals option. A reasonable question may come to mind: how can we discard the original vectors and still perform rescoring?

First, let’s clarify why you would choose this setting. The primary motivation is saving space and cost. If we configure it to rescore using original vectors, it must maintain a full-precision copy (in Disk Storage) alongside the compressed one.

But this leads to our second question: how can the system calculate a more accurate score if the original data is gone?

It calculates a dot product where it sums the floating-point values of your query only at the positions where the binary vector has a 1 bit. Because the query’s floating-point magnitudes are preserved, this math is much more granular than the initial bit-matching pass.
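Here is a small C# sketch of that calculation (my own illustration of the idea, not Azure’s actual code): the query keeps its full float precision, the document exists only as bits, and the score is the sum of the query values at the “1” positions:

// Rescoring sketch for discardOriginals: full-precision query vs binary document.
static float RescoreAgainstBinary(float[] queryVector, byte[] documentBits)
{
    float score = 0f;
    for (int i = 0; i < queryVector.Length; i++)
    {
        bool bitIsSet = (documentBits[i / 8] & (1 << (i % 8))) != 0;
        if (bitIsSet)
        {
            score += queryVector[i];   // only the query's magnitudes at "1" positions contribute
        }
    }
    return score;
}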

Bit-matching pass… what do I mean, you may ask?

When we have two binary vectors, we can measure their similarity using the Hamming distance, which essentially answers the question: at how many positions do these two vectors differ? If the XOR operation came to mind at this point, then congratulations. As you may also know, the CPU is very happy to execute XOR, and its gratitude is expressed through a significant query speed boost. This is the main reason why search using binary quantization is much faster than search using scalar quantization.

There is one more CPU-native instruction involved in this calculation: POPCOUNT (Population Count). Think of it this way:

  • XOR: find all the bit differences, e.g. 1011 XOR 1101 = 0110
  • POPCOUNT: count the differing bits: popcount(0110) = 2
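In C#, those two instructions map to the ^ operator and BitOperations.PopCount, so a Hamming distance over binary vectors packed into 64-bit words takes only a few lines (again, a sketch rather than Azure’s internal code):

using System.Numerics;

// Hamming distance between two binary vectors packed into 64-bit words.
static int HammingDistance(ulong[] a, ulong[] b)
{
    int distance = 0;
    for (int i = 0; i < a.Length; i++)
    {
        ulong differingBits = a[i] ^ b[i];                   // XOR: mark the positions that differ
        distance += BitOperations.PopCount(differingBits);   // POPCOUNT: count them
    }
    return distance;
}

// The example from above: 1011 XOR 1101 = 0110 -> popcount = 2
int distance = HammingDistance(new ulong[] { 0b1011 }, new ulong[] { 0b1101 });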

Vector Index vs Disk Storage

I believe this is the right moment to explain the difference between the Vector Index and the Disk Storage.

Let’s assume we have ~170K float vectors with 1536 dimensions to be indexed. As we’ve already discussed, that’s roughly 1 GB of data with no compression. Let’s also assume that the remaining (non-vector) data is 10% of the vector data (like id, Phrase, Tags), giving us a total of approximately 1.1 GB.

First of all, let’s consider a scenario where vector quantization in Azure AI Search is not enabled.

Azure AI Search Vector Index and Disk Storage separation shown on a diagram.

As you can see on the left side, the HNSW graph stores all the original vectors. The HNSW graph itself also takes a little space (usually 1-20% of the total vector size), so the total size is the sum of all the vectors plus the HNSW graph. On the right side, in Disk Storage, we can see the same vectors taking up 1 GB, plus 100 MB of the remaining data (Data Retrieval), as well as a serializable copy of what is stored in the vector index (we cannot control this copy directly with any property, only by lowering the size of the vector index itself).

And now you might be thinking… wait a second. Why would I store a copy of all the original vectors in Disk Storage (Data Retrieval) and waste so much space?

The answer is: in most scenarios, storing a copy of these vectors is redundant, but we can control it easily using the stored property.

❗❗❗ Important: you should always set the stored property to false unless:

  • you perform partial document updates against your index (which is not a rare edge case, especially when using a hybrid indexing approach) using the merge or mergeOrUpload methods. Let’s say you update only the Phrase property with a partial update. Behind the scenes, Azure AI Search performs READ > APPLY the partial update > WRITE operations but… it cannot read the Vector field (or any other field marked as stored:false), so the vector is silently erased (see the sketch after this list).
  • you need the raw vector data returned via the retrievable:true property, which is very uncommon because you typically don’t care about the raw vector values in a response.
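To make the first exception concrete, here is a hedged C# sketch of the kind of partial update that triggers it (searchClient and the document id are placeholders; the field names come from the sample index used in this post):

// Partial update that only touches Phrase. With stored:false on the Vector field,
// Azure AI Search cannot read the existing vector back during the merge,
// so the vector ends up silently erased.
var partialUpdate = new SearchDocument
{
    ["id"] = "doc-42",                       // hypothetical document id
    ["Phrase"] = "Updated phrase text"
    // Vector is intentionally not sent here – and cannot be recovered from the index.
};

await searchClient.MergeOrUploadDocumentsAsync(new[] { partialUpdate });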

Let’s assume now that you configured stored as false and at the same time applied Scalar Quantization and enabled rescoring.

Azure AI Search Vector Index and Disk Storage separation shown on a diagram when scalar quantization and rescoring are enabled.

On the left side, we can see that we managed to reduce the space occupied by vectors by 75% thanks to Scalar Quantization (float – 4 bytes > int8 – 1 byte). As a result, instead of ~1GB, the vectors now take ~256MB of space. Now let’s focus on the right side, which might be a bit more challenging to understand.

We enabled rescoring (see enableRescoring: true) and chose Scalar Quantization with rescoreStorageMethod:preserveOriginals. That’s why we see “Original Vectors 1GB” in the picture. However, this is not the same storage space as in the previous picture (which is controlled by the stored property). This storage is used for rescoring, whereas the one we removed (again, the stored property) could not have been used for that purpose even if it existed.

A serializable copy of the ~256MB quantized vectors plus the HNSW graph is copied to Disk Storage as usual.

Let’s consider the last example, where stored is again false and rescoring is enabled, but in addition we use Binary Quantization in conjunction with vector dimension reduction.

Azure AI Search Vector Index and Disk Storage separation shown on a diagram when binary quantization and rescoring is enabled plus vector dimension truncation (MRL).

Looking at the Vector Index, we can see that the initial 1GB was first reduced by 33% due to truncationDimension: 1024, which is 2/3 of the original vector dimension (1536), and then further reduced through Binary Quantization. The final result is ~21MB, which is roughly 2% of the initial vector size. We should, of course, also take the HNSW graph into account.
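As a quick sanity check on those numbers: 1 GB × (1024 / 1536) ≈ 683 MB after truncation, and 683 MB × (1 bit per 32-bit float) ≈ 21 MB after binary quantization – roughly 2% of the initial size.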

Looking at the Disk Storage, we can see that the original vectors disappeared thanks to rescoreStorageMethod: discardOriginals. The fact that they disappeared does not mean there is no rescoring – there is, but it no longer uses the original vectors. I have already explained in this post how the rescoring process works even without access to the original vectors.

It all looks so good that you’re probably already waiting for the answer to the real question: so what’s the catch? There isn’t any… just a few trade-offs!

This brings us to the best practices and trade-offs of vector quantization in Azure AI Search.

Best practices and Trade-offs

We’ve now acquired enough knowledge to consider some best practices of vector quantization in Azure AI Search and discuss the trade-offs.

Use Scalar Quantization when:

  1. You want strong compression (~75%) with almost no quality loss.
  2. You plan to use rescoring, especially with preserveOriginals, to maintain near-baseline accuracy.
  3. You want an easy-to-apply optimization that works well across most models without special requirements.

Use Binary Quantization when:

  1. You want maximum compression (>97%).
  2. Your vectors are high-dimensional, where binary quantization tends to preserve structure better.
  3. You can tolerate a small accuracy drop when using discardOriginals.
  4. You want to maximize search speed (>33% speed gain).
  5. Vectors produced by your embedding model are typically centered around 0 (OpenAI, Cohere, and other models).

Use Dimension Truncation when:

  1. You have (or can adopt) an MRL-compatible embedding model designed for truncation.
  2. You need extreme compression, often <2% of the original size.
  3. Your embeddings have very high dimensionality.

Other:

  • Dimension Truncation should not be the first step in your optimization strategy. It’s better to begin by optimizing your vector search with either Scalar Quantization or Binary Quantization, and then apply this technique on top if you need additional compression.
  • Set stored: false to easily save a significant amount of space (with the 2 common exceptions discussed in this post)
  • If search accuracy is your top priority, always use preserveOriginals. This should maintain full accuracy with no measurable loss.
  • If you don’t need rescoring, simply disable it to achieve maximum query speed assuming you accept a slight reduction in search accuracy.

ℹ️ You’ll find detailed statistics on the combinations of methods covered in this post here. I highly encourage you to check it out.

Summary

Vector quantization in Azure AI Search offers massive RAM savings (>97% for BQ) and lower costs (also due to Disk Storage savings when using discardOriginals option and stored:false). By carefully choosing between Scalar Quantization and Binary Quantization, configuring oversampling, and managing storage options, you can balance efficiency, accuracy, and scalability in production-grade vector search systems.

After walking through this post, I hope you feel confident not only about when to apply vector quantization in Azure AI Search, but also about how oversampling and rescoring help safeguard accuracy and why understanding the difference between the vector index (RAM) and disk storage is so important for both performance and cost.

We also touched on the fundamentals of vector dimension truncation and Matryoshka Representation Learning (MRL), giving you a glimpse of how these techniques fit into the bigger optimization picture.

I hope my explanation of this concept helps you improve that area of your project (especially since vector DB costs are still relatively high).

Thanks for reading this post, and see you in the next one!
