Introduction

You can find all the C# code samples here: RAG GitHub Repository.

This is the next post in the RAG series, so before reading about Contextual Retrieval for RAG it is worth getting familiar with the previous posts in the series too.

Hey!

It is all about appropriate context when it comes to the quality of the answers your LLM-powered application provides. There is even an entire discipline, called Context Engineering, that emphasises the importance of that process.

The rule is simple: the more relevant and accurate the context you paste into the LLM prompt, the better the answer. As easy as that.

In this blog post, I would like to discuss Contextual Retrieval for RAG, a technique you should be familiar with because it has been shown to improve the quality of results.

Let’s jump straight to it!

The problem

When you build a RAG pipeline, one of the first challenges you usually encounter is data chunking. It directly translates to the quality of the search results. You usually start with Fixed-Size with Overlap chunking, and then, depending on the scenario, you stay with that approach or move to more advanced strategies like Semantic Chunking or Recursive Chunking.

But there is still one challenge to be solved: how to find the right balance between keeping a single chunk short (so it stays “sharp”) while at the same time passing enough context to the LLM.

One way to mitigate that challenge is using Parent-Child (Hierarchical) chunking, where short child chunks are used to retain maximum search accuracy and precision, while the parent chunk, which is larger, is injected into the LLM prompt. The parent is not used during the search operation, so it’s like creating a non-clustered index in SQL Server with an INCLUDE statement for the parent field.
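
To make that analogy concrete, a hypothetical index model for Parent-Child chunking could look like the sketch below; the class and field names are illustrative, not taken from the repository.

// Hypothetical model for Parent-Child (Hierarchical) chunking: only the child chunk is
// searchable/vectorized, while the larger parent text is stored and returned to the LLM,
// similar in spirit to an INCLUDEd column in a SQL Server non-clustered index.
public class ParentChildDocument
{
    public string Id { get; set; } = Guid.NewGuid().ToString();

    // Short, "sharp" chunk used for BM25 and vector search.
    public string ChildChunk { get; set; } = "";
    public float[] ChildChunkVector { get; set; } = Array.Empty<float>();

    // Larger surrounding context, not searched, only injected into the LLM prompt after retrieval.
    public string ParentChunk { get; set; } = "";
}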

But with parent-child chunking, the child chunk itself does not carry the semantic meaning of the parent chunk.

The main issue that arises because of that is that a given child chunk may lack the needed context. What does that mean? Let me explain it based on the following example.

Contextual Retrieval for RAG, The problem and The Solution

Imagine a large HR PDF that contains various HR policies.

Document Header: Remote Work Policy – Warsaw Office (PL) – Effective 2026
Introduction: This policy applies strictly to full-time staff based in Poland.
[Several pages of text] …
Section 4.2 (The Chunk): Employees are eligible for a £500 annual reimbursement to cover home office equipment, provided they have completed their 3-month probation period.

In a standard RAG pipeline, the system breaks the document into small pieces. One of those pieces is just the text from Section 4.2.

  • The Chunk in the Database: “Employees are eligible for a £500 annual reimbursement to cover home office equipment, provided they have completed their 3-month probation period.”
  • The User Query: Does the New York office get a home office reimbursement?

Why it fails:

  1. Missing Context: The chunk itself mentions “Employees,” but it doesn’t mention “Warsaw” or “PL”.
  2. The Result: Because the embedding for “home office reimbursement” matches the chunk perfectly, the RAG system retrieves it. The LLM sees the text and confidently (but incorrectly) tells the New York employee: “Yes, employees get a £500 reimbursement after 3 months.” The result is vague at best and hallucinated at worst.

This is what I meant by saying that context is often lost. The chunk itself may sometimes be missing critical data.

The concept

In Contextual RAG, we don’t just embed the chunk. We prepend additional context to that specific chunk BEFORE creating an embedding.

The “Contextualized” Chunk

Before a single chunk is converted into a vector, an LLM generates a one-sentence “situational awareness” prefix for the chunk. The new combined text may look like:

“This chunk is part of the Warsaw Office Remote Work Policy (PL). Section 4.2: Employees are eligible for a £500 annual reimbursement to cover home office equipment…”

Why it succeeds:

  • Better Retrieval: When the New York employee asks their question, the vector for “New York” will no longer align strongly with this chunk, because the text now explicitly mentions “Warsaw” and “PL”.
  • Accurate Generation: Even if the chunk is retrieved, the LLM now sees the “Warsaw” prefix and can correctly answer: “The £500 reimbursement is specifically for the Warsaw office; please check the New York handbook for local benefits.”

Contextual Retrieval

Once we know the motivation for using the Contextual Retrieval for RAG pattern and the mechanics of it, let’s move towards a real implementation. One of the most important elements of that pattern is the prompt structure. The simplest and most generic prompt you can use is something along the lines of:

### System Instructions
You are a retrieval-augmentation specialist. Your goal is to prepend a 1-sentence situational context to a text chunk.
Focus on: 
1. Who/What/Where (Entities).
2. Document Subject (The 'Global' context).
3. Critical IDs or Dates.

### Document
{{WHOLE_DOCUMENT}}

### Chunk
{{CHUNK_CONTENT}}

### Output Requirement
Provide ONLY the 1-sentence context. Do not include 'The context is...' or any preamble. 
Goal: Improve keyword and vector match for search.
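
For completeness, here is a minimal sketch of how such a prompt could be assembled in C#. The GetContextEnrichmentPrompt helper used later in this post is assumed to look roughly like this; the exact wording lives in the repository. Note that the constant parts (instructions and the whole document) come first, which matters for prompt caching, discussed below.

// A sketch of the prompt assembly (assumed shape of GetContextEnrichmentPrompt).
// Keeping the constant parts first maximizes the shared prefix across all chunks of a document.
private string GetContextEnrichmentPrompt(string documentContent, string chunkContent) =>
    $"""
    ### System Instructions
    You are a retrieval-augmentation specialist. Your goal is to prepend a 1-sentence situational context to a text chunk.
    Focus on:
    1. Who/What/Where (Entities).
    2. Document Subject (The 'Global' context).
    3. Critical IDs or Dates.

    ### Document
    {documentContent}

    ### Chunk
    {chunkContent}

    ### Output Requirement
    Provide ONLY the 1-sentence context. Do not include 'The context is...' or any preamble.
    Goal: Improve keyword and vector match for search.
    """;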

At the same time, please remember that the more domain context you can provide within such a prompt, the better. The LLM will then capture the essence of the document in reference to that particular chunk even more effectively.

Once we know what such a prompt could look like, let’s try to visualize the entire process of Contextual Retrieval for RAG step by step.

Contextual Enrichment Process: Step by Step
  1. Chunk the source document using the chunking strategy that best fits the needs of your project.
  2. For each chunk, invoke an LLM to “anchor” that particular chunk in the broader context of the source document.
  3. Here is THE KEY POINT: instead of creating a vector based on the original, context-lacking chunk, you generate the vector from the enriched chunk. Naturally, you also store that enriched chunk in the search index.
    • If you use Hybrid Search (with BM25), this enrichment effectively acts as a natural keyword expansion.
    • If you use Hybrid Search + Semantic Search (e.g. in Azure AI Search), the Cross‑Encoder also benefits from the additional context, allowing it to rerank results more effectively in the L2 phase of the retrieval pipeline.

As you can see, contextual retrieval affects both stages of a typical search setup (L1 + L2, read more about it here), which is Hybrid Search (BM25 + Vector) + Semantic Reranker (Cross-Encoder).

  • Vector Search – each vector carries more meaning by being enriched with that additional context
  • BM25 – the enriched chunk carries more keywords, including the ones which might be game changers for full-text search
  • Semantic Reranker – while comparing the user query with the TOP N documents from the L1 phase (max. 50 in Azure AI Search), it can leverage the additional context embedded into the enriched chunk to calculate relevancy even better

This is all great, you may say, but… if each of the enriched chunks is created by analyzing its content in reference to the document content, that means sending large prompts N times, which translates into huge token usage. That would be true… if there were no prompt caching mechanism.

Prompt Caching

If there are 1 000 chunks created based on a single document, am I really paying for 1 000 LLM calls just to prepend one sentence?

Not necessarily! It’s a perfect moment to clarify how prompt caching works.

I use the gpt-4.1-mini model deployed in Microsoft Foundry, so I will refer to how that prompt caching works for Azure OpenAI models, but this technique is available in other LLM providers too.

Do you remember the structure of that prompt from the previous chapter? It starts with some instructions, then we inject the document content, and then a single chunk. If we analyze the difference between such prompts, i.e. what changes and what stays the same, the outcome is the following:

  • Instructions (constant)
  • Document Content (constant)
  • Chunk (volatile)

Now, when we generate these 1 000 chunks based on a specific document, the very first request does not leverage any prompt caching, but the 2nd one and all the subsequent ones do. Thanks to that, the same input tokens are not processed again, and hence the token usage is dramatically lower than it would be without caching. It improves the speed of response generation too, so you get lower cost plus reduced latency.

There are a few technical details to be aware of:

  • it works only for prompts with a minimum length of 1 024 tokens
  • the precision and granularity of that caching is 128 tokens, which means that the following number of tokens can be cached: 1 024 + (1 × 128), 1 024 + (2 × 128), etc.
  • you can check how many tokens were cached by analyzing a response
  • a single difference within the first 1 024 tokens results in a cache miss
  • prompt caches are usually cleared every 5-10 min
  • you don’t have to enable it explicitly, it is enabled by default

The final conclusion? Knowing that such a mechanism exists, structure your prompts so that repetitive content is always placed at the beginning.
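
A tiny illustration of that rule, using hypothetical variable names: the first template keeps the constant parts (instructions + document) at the front, so every call for the same document shares the same token prefix, while the second one breaks the cache as soon as the volatile chunk appears.

// Cache-friendly: instructions + whole document form a constant prefix, only the chunk changes.
var cacheFriendlyPrompt = $"{instructions}\n### Document\n{documentContent}\n### Chunk\n{chunkContent}";

// Cache-breaking: the volatile chunk appears early, so the shared prefix ends right after the instructions.
var cacheBreakingPrompt = $"{instructions}\n### Chunk\n{chunkContent}\n### Document\n{documentContent}";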

Now, once we know enough theory about contextual retrieval and prompt caching, let’s see how it can be implemented in C#.

A real C# example

using Azure.AI.OpenAI;
using Azure.Identity;
using Microsoft.ML.Tokenizers;
using OpenAI.Chat;
using OpenAI.Embeddings;

public class ContextualRetrievalExample
{
    private readonly ChatClient _chatClient;
    private readonly EmbeddingClient _embeddingClient;
    private readonly Tokenizer _tokenizer;

    public ContextualRetrievalExample()
    {
        var openAiClient = new AzureOpenAIClient(
            new Uri(Environment.GetEnvironmentVariable("AZURE_OPEN_AI_CLIENT_URI")!),
            new DefaultAzureCredential());

        _chatClient = openAiClient.GetChatClient(Environment.GetEnvironmentVariable("AZURE_OPEN_AI_EMBEDDING_CHAT_CLIENT_DEPLOYMENT_NAME")!);
        _embeddingClient = openAiClient.GetEmbeddingClient(Environment.GetEnvironmentVariable("AZURE_OPEN_AI_EMBEDDING_CLIENT_DEPLOYMENT_NAME")!);
        _tokenizer = TiktokenTokenizer.CreateForModel("text-embedding-ada-002");
    }
}

Let’s start with the classes that are needed. ChatClient is used to interact with an LLM deployed in Microsoft Foundry, EmbeddingClient is used to generate an embedding based on an enriched chunk, and Tokenizer is used to chunk data using Fixed-Size Chunking with Overlap (chunkSize: 512, overlap: 25%). The first two classes come from the Azure.AI.OpenAI NuGet package, whereas to use the third one and that specific tokenizer (text-embedding-ada-002), two NuGet packages are needed: Microsoft.ML.Tokenizers and Microsoft.ML.Tokenizers.Data.Cl100kBase. You can also see that, in order to establish a secure connection to models deployed in Microsoft Foundry, I do not use any API keys but rely on Entra authentication instead (read more about it here) and use DefaultAzureCredential (Azure.Identity NuGet).
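
Chunking itself happens in a CreateChunks helper, called in the next snippet. A minimal sketch of Fixed-Size Chunking with Overlap on top of that tokenizer could look like the following; the real implementation in the repository may differ in details.

// A sketch of Fixed-Size Chunking with Overlap (e.g. chunkSize: 512 tokens, overlap: 25%).
// Tokens are encoded once, sliced into overlapping windows, and decoded back into text chunks.
private List<string> CreateChunks(string text, int chunkSize, int overlapPercentage)
{
    var tokenIds = _tokenizer.EncodeToIds(text);
    var step = chunkSize - (chunkSize * overlapPercentage / 100); // 512 - 128 = 384 tokens per step
    var chunks = new List<string>();

    for (var start = 0; start < tokenIds.Count; start += step)
    {
        var window = tokenIds.Skip(start).Take(chunkSize);
        chunks.Add(_tokenizer.Decode(window));

        if (start + chunkSize >= tokenIds.Count)
        {
            break;
        }
    }

    return chunks;
}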

The core logic for Contextual Retrieval (the data ingestion phase) is the following:

var chunks = CreateChunks(markdown, chunkSize: 512, overlapPercentage: 25);

foreach (var chunk in chunks)
{
    var context = await GetContext(markdown, chunk);

    // Prepend the generated context sentence to the raw chunk: this is the core of the Contextual Retrieval pattern.
    // The model-generated sentence situates the chunk within the broader document, making it self-contained for retrieval.
    var enrichedChunk = new EnrichedChunk(context.Value, chunk);

    // Embed the enriched text (context + chunk) so the vector captures the full situational meaning,
    // not just the isolated chunk's semantics.
    var embedding = (await _embeddingClient.GenerateEmbeddingAsync($"{enrichedChunk.Context} \n {enrichedChunk.Chunk}")).Value.ToFloats();

    // Store the enriched text and its vector. The enriched form benefits both retrieval paths:
    // vector search (semantic similarity) and BM25 full-text search (keyword overlap).
    var searchDocument = new SearchDocumentModel()
    {
        id = Guid.NewGuid().ToString(),
        EnrichedChunk = $"{enrichedChunk.Context} \n {enrichedChunk.Chunk}",
        EnrichedChunkVector = embedding.ToArray()
    };
}

Once the chunks are generated, for each chunk we get that additional context, which is the essence of contextual retrieval. This is what the GetContext method looks like:

private async Task<GeneratedContext> GetContext(string documentContent, string chunkContent)
{
    var prompt = GetContextEnrichmentPrompt(documentContent, chunkContent);

    // One LLM call per chunk; thanks to prompt caching the constant prefix (instructions + document) is reused.
    ChatCompletion chatCompletion = await _chatClient.CompleteChatAsync(new UserChatMessage(prompt));

    // Return the generated context sentence together with the number of cached input tokens for diagnostics.
    return new GeneratedContext(chatCompletion.Content[0].Text ?? "", chatCompletion.Usage.InputTokenDetails.CachedTokenCount);
}

You have already seen the structure of the prompt, so it is no surprise that I pass documentContent (the constant part) and chunkContent (the volatile part) to the GetContextEnrichmentPrompt method. You may also notice that I save not only the result but also chatCompletion.Usage.InputTokenDetails.CachedTokenCount, and I will explain in a second why.

Then it’s just a matter of joining the context and the content of an individual chunk, and creating an embedding based on that.

The only remaining step is pushing that data into a search index, e.g. in Azure AI Search, and performing the search operation.
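
A minimal sketch of that upload step is shown below, assuming a SearchClient from the Azure.Search.Documents NuGet package pointing at an index whose fields match SearchDocumentModel; the endpoint variable, the index name, and the searchDocuments collection (the documents built in the loop above) are placeholders.

// Hypothetical upload step: endpoint, index name, and searchDocuments are illustrative.
var searchClient = new SearchClient(
    new Uri(Environment.GetEnvironmentVariable("AZURE_AI_SEARCH_ENDPOINT")!),
    "contextual-retrieval-index",
    new DefaultAzureCredential());

// Push the enriched chunks and their vectors into the index in one batch.
await searchClient.UploadDocumentsAsync(searchDocuments);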

If you are interested in the details of how to perform a pure vector search, or a hybrid search, or a hybrid search + semantic reranking (all in Azure AI Search) then you can read these linked articles.

Let’s run the app now (you can find the markdown file I use in the 07_ContextualRetrieval project, in the Data/remote_work_policy_pl.md file).

This is what the first chunk and the generated context look like:

The first chunk for Contextual Retrieval, no caching

Chunk 1 = Context + ‘Raw Chunk’. You can see the Context which was generated for that particular chunk. At the very bottom you can also see Cached tokens: 0.

Now, let’s analyze the 2nd chunk.

The second chunk for Contextual Retrieval, with caching

We can draw 2 conclusions:

  • Context is not exactly the same as for the 1st chunk, which is expected. We invoke a prompt for each chunk so that the context is adjusted to that specific chunk. Of course, it will look very similar but not exactly the same, which is the desired behavior.
  • Cached (input) tokens = 1664

Let’s analyze the number of the cached input tokens step by step to better understand prompt caching in Microsoft Foundry based on a real example.

Let’s use the OpenAI tokenizer to verify the number of tokens each prompt contains, but let’s deliberately skip the part which changes (the volatile part) and focus only on what remains the same for all the chunks of a given document.

### System Instructions
You are a retrieval-augmentation specialist. Your goal is to prepend a 1-sentence situational context to a text chunk.
Focus on: 
1. Who/What/Where (Entities).
2. Document Subject (The 'Global' context).
3. Critical IDs or Dates.
4. Versions/References

### Document
{documentContent}

### Chunk

The very first character after ### Chunk is of course the exact place where each prompt differs.

Prompt Caching, OpenAI tokenizer result

We can see 1747 tokens. So why is the number of cached input tokens equal to 1664?

It’s because, as I mentioned before, prompt caching in Microsoft Foundry works in a way that the first 1024 tokens are cached (the minimum value), and then the cache extends in increments of 128 tokens, hence 1664 = 1024 + 5 × 128. The next value which would be captured by the prompt caching mechanism is 1792.

There is no magic! It’s pure math 🙂 and it works fully deterministically.
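
Expressed as a quick sanity check in C# (a back-of-the-envelope calculation, not an official API):

// The cache covers the first 1024 tokens plus as many full 128-token blocks
// of the remaining constant prefix as fit.
var constantPrefixTokens = 1747;
var cachedTokens = constantPrefixTokens < 1024
    ? 0
    : 1024 + ((constantPrefixTokens - 1024) / 128) * 128; // 1024 + 5 * 128 = 1664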

The risks of Contextual Retrieval

I can imagine you feel encouraged to explore the contextual retrieval pattern in conjunction with prompt caching in your project, but before you do, I would like to mention some dangers to keep in mind.

  • Vectors may get “blurred” by adding the same or almost the same prefix to each of them. In other words, they may lose their sharpness.
  • The Chunk Content-to-Context Ratio: If your chunk is 100 words and your “situational context” is 50 words, 33% of your vector weight is now almost identical for every chunk in that document. If you aren’t careful, the search might return 10 chunks from the same document because they all share that high-weight “Warsaw Office” prefix, even if the actual content isn’t what the user needs.

Think of it as adding a pinch of salt to a tomato soup. You want to add just enough of it, because otherwise the soup will be inedible. Also keep in mind that when you add too much salt to every single bowl, you lose the unique flavor of the tomatoes. If your contextual prefix is too dominant, the vector database can no longer distinguish one chunk from another, because they all “taste” like the document header rather than their own specific content.

There are also some potential implications on the BM25 algorithm:

  • Keyword Expansion: By adding additional context to every chunk, you are effectively performing keyword expansion. While this can help retrieval, it can also make the keyword search less effective. Since the same “contextual keywords” (like “Warsaw” or “Remote Work Policy”) now appear in hundreds of chunks, their Document Frequency increases significantly. In the BM25 algorithm, this causes the IDF (Inverse Document Frequency) for these terms to drop. Essentially, these words become “less unique” to the search engine. If a user searches for “Warsaw,” the engine might not give these chunks the high score you expect because the word is now seen as a common term rather than a specific discriminator.
  • The “Dilution” Risk (Field Length Normalization): BM25 rewards brevity. It assumes that if a keyword matches in a short chunk, that chunk is highly relevant. When we prepend a contextual prefix, we are making the “field” longer. If the context is too wordy, we risk “diluting” the original keywords. The search engine might see our highly relevant 100 word chunk as “fluffier” than it actually is because 30% of it is now repetitive metadata, potentially causing it to rank lower than a “sharper,” raw chunk.

These dangers may not materialize at all when using contextual retrieval, but keep in mind two techniques to mitigate them, just in case:

  • you can store a raw (not enriched) chunk in a separate field in Azure AI Search and use that field for BM25 search, while for Vector Search you would use an embedding created based on an enriched chunk
  • you can assign more weight to the Vector Search by leveraging the Weight property on the vector query inside VectorSearchOptions (see the sketch below)
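
Here is a sketch of both mitigations in one query, assuming an index that stores a separate RawChunk field next to the enriched one, a queryEmbedding generated from the user question with the same EmbeddingClient, and a recent Azure.Search.Documents version that exposes Weight on vector queries:

// BM25 runs only against the raw, non-enriched chunk, while the vector query targets the enriched
// vector and gets a higher weight in the hybrid score. Field names and the weight value are illustrative.
var options = new SearchOptions
{
    SearchFields = { "RawChunk" },
    VectorSearch = new VectorSearchOptions
    {
        Queries =
        {
            new VectorizedQuery(queryEmbedding)
            {
                KNearestNeighborsCount = 50,
                Fields = { "EnrichedChunkVector" },
                Weight = 2.0f
            }
        }
    }
};

var response = await searchClient.SearchAsync<SearchDocumentModel>("home office reimbursement", options);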

Metadata enrichment

While discussing the Contextual Retrieval for RAG pattern, it’s worth remembering that in certain scenarios we can achieve much better results by simply using metadata filters. Sticking to the example of an HR document that describes rules for a specific office (e.g., Warsaw), it would be sufficient to have a Location field defined in the Azure AI Search index and then check a given user’s location (via an ordinary query to SQL Database or Cosmos DB) and apply a filter like Location eq 'Warsaw'.
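
For illustration, a hypothetical filter applied on top of any of those search configurations could look like this; the Location value comes from an ordinary lookup against the user’s profile, not from the LLM:

// SearchFilter.Create takes care of quoting and escaping the OData filter value.
var userLocation = "Warsaw";

var filterOptions = new SearchOptions
{
    Filter = SearchFilter.Create($"Location eq {userLocation}")
};

var filteredResults = await searchClient.SearchAsync<SearchDocumentModel>("home office reimbursement", filterOptions);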

That’s it! Problem solved. It’s faster, cheaper, and more reliable, so please don’t treat Contextual Retrieval for RAG as a holy grail but rather as a tool. If you see that simple metadata filtering is possible to implement, then just do it. You can mix it with other search configurations like hybrid search or pure vector search.

Please also remember that the easiest situation is when you can pull some metadata based on user context or in another simple way, for example, by knowing that a user is asking a question about a specific product from your shop, like P123_456.

But what if, while invoking the query, you cannot pull this additional metadata, or during the data ingestion pipeline you have no way to easily tag documents with additional metadata either?

Then you can use an LLM in both stages to classify the content and apply metadata tags. However, that would definitely be more expensive and more resource- and time-consuming, and the metadata tags produced by the LLM become a double-edged sword. Why? Because if the metadata tags aren’t accurate and you decide to use filters, you may filter out the desired results.

So one more time… the more metadata you can attach by leveraging additional data from your existing data sources and the more metadata you can pull easily (skipping any AI techniques) on-the-fly when a user invokes a query, the better.

Summary

My goal was to give you as much useful information about Contextual Retrieval for RAG as possible. I wanted to focus on the WHY and HOW, but I also decided that this topic is a perfect moment to introduce the concept of prompt caching in Microsoft Foundry (using OpenAI models).

The search phase for Contextual Retrieval is a topic which has already been covered in detail in other posts, so I deliberately skipped it and focused on the data ingestion phase.

I think that, considering all the pros and cons, the Contextual Retrieval pattern is definitely worth exploring, and having an understanding of various nuances can help you fine-tune your search pipeline even better.

Thanks a lot for reading.

See you in the next post!
