Introduction

You can find all the C# code samples here: RAG GitHub Repository.

This is the next post in the RAG series, so before reading about Chunking Strategies for RAG you may want to have a look at the previous posts in the series too.

Hey Everyone!

When you start learning about RAG, one of the first challenges that arises is how to chunk your data.

In today’s post I would like to focus on three very common chunking strategies for RAG (and briefly mention the others):

  • Fixed-Size Chunking
  • Semantic Chunking
  • Parent-Child Chunking (a.k.a. Hierarchical Chunking)

I also want to show you how to implement these strategies in C#, because I have not found many chunking examples in C#. Most of them are in Python (including ‘ready to use’ libraries).

But before we start looking at the code, I would like to talk about theory and define what chunking is and why it influences your end results so much.

We will also talk about whether chunking strategies for RAG make any sense, since the context windows of the newest LLMs are so huge that it might be tempting to just “attach all of the documents to a given prompt and send it”.

Let’s start!

What chunking is all about

Imagine that you are asked to summarize a technical book that you have just read. There is one nuance, though: you can use a maximum of 20 words. You do all you can to squeeze as much info as possible into those 20 words, but when you publish it, people start complaining because the summary is so generic that it’s hard to draw any conclusions from it. Let’s assume this book has 8 chapters. What if you could use 20 words to describe each chapter independently? I think we can both agree the latter approach would give a more precise description of that book.

If we think about that summary from a technical perspective, we can say it’s some kind of compression. You convert the original data (the book you read plus all the conclusions and opinions you have formed) into a more compact form.

Writing this blog post is also some kind of compression because it’s just a simplification and summarization of the thoughts in my mind.

And now the universal rule appears: the more information we must compress (think of it as the length of that book) while also being forced to use very “aggressive” compression (20 words for the summary instead of 200), the more information and context is lost.

We could also say that:

  • we haven’t expressed the meaning of the original data well OR
  • we were forced to ignore some important nuances OR
  • the semantic meaning of the original data is not very accurate OR
  • the essence of the original data has not been preserved well

All I want you to remember at this point is the following trade-off. The larger the original data is, the more challenging it is to compress it into a compact format while preserving enough context. But that also means that there should be a certain combination of these two factors that leads to good results.

How is that connected to vectors and chunking?

You can perceive embeddings as a form of compression. But there is a technical catch: an embedding is essentially a mathematical average of semantic meaning. If you create one vector for a whole chapter, you are creating a “semantic average.” If a user asks a very specific question about one tiny detail in that chapter, the “average” vector of the whole chapter might be too far away in the vector space for the search to find it.

So now the most relevant question arises:

If vectorizing/creating embeddings is a form of data compression, then how big should the data we vectorize be?

One vector with 1536 dimensions based on the entire book? No, because we lose the specific details in the “average.” But what about vectorizing each word? That also doesn’t work, because a single word has no context: think about the word “Bank.” Is it a river bank or a financial bank? We need the “Optimal Context Window”: enough text to keep the “signal” (the actual meaning) while removing the “noise” (irrelevant filler) that dilutes the vector.

As you can see, there are many questions, and that is why there are multiple chunking strategies for RAG. The higher the quality of your vectors, the more accurate the search, and therefore the better the answers your RAG solution produces.

Is chunking needed at all?

Ehh, buddy, why are you even writing this long and boring introduction to something that no longer has any practical application in 2026? You may ask…

Chunking was great when LLMs were introduced and the context window was relatively small, like 4K tokens… but now!? A context window with 1M tokens is a standard!

I could paste the entire Lord of the Rings series into a single prompt and then ask a question: “Is the name of the main character Frodo? Answer Yes/No.”

So what is the point of chunking data and then injecting just 3-8 chunks into the final prompt sent to an LLM if I can attach all my 20 PDFs and let the LLM do the rest!

These are all valid remarks, so let’s discuss them.

First of all, if you work with a relatively small data set, then sure… just use the brute-force method. If that works, then hey, you have just saved plenty of time implementing some complicated RAG solution (which I hope won’t be complicated anymore after reading this whole series about RAG).

But then you start uploading more and more documents… the questions you ask become more and more specific, you gradually see a drop in the accuracy of the answers, and the LLM starts to hallucinate. This is often because of the “lost in the middle” phenomenon: when you send too much text at once, the model’s attention mechanism starts to lose focus on the details buried in the middle of your document. Not good, right?

At the same time, you get a task to keep the chat history inside your corporate infra and not use any 3rd-party solution (because which company would be keen to hand its sensitive data to a product that was developed a year ago and which, in 90% of cases, won’t be on the market anymore in 1-3 years?). Now you have a problem: the length of the prompts you send to an LLM has never been your concern, but now it should be, because you are asked to store them.

Length of the prompts, hmm… you forgot that you mainly pay for the tokens you use, right? And it’s not just about the money. In Azure, you have TPM (Tokens Per Minute) limits. If every prompt is huge, you will hit your quota after only a few users, and your app will stop working for everyone else.

Maybe in the future, locally deployed LLMs will mitigate these token limits and costs (but other challenges will arise), but for now, we have to design for the cloud reality. At the same time, users of your app start complaining that responses are generated very slowly. This happens because the model takes much longer to “read” and process a massive prompt before it even starts writing the first word. Users could perhaps tolerate waiting a little longer… but those very specific questions still cannot be answered accurately!

So… are RAG and chunking strategies for RAG dead in 2026? Of course not!

This is exactly the point when chunking strategies for RAG come into play!

The Pros of using RAG

Since we have already established that oversizing the prompt might be an architectural trap in 2026, here are the five primary reasons why a well-chunked RAG pipeline remains the superior choice for enterprise Azure solutions:

  • Precision: Neutralizes “Lost in the Middle” by isolating high-signal chunks and reducing semantic dilution.
  • Throughput: Optimizes Azure TPM (Tokens Per Minute) quota, supporting significantly more concurrent users.
  • Speed: Drastically lowers TTFT (Time To First Token) by minimizing prompt prefill and KV cache overhead.
  • Auditability: Enables deterministic citations and source-grounding for enterprise compliance.
  • Portability: Decouples your data index from the model, so you can easily pivot from Azure OpenAI to local LLMs without re-indexing.

I hope at this point you are more convinced that chunking strategies for RAG are still a relevant topic. So let’s finally start analyzing the three common methods: Fixed-Size Chunking, Semantic Chunking, and Hierarchical Chunking.

Chunking strategies using C#

I propose we start with a baseline method, learn its limitations, and then move on to the more sophisticated ones.

I will be using a markdown file (Data/grounding-data-design.md in the 06_ChunkingStrategies project) downloaded from this GitHub repository, which stores Azure-related documentation as markdown files.

Fixed-Size chunking

How does it work?

This is the most basic approach to chunking. You define a fixed number of characters (or tokens) and split the text accordingly. Because this method is “content-blind,” we usually implement a sliding window or overlap. This means a small portion of the previous chunk is repeated in the next one. The goal is to ensure that if a critical piece of information is cut exactly in half, it exists in its entirety in at least one of the segments. It is a highly predictable, high-performance method, but it completely ignores the natural structure of your sentences or paragraphs.

This is how the core logic looks in my example:

private IEnumerable<string> CreateFixedSizeChunksWithOverlap(
    string text, int chunkSize, double overlapPercentage)
{
    int overlapSize = (int)(chunkSize * overlapPercentage / 100.0);
    int stride = chunkSize - overlapSize;

    if (stride <= 0)
        throw new ArgumentException(
            $"overlapPercentage ({overlapPercentage}%) produces a non-positive stride. Keep it below 100%.",
            nameof(overlapPercentage));

    return ChunkIterator(text, chunkSize, stride);
}

private IEnumerable<string> ChunkIterator(string text, int chunkSize, int stride)
{
    IReadOnlyList<int> encoded = _tokenizer.EncodeToIds(text);
    int[] allTokenIds = encoded as int[] ?? [.. encoded];

    for (int i = 0; i < allTokenIds.Length; i += stride)
    {
        int end = Math.Min(i + chunkSize, allTokenIds.Length);
        string? decoded = _tokenizer.Decode(new ArraySegment<int>(allTokenIds, i, end - i));

        if (!string.IsNullOrEmpty(decoded))
            yield return decoded;

        if (end == allTokenIds.Length)
            yield break;
    }
}

As you can see… it is nothing fancy, since this method is essentially brute force. It is worth remembering the common default values for the chunk size (tokens) and overlap: 512 tokens with a 25% overlap (based on tests performed at Microsoft some time ago). Of course, “best defaults” does not mean these are the best values for your project, but it’s a good starting point at least!
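For completeness, here is a minimal usage sketch. It is not taken from the repository, so treat FixedSizeChunker as a made-up wrapper class that simply exposes the method shown above, and the tokenizer choice as my assumption:

using Microsoft.ML.Tokenizers;

// Tokenizer matching the model you target (assumption: an OpenAI-style embedding model).
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("text-embedding-ada-002");

// Hypothetical wrapper exposing CreateFixedSizeChunksWithOverlap publicly.
var chunker = new FixedSizeChunker(tokenizer);

string markdown = File.ReadAllText("Data/grounding-data-design.md");

// 512 tokens per chunk with a 25% overlap (the defaults mentioned above).
foreach (string chunk in chunker.CreateFixedSizeChunksWithOverlap(markdown, chunkSize: 512, overlapPercentage: 25))
{
    Console.WriteLine($"--- chunk ({tokenizer.CountTokens(chunk)} tokens) ---");
    Console.WriteLine(chunk);
}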

Results of the Fixed-Size chunking:

Fixed-Sized chunking with overlap results

Each chunk has a similar size. You can also see the overlap in action: each subsequent chunk repeats a part of the previous one.

Semantic chunking

I would like to feature the concept of semantic chunking now because it is a natural progression towards defining better boundaries for our text chunks. The primary limitation of Fixed-Size Chunking is, of course, its complete blindness, and this is where semantic chunking helps.

The idea is the following:

  • We chunk our file into some units, e.g., sentences.
  • We create an embedding which captures semantic meaning for the current sentence and for the adjacent sentence.
  • If the similarity exceeds the threshold we specified (e.g., 0.75), we assume these two sentences are very similar, so we combine them and continue the loop over the remaining sentences. We may also enforce a token/word/character limit so that no chunk exceeds a given size.

This is how such a naive (it’s not production ready… not even close) semantic chunking logic could look in C# using text-embedding-ada-002 deployed in Microsoft Foundry:

private async Task<IReadOnlyList<string>> CreateChunksGroupedSemanticallyAsync(
    string text, double similarityThreshold = 0.80, int maxTokensCountPerChunk = 1024)
{
    IReadOnlyList<string> sentences = SplitIntoSentences(text);

    if (sentences.Count <= 1)
        return sentences;

    float[][] embeddings = (await _embeddingClient.GenerateEmbeddingsAsync(sentences))
        .Value
        .Select(e => e.ToFloats().ToArray())
        .ToArray();

    var chunks = new List<string>();
    var currentChunk = new List<string> { sentences[0] };
    int currentTokenCount = _tokenizer.CountTokens(sentences[0]);

    for (int i = 1; i < sentences.Count; i++)
    {
        int sentenceTokens = _tokenizer.CountTokens(sentences[i]);
        float similarity = CosineSimilarity(embeddings[i - 1], embeddings[i]);

        if (similarity < (float)similarityThreshold || currentTokenCount + sentenceTokens > maxTokensCountPerChunk)
        {
            chunks.Add(string.Join(" ", currentChunk));
            currentChunk = [];
            currentTokenCount = 0;
        }

        currentChunk.Add(sentences[i]);
        currentTokenCount += sentenceTokens;
    }

    if (currentChunk.Count > 0)
        chunks.Add(string.Join(" ", currentChunk));

    return chunks;
}
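The snippet above calls two helpers that are not shown: SplitIntoSentences and CosineSimilarity. They are not part of the listing, so here is a minimal sketch of how they could look (a real sentence splitter needs to handle abbreviations, decimal numbers, markdown headings, and many other edge cases):

// Requires: using System.Text.RegularExpressions;

// Naive sentence splitter: breaks on '.', '!' or '?' followed by whitespace.
private static IReadOnlyList<string> SplitIntoSentences(string text) =>
    Regex.Split(text, @"(?<=[.!?])\s+")
        .Select(s => s.Trim())
        .Where(s => s.Length > 0)
        .ToList();

// Cosine similarity between two embedding vectors of equal length.
private static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0f, normA = 0f, normB = 0f;

    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB) + 1e-10f);
}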

In general, I encourage you not to develop your own custom semantic chunker (unless you think it will be beneficial due to the specifics of your project) but to use a battle-tested library. There are so many edge cases to be taken into account that what seems doable at first may very soon turn into a nightmare.

Results of the Semantic Chunking:

Semantic Chunking results

Each chunk has a different length, which is of course expected. I set maxTokensCountPerChunk to 1024 and similarityThreshold to 0.75.

How do I know these are the best settings? I don’t!

In a real app, I would just change these values, run the accuracy evaluation tests, and check. These values may vary depending on various criteria.
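To make that tuning loop a bit more concrete, here is a rough sketch of such a sweep. EvaluateRetrievalAccuracyAsync, documentText, and goldenQuestions are placeholders for your own evaluation harness and test data, not anything from the repository:

// Sweep a few candidate settings and compare retrieval accuracy on a golden question set.
double[] thresholds = [0.70, 0.75, 0.80, 0.85];
int[] maxTokens = [256, 512, 1024];

foreach (double threshold in thresholds)
{
    foreach (int limit in maxTokens)
    {
        IReadOnlyList<string> chunks =
            await CreateChunksGroupedSemanticallyAsync(documentText, threshold, limit);

        // Placeholder: index the chunks and measure accuracy against known question/answer pairs.
        double accuracy = await EvaluateRetrievalAccuracyAsync(chunks, goldenQuestions);

        Console.WriteLine($"threshold={threshold}, maxTokens={limit}, chunks={chunks.Count}, accuracy={accuracy:P1}");
    }
}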

Hierarchical chunking

The last method I would like to discuss in more detail is Hierarchical Chunking, a.k.a. Parent-Child Chunking. It’s an interesting method whose primary goal is to solve the following problem: the chunk length that works best for vectorization (so that the embedding captures the meaning as accurately as possible) does not translate 1:1 into what we want to inject into the final LLM prompt to enrich the context in our RAG pipeline.

In other words: you may use a shorter chunk of text because you know it works best during the search phase, but at the same time this chunk might be too short to provide sufficient prompt enrichment.

Hierarchical Chunking mitigates that challenge by applying the following logic:

  • Child chunk – created with maximum emphasis on the quality of the embedding being generated.
  • Parent chunk – captures more context, contains the child chunk, and is what gets injected into the final LLM prompt.

How does it work? After your search engine (e.g. Azure AI Search) returns the TOP N candidates, instead of injecting the chunks on which these vectors were initially created, you inject the parent chunks, which are usually paragraphs, entire sections, subsections, or, in the case of a very small file, the entire file itself.

There are two basic ways to implement Hierarchical Chunking (a.k.a. Parent-Child Chunking). During indexing, you may:

  • Store the Parent Chunk directly within the search document. This is the fastest option because no additional lookup operations are needed, but the size of your index will be significantly larger (and you may store the same Parent Chunk many times, depending on the implementation). In such a scenario, it is better to set searchable=false for the Parent Chunk field so that it is not used for the full-text BM25 search (when a hybrid search is performed) but only for retrieval (stored=true and retrievable=true). Next question: should we include that Parent Chunk field in a semantic configuration to be analyzed by the cross-encoder? No, because it could pollute the results… but we don’t even need to think too hard about it, because once the field is set to searchable=false it is technically impossible to use it during the L2 reranking phase. A hedged index-definition sketch follows this list.
  • Do not store the Parent Chunk directly, but instead store a ParentChunkId reference and retrieve these larger chunks in a second stage of the search operation. This will be a little slower, but your index size remains almost the same (just one extra field, e.g., of type Edm.Int32, in Azure AI Search).
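To make the first option a bit more tangible, here is a hedged sketch of how such an index could be defined with the Azure.Search.Documents SDK. The field names, vector dimensions, and vector profile name are illustrative (and the VectorSearch profile configuration itself is omitted); none of this is taken from the repository:

using Azure.Search.Documents.Indexes.Models;

var index = new SearchIndex("docs-parent-child")
{
    Fields =
    {
        new SimpleField("id", SearchFieldDataType.String) { IsKey = true },

        // Child chunk: full-text searchable and backed by its own vector field.
        new SearchableField("childChunk"),
        new SearchField("childVector", SearchFieldDataType.Collection(SearchFieldDataType.Single))
        {
            IsSearchable = true,
            VectorSearchDimensions = 1536,
            VectorSearchProfileName = "my-vector-profile"
        },

        // Parent chunk: stored and retrievable only. Because it is not searchable,
        // it cannot pollute BM25 scoring or the L2 semantic reranking phase.
        new SimpleField("parentChunk", SearchFieldDataType.String)
    }
};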

Ok, I hope the concept is clear now, so let’s finally focus on the C# code.

public async Task<IReadOnlyList<(string Parent, string Child)>> CreateParentChildChunksAsync(
    string text,
    double similarityThreshold = 0.75,
    int maxTokensPerParentChunk = 1024,
    int maxTokensPerChildChunk = 256)
{
    IReadOnlyList<string> sentences = SplitIntoSentences(text);

    if (sentences.Count == 0)
        return [];

    float[][] embeddings = (await EmbeddingClient.GenerateEmbeddingsAsync(sentences))
        .Value
        .Select(e => e.ToFloats().ToArray())
        .ToArray();

    // First pass: group sentence indices into parent chunks
    var parentGroups = new List<List<int>>();
    var currentParentIndices = new List<int> { 0 };
    int currentParentTokenCount = Tokenizer.CountTokens(sentences[0]);

    for (int i = 1; i < sentences.Count; i++)
    {
        int sentenceTokens = Tokenizer.CountTokens(sentences[i]);
        float similarity = CosineSimilarity(embeddings[i - 1], embeddings[i]);

        if (similarity < (float)similarityThreshold || currentParentTokenCount + sentenceTokens > maxTokensPerParentChunk)
        {
            parentGroups.Add(currentParentIndices);
            currentParentIndices = [];
            currentParentTokenCount = 0;
        }

        currentParentIndices.Add(i);
        currentParentTokenCount += sentenceTokens;
    }

    if (currentParentIndices.Count > 0)
        parentGroups.Add(currentParentIndices);

    // Second pass: within each parent, sub-divide into child chunks
    var result = new List<(string Parent, string Child)>();

    foreach (var parentIndices in parentGroups)
    {
        string parentText = string.Join(" ", parentIndices.Select(i => sentences[i]));

        var currentChildIndices = new List<int> { parentIndices[0] };
        int currentChildTokenCount = Tokenizer.CountTokens(sentences[parentIndices[0]]);

        for (int j = 1; j < parentIndices.Count; j++)
        {
            int idx = parentIndices[j];
            int prevIdx = parentIndices[j - 1];
            int sentenceTokens = Tokenizer.CountTokens(sentences[idx]);
            float similarity = CosineSimilarity(embeddings[prevIdx], embeddings[idx]);

            if (similarity < (float)similarityThreshold || currentChildTokenCount + sentenceTokens > maxTokensPerChildChunk)
            {
                result.Add((parentText, string.Join(" ", currentChildIndices.Select(i => sentences[i]))));
                currentChildIndices = [];
                currentChildTokenCount = 0;
            }

            currentChildIndices.Add(idx);
            currentChildTokenCount += sentenceTokens;
        }

        if (currentChildIndices.Count > 0)
            result.Add((parentText, string.Join(" ", currentChildIndices.Select(i => sentences[i]))));
    }

    return result;
}

Again… this is a naive implementation to show you how it could look, but it is not production ready. This Parent-Child chunking algorithm is based on the Semantic Chunking idea we already know. It’s fairly simple: we create parent chunks using maxTokensPerParentChunk = 1024, and then within each parent chunk we create child chunks using maxTokensPerChildChunk = 256. Please note that this logic assumes there is one fixed parent chunk with N child chunks inside it. Thanks to that, the number of parent chunks is lower than in a solution that would try to find the best parent for each child chunk independently. The latter is definitely more accurate, but the logic might be more complicated and you would need to store more parent chunks (remember the ParentChunkId?).

Please also note that I use the same similarityThreshold=0.75 for both parent and child chunks for the simplicity of this example, but in a real production implementation, I would recommend keeping two separate similarity thresholds, stricter for child chunks and more “relaxed” for the parent chunks.

Results of the Parent-Child chunking:

Hierarchical Chunking

In the screenshot, you can see the parent chunks in conjunction with their corresponding child chunks. In my example, the maximum number of child chunks is 4, but there are also parent chunks with only a single child chunk. Of course, if you decided to use an independent parent chunk for each child chunk, you would always have a parent-child pair for every child chunk.
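And to close the loop on the retrieval side described earlier (swap the matched child chunks for their parents before building the prompt), here is a rough sketch that assumes the index layout from the earlier example, an existing searchClient, and a questionEmbedding you have already generated:

// Requires: using Azure.Search.Documents; using Azure.Search.Documents.Models;

// Search against the child vectors, but retrieve only the parent chunks.
SearchResults<SearchDocument> results = (await searchClient.SearchAsync<SearchDocument>(
    searchText: null,
    new SearchOptions
    {
        Size = 5,
        Select = { "parentChunk" },
        VectorSearch = new()
        {
            Queries =
            {
                new VectorizedQuery(questionEmbedding)
                {
                    KNearestNeighborsCount = 5,
                    Fields = { "childVector" }
                }
            }
        }
    })).Value;

// De-duplicate parents (several top child chunks may share the same parent)
// and use them as the context injected into the final LLM prompt.
var parentChunks = new List<string>();
await foreach (SearchResult<SearchDocument> result in results.GetResultsAsync())
{
    string parent = (string)result.Document["parentChunk"];
    if (!parentChunks.Contains(parent))
        parentChunks.Add(parent);
}

string context = string.Join("\n\n", parentChunks);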

Other chunking strategies

We have discussed 3 chunking strategies for RAG, but you should know that there are many more. There are even techniques that do not have exact names, or that are a mix of various known methods. Let’s go through a few more quickly, just to know they exist, without delving into too many details.

  • Page/Paragraph/Sentence Boundary – structural markers like page breaks or line endings define the chunk limits. This preserves natural units but often results in inconsistent chunk sizes and ignored token constraints.
  • Recursive – text is split using a hierarchy of delimiters (e.g., \n\n, \n, . ) until a target size is reached. This is the industry standard for maintaining structural integrity, as it only breaks sentences as a last resort (a minimal sketch follows this list).
  • Late Chunking – the entire document is first run through a long-context embedding model, and the resulting token embeddings are then pooled into per-chunk vectors. This ensures each chunk’s mathematical meaning is informed by the global context of the entire file.
  • Contextual Chunking – a document summary or global metadata is prepended to every chunk before vectorization. This grounds small snippets in the broader topic, significantly improving retrieval accuracy for specific queries.
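Since Recursive chunking will come up again in the comparison below, here is a minimal sketch of the idea. It is simplified on purpose: the split drops the delimiters, and a production version (like LangChain’s RecursiveCharacterTextSplitter) would also merge small adjacent pieces back together up to the size limit:

// Minimal recursive splitter: try the coarsest delimiter first; if a piece is still
// too large, retry it with the next, finer delimiter; hard-cut by size as a last resort.
private static readonly string[] Delimiters = ["\n\n", "\n", ". ", " "];

private static IEnumerable<string> SplitRecursively(string text, int maxChars, int delimiterIndex = 0)
{
    if (text.Length <= maxChars)
    {
        yield return text;
        yield break;
    }

    if (delimiterIndex >= Delimiters.Length)
    {
        for (int i = 0; i < text.Length; i += maxChars)
            yield return text.Substring(i, Math.Min(maxChars, text.Length - i));
        yield break;
    }

    foreach (string piece in text.Split(Delimiters[delimiterIndex], StringSplitOptions.RemoveEmptyEntries))
        foreach (string chunk in SplitRecursively(piece, maxChars, delimiterIndex + 1))
            yield return chunk;
}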

I am going to write the next blog post discussing Contextual Chunking in more detail, so stay tuned!

The final comparison

Let’s compare the 3 chunking strategies for RAG that we have discussed today.

Feature | Fixed-Size Chunking | Semantic Chunking | Parent-Child Chunking
Logic Basis | Character or token count | Cosine similarity (or any other metric) of vectors | Multi-tier relationship
Indexing Cost | Baseline | Medium (extra embeddings for splitting) | Highest (complex logic & extra embeddings)
Search Precision | Low (contextual noise) | High | Highest (narrow vector signal)
LLM Context | Fragmented | Cohesive | Rich (large window)
Storage Impact | Low | Low/Medium | Low/Medium or High (depending on where you store the parent chunks)

Which one to choose then? This is my opinion:

  • I would choose Fixed-Size Chunking only when the source data is already highly structured and uniform, or when I need to get a PoC running in a single afternoon.
  • I would choose Semantic Chunking when the “readability” of the chunk is the priority. If I am building a system where users will see the retrieved text directly, I don’t want them to see half-broken sentences or disconnected thoughts. It is the best “middle ground” option.
  • I would choose Parent-Child Chunking for my most critical production workloads. While the indexing logic is more complex, the ability to search against a 200-token “Child” and then provide a 1,000-token “Parent” to the LLM effectively solves the “Semantic Dilution” problem. It ensures that the LLM has enough information to be helpful without the vector search getting “confused” by too much text.

Recursive chunking is also worth exploring as a safe default option but again… it depends on the structure of your documents/source data.

Text Split Skill in Azure AI Search

It is also worth remembering that Azure AI Search offers automatic data chunking via the built-in Microsoft.Skills.Text.SplitSkill. You can use it when you decide to use Integrated Vectorization in Azure AI Search. This skill is not as powerful as Semantic Chunking or Parent-Child Chunking, but it might be sufficient for some scenarios. If you want to stick to the Integrated Vectorization pattern but use a different text-splitting approach, you can use Microsoft.Skills.Custom.WebApiSkill and call custom logic in your API that splits the input data according to your needs. I think that the more customization you want to apply, the higher the chance that you will end up creating a custom data ingestion pipeline. This built-in skill, though, might also be useful when you build a PoC solution (usually by leveraging the ‘Import’ wizard available in Azure AI Search, which creates the appropriate skills automatically behind the scenes).
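For reference, this is roughly how the built-in skill could be declared with the Azure.Search.Documents SDK as part of a skillset. The input/output names (“text”, “textItems”) are the standard ones for this skill, but treat the concrete limits as illustrative:

using Azure.Search.Documents.Indexes.Models;

// Built-in text split skill: splits /document/content into overlapping "pages".
// Note: MaximumPageLength is measured in characters, not tokens.
var splitSkill = new SplitSkill(
    inputs: new[] { new InputFieldMappingEntry("text") { Source = "/document/content" } },
    outputs: new[] { new OutputFieldMappingEntry("textItems") { TargetName = "chunks" } })
{
    Context = "/document",
    TextSplitMode = TextSplitMode.Pages,
    MaximumPageLength = 2000,
    PageOverlapLength = 500 // exposed in newer API versions / SDK releases
};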

Summary

I hope that after reading this blog post you know way more about chunking strategies for RAG, but also why RAG and techniques like fixed-size chunking, semantic chunking, and Parent‑Child chunking are still relevant in 2026.

The purpose of this post was to show you the mechanism of these various techniques and not the perfect implementation. I selected these 3 chunking strategies for RAG deliberately, because the problem/challenge each of them solves was worth discussing in my opinion.

I encourage you to review the C# code yourself, but for your production workloads I also recommend using battle-tested libraries unless… you want something extra adjusted to your very specific needs.

If you like the post, then I am very glad.

Thanks a lot for reading, and see you in the next one!

P.S. the next post will be about the Contextual Chunking and metadata enrichment.
