Text Chunking Strategies for RAG: A Comprehensive Guide

2026-03-30
11 min read
Author: Z.SHINCHVEN

Retrieval-Augmented Generation (RAG) has become the dominant pattern for grounding Large Language Models in external knowledge. At the heart of every RAG pipeline lies a deceptively simple question: how do you split your documents into chunks?

The choice of chunking strategy directly impacts retrieval accuracy, answer quality, and computational cost. A poor chunking decision can slice a critical paragraph in half, bury context across fragments, or balloon your embedding bill — all while silently degrading the answers your users receive.

This guide walks through six chunking strategies, ordered from the simplest to the most advanced, along with their trade-offs, practical considerations, and recent benchmark results.

Quick Reference

| Strategy | Contextual Awareness | Computational Cost | Best For |
| --- | --- | --- | --- |
| Fixed-Size Chunking | Low | Very Low | Uniform data, budget-constrained pipelines |
| Recursive Chunking | Medium | Low | General-purpose RAG (recommended starting point) |
| Document-Based Chunking | High | Low | Structured formats (HTML, Markdown, code) |
| Semantic Chunking | High | Medium–High | Topic-diverse documents requiring precise boundaries |
| LLM-Based Chunking | Very High | Very High | High-stakes domains (legal, medical, financial) |
| Late Chunking | Very High | Medium | Long documents where cross-chunk context matters |

Fixed-Size Chunking

Fixed-size chunking is the most straightforward approach. It divides text into chunks of a predetermined number of tokens or characters, regardless of where sentences, paragraphs, or ideas begin and end.

How It Works

  1. Define a fixed chunk size (e.g., 256 or 512 tokens).
  2. Walk through the text and split at every n-th token.
  3. Optionally, apply an overlap window so that adjacent chunks share a portion of content (e.g., 50–100 tokens).
from langchain.text_splitter import CharacterTextSplitter

# With separator="", the splitter falls back to character-level splitting,
# so chunk_size and chunk_overlap are measured in characters here, not tokens.
splitter = CharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separator=""
)
chunks = splitter.split_text(document)
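Stripped of the library, steps 1–3 reduce to a sliding window. The sketch below uses whitespace-delimited words as stand-in tokens; a production pipeline would count tokens with the embedding model's own tokenizer:

```python
# Fixed-size chunking with an overlap window, written from scratch.
# "Tokens" here are whitespace-delimited words, a deliberate simplification.
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    tokens = text.split()
    step = chunk_size - overlap  # each window starts `step` tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Because consecutive windows are offset by `chunk_size - overlap`, the last `overlap` tokens of one chunk reappear at the start of the next, which is exactly the shared-content window described in step 3.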

Trade-Offs

  • Pros: Dead simple to implement, fast to execute, and predictable chunk sizes make downstream token budgeting easy.
  • Cons: Completely ignores content structure. Sentences get cut mid-thought, and semantically related ideas may land in different chunks. The overlap window helps but does not fully solve this problem.

When to Use

Fixed-size chunking is a reasonable choice when your data is relatively uniform (e.g., log files, tabular text, or homogeneous records) or when you are building a quick prototype and want a baseline to benchmark against.

Recursive Chunking

Recursive chunking takes a more structural approach by attempting to split text along natural boundaries before falling back to smaller separators.

How It Works

  1. Start by splitting the text using a primary separator — typically double newlines (\n\n), which correspond to paragraph breaks.
  2. If any resulting chunk exceeds the desired size limit, recursively apply secondary separators (single newlines \n, then sentence-ending periods, then spaces, and finally individual characters) until all chunks fit within the limit.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = splitter.split_text(document)
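To make the fallback in step 2 concrete, here is a dependency-free sketch of the recursion: each separator is tried in order, small pieces are merged back together, and only pieces that still exceed the limit descend to the next, finer separator. (A single over-long token at the last level is returned as-is; this is a simplification, not the library's exact algorithm.)

```python
# Minimal recursive splitter: coarse separators first, finer ones only
# for pieces that are still too big.
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", ". ", " ")):
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = (buffer + sep + piece) if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate          # keep merging small pieces together
        else:
            if buffer:
                chunks.append(buffer)
            buffer = ""
            if len(piece) > chunk_size:
                # piece alone exceeds the limit: recurse with finer separators
                chunks.extend(recursive_split(piece, chunk_size, rest))
            else:
                buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks
```

Note how a paragraph that fits the limit is never touched by the finer separators, which is what keeps recursive chunks more coherent than fixed-size ones.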

Trade-Offs

  • Pros: Respects document structure at multiple granularity levels. Produces cleaner, more coherent chunks than fixed-size splitting. Extremely fast with no model dependencies.
  • Cons: Still a heuristic — the separator hierarchy may not align perfectly with every document format. Very long paragraphs with no internal newlines may still get split awkwardly.

When to Use

Recursive chunking is the recommended default for most RAG applications. A February 2026 benchmark by Vecta across 50 academic papers placed recursive 512-token splitting first at 69% end-to-end accuracy, outperforming more sophisticated methods. Start here, and only move to more advanced strategies if your retrieval metrics demand it.

Document-Based Chunking

Document-based chunking leverages the explicit structural markers that already exist within a document — headings, subheadings, code fences, HTML tags, or Markdown sections — to define chunk boundaries.

How It Works

  1. Parse the document to identify structural elements (e.g., ## headings in Markdown, <h2> tags in HTML, function definitions in code).
  2. Split the document at these natural boundaries.
  3. Each resulting chunk corresponds to a logically complete section.
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(markdown_document)

Trade-Offs

  • Pros: Produces the most contextually coherent chunks for structured documents. Each chunk naturally represents a complete topic or section. Metadata (like heading hierarchy) can be preserved alongside the chunk.
  • Cons: Entirely dependent on the document having well-defined structural markers. Falls apart with plain text, poorly formatted documents, or OCR output. Section sizes can be wildly uneven — a short introduction and a 3,000-word deep-dive section produce very different chunk sizes.

When to Use

This is the optimal strategy when your corpus consists of well-structured formats: technical documentation in Markdown, web pages in HTML, codebases, or any content with consistent heading hierarchies. Pair it with a recursive fallback for oversized sections.
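That pairing can be sketched with the standard library alone: split at Markdown headings first, then fall back to fixed-size windows for any section over the limit. The regex and character-based fallback below are illustrative choices, not the library's implementation:

```python
import re

# Split a Markdown document at heading boundaries, then fall back to
# fixed-size windows for any section that exceeds the size limit.
def markdown_sections(md, max_chars=1000):
    # A lookahead split keeps each "#", "##", or "###" heading with its section.
    parts = re.split(r"(?m)^(?=#{1,3} )", md)
    sections = [p for p in parts if p.strip()]
    chunks = []
    for section in sections:
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Oversized section: naive fixed-size fallback.
            chunks.extend(section[i:i + max_chars]
                          for i in range(0, len(section), max_chars))
    return chunks
```

In a real pipeline the fallback would be a recursive splitter rather than raw character windows, so oversized sections still break at paragraph and sentence boundaries.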

Semantic Chunking

Semantic chunking moves beyond structural heuristics and into meaning-aware splitting. Instead of relying on formatting cues, it uses embeddings to determine where conceptual boundaries lie.

How It Works

  1. Split the text into atomic units (typically sentences).
  2. Generate an embedding vector for each unit.
  3. Compute the cosine distance between the embeddings of consecutive units.
  4. When the distance exceeds a threshold — indicating a significant shift in topic or meaning — insert a chunk boundary.
  5. Group consecutive units that remain below the threshold into a single chunk.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95
)
chunks = splitter.create_documents([document])
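With the embedding model stubbed out, steps 3–5 fit in a few lines of plain Python. In real use the vectors would come from an embedding model, and the threshold would be tuned on your data; both are assumptions here:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Place a boundary wherever consecutive sentence vectors drift apart,
# then group the sentences between boundaries into chunks.
def semantic_chunks(sentences, vectors, threshold=0.5):
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))   # topic shift: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

The percentile-based threshold used by SemanticChunker is a refinement of the same idea: instead of a fixed cutoff, it places boundaries at the largest distances observed in the document.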

Trade-Offs

  • Pros: Each chunk represents a cohesive idea regardless of formatting. Excels at documents that discuss multiple interleaved topics without clear structural markers.
  • Cons: Requires an embedding model call for every sentence, which adds latency and cost. A NAACL 2025 Findings paper found that fixed 200-word chunks matched or beat semantic chunking across retrieval and answer generation tasks, suggesting the computational overhead is not always justified. Fragment size can be unpredictable — the Vecta 2026 benchmark found that semantic chunking produced fragments averaging only 43 tokens, which retrieved cleanly but gave the LLM too little context to generate correct answers.

When to Use

Semantic chunking is most valuable for documents that cover multiple disparate topics in unstructured prose — think research papers, meeting transcripts, or customer support threads where topic shifts are frequent and formatting is inconsistent. Monitor your chunk size distribution closely; set minimum size thresholds to avoid the tiny-fragment problem.

LLM-Based Chunking

LLM-based chunking is the most intelligence-heavy approach. It uses a Large Language Model to read, understand, and segment the text into self-contained propositions or semantic units.

How It Works

  1. Feed the text (or a section of it) to an LLM with a prompt instructing it to identify semantically complete, standalone propositions.
  2. The LLM outputs a list of propositions, each designed to be independently meaningful.
  3. Optionally, a second pass groups related propositions into larger chunks.
from langchain.chains import create_extraction_chain
from langchain_openai import ChatOpenAI

# Any capable chat model works here; temperature 0 keeps extraction deterministic.
llm = ChatOpenAI(temperature=0)

schema = {
    "properties": {
        "proposition": {
            "type": "string",
            "description": "A standalone, semantically complete statement"
        }
    },
    "required": ["proposition"]
}

chain = create_extraction_chain(schema, llm)
propositions = chain.run(document)
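The optional grouping pass in step 3 needs no model call at all. A minimal sketch that greedily packs propositions under a token budget (counting whitespace-delimited words as tokens, an assumption for illustration):

```python
# Greedily pack extracted propositions into chunks under a token budget.
def group_propositions(propositions, max_tokens=256):
    chunks, current, current_len = [], [], 0
    for prop in propositions:
        n = len(prop.split())  # crude whitespace token count
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(prop)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A more elaborate second pass could group by topic similarity instead of size, but a budget-based pass already keeps the expensive LLM work confined to the extraction step.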

Trade-Offs

  • Pros: Produces the highest-quality chunks in terms of semantic completeness and independence. Each proposition can stand alone, making retrieval highly precise. The LLM can resolve coreferences (e.g., replacing "it" with the actual entity name), which dramatically improves downstream retrieval.
  • Cons: By far the most expensive strategy. Every document must be processed through an LLM, which means API costs, latency, rate limits, and context window constraints all come into play. Not practical for real-time ingestion of large corpora.

When to Use

Reserve LLM-based chunking for high-stakes, high-value domains where retrieval precision directly impacts outcomes — legal document analysis, medical knowledge bases, financial compliance corpora. The cost is justified when a wrong or incomplete answer carries real consequences.

Late Chunking

Late chunking, introduced by Jina AI in 2024, inverts the traditional "chunk then embed" pipeline. Instead of splitting text before embedding, it embeds the entire document first and then extracts chunk-level representations from the full-document embeddings.

How It Works

  1. Feed the entire document into a long-context embedding model (e.g., jina-embeddings-v3 supporting up to 8,192 tokens, or newer models supporting 32K+ tokens).
  2. The model produces token-level embeddings where each token's representation is informed by the full document context via the transformer's attention mechanism.
  3. Apply chunking boundaries (using any of the strategies above) to the token sequence.
  4. Pool the token-level embeddings within each chunk boundary (e.g., mean pooling) to produce a single vector per chunk.
import requests

# Using the Jina Embeddings API with late chunking enabled.
# With late_chunking=True, the inputs are treated as consecutive chunks
# of ONE document and embedded with shared, document-wide context.
response = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "jina-embeddings-v3",
        "input": ["First chunk boundary text.", "Second chunk boundary text."],
        "late_chunking": True
    }
)
chunk_vectors = [item["embedding"] for item in response.json()["data"]]
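The pooling in step 4 is simple enough to write by hand. In practice the token-level vectors come from the long-context embedding model; they are stubbed with toy values here for illustration:

```python
# Mean-pool the token-level embeddings inside each chunk boundary
# into a single vector per chunk.
def pool_chunk_embeddings(token_embeddings, boundaries):
    # boundaries: list of (start, end) token index pairs, end exclusive
    chunk_vectors = []
    for start, end in boundaries:
        span = token_embeddings[start:end]
        dim = len(span[0])
        mean = [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
        chunk_vectors.append(mean)
    return chunk_vectors
```

The key point is that the pooling happens *after* the full-document forward pass, so every token vector being averaged already encodes document-wide context.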

Trade-Offs

  • Pros: Each chunk embedding carries contextual information from the entire document, solving the "lost context" problem that plagues all other strategies. A pronoun like "they" in chunk 5 is embedded with knowledge of who "they" refers to from chunk 1. Requires no additional training — works with any long-context embedding model.
  • Cons: The input document must fit within the embedding model's context window. While 32K-token models cover most documents, extremely long texts (books, codebases) still require pre-splitting. Also, this approach is tied to specific embedding model architectures that expose token-level representations.

When to Use

Late chunking is particularly effective for long, narrative documents where context flows across sections — technical manuals, legal contracts, research papers, and documentation where pronoun resolution and cross-referential understanding matter. It pairs well with any boundary-detection strategy (recursive, document-based, or semantic) for determining where to place the chunk splits.

Choosing the Right Strategy

There is no universally best chunking strategy. The right choice depends on your data, your accuracy requirements, and your budget. Here is a practical decision framework:

Start with Recursive Chunking at 400–512 tokens with 10–20% overlap. This is your baseline. Measure end-to-end answer quality, not just retrieval recall.

If your documents are well-structured (Markdown, HTML, code), switch to Document-Based Chunking for cleaner, more meaningful splits.

If retrieval is pulling in irrelevant content despite good recall scores, experiment with Semantic Chunking — but monitor chunk size distribution and set minimum thresholds.

If you are working in a high-stakes domain where every answer must be precise and traceable, evaluate LLM-Based Chunking for your most critical documents.

If context loss across chunks is your primary problem (pronouns losing referents, split arguments, fragmented explanations), adopt Late Chunking to preserve document-wide context in your embeddings.
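As a starting point rather than a rule, the framework above can be encoded as a single lookup function. The condition names and precedence order are illustrative choices, not from any library:

```python
# Encode the decision framework: later, more specific conditions
# take precedence over earlier, general ones.
def pick_strategy(structured=False, topic_shifts=False,
                  high_stakes=False, context_loss=False):
    if context_loss:
        return "late"
    if high_stakes:
        return "llm"
    if topic_shifts:
        return "semantic"
    if structured:
        return "document"
    return "recursive"  # the recommended baseline
```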

Key benchmarking insight: the NAACL 2025 Findings study cited earlier found that on realistic document sets, fixed-size chunking consistently matched or outperformed semantic chunking when measured end-to-end, and a systematic analysis identified a "context cliff" around 2,500 tokens where response quality drops. Always benchmark with your actual data before committing to a complex strategy.

Conclusion

Text chunking is not a solved problem — it is an active area of research where practical trade-offs matter more than theoretical elegance. The landscape continues to evolve with techniques like contextual retrieval (prepending LLM-generated summaries to chunks), cross-granularity indexing (storing the same content at multiple chunk sizes), and agentic chunking (using AI agents to dynamically decide boundaries).

The most important principle remains: measure end-to-end. Retrieval recall alone is misleading. What matters is whether the final generated answer is correct, complete, and grounded — and that depends on chunking, retrieval, and generation working together.
