Wednesday, January 21, 2026

What is Late Chunking?

What it does: Runs the entire document through the transformer first, then chunks the token embeddings rather than the raw text.


The problem it solves: Traditional chunking loses long-distance context. Late chunking preserves full document context in each chunk’s embedding.


Conceptual example:


def late_chunk(text: str, chunk_size=512) -> list:
    """Embed full document BEFORE chunking"""

    # Step 1: Embed entire document (8192 tokens max)
    full_doc_token_embeddings = transformer_embed(text)  # Token-level

    # Step 2: Define chunk boundaries
    tokens = tokenize(text)
    chunk_boundaries = range(0, len(tokens), chunk_size)

    # Step 3: Pool token embeddings for each chunk
    chunks_with_embeddings = []
    for start in chunk_boundaries:
        end = start + chunk_size
        chunk_text = detokenize(tokens[start:end])

        # Mean pool token embeddings (preserves full doc context!)
        chunk_embedding = mean_pool(full_doc_token_embeddings[start:end])
        chunks_with_embeddings.append((chunk_text, chunk_embedding))

    return chunks_with_embeddings
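
For something closer to runnable code, the same idea can be sketched with Hugging Face transformers. This is a minimal sketch, not a reference implementation: the model name (jinaai/jina-embeddings-v2-base-en, chosen because it accepts up to 8192 tokens) is only an illustrative assumption, and late_chunk_hf is my own name rather than a library function.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"  # assumption: any long-context embedding model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)
model.eval()

def late_chunk_hf(text: str, chunk_size: int = 512) -> list:
    """Embed the full document once, then pool token embeddings per chunk."""
    # Tokenize once so chunk boundaries line up with the model's token outputs
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        # (seq_len, hidden_dim) token embeddings, contextualized by the whole document
        token_embeddings = model(**encoded).last_hidden_state[0]

    input_ids = encoded["input_ids"][0]
    chunks_with_embeddings = []
    for start in range(0, token_embeddings.shape[0], chunk_size):
        end = start + chunk_size
        chunk_text = tokenizer.decode(input_ids[start:end], skip_special_tokens=True)
        # Mean pooling over tokens that already "saw" the rest of the document
        chunk_embedding = token_embeddings[start:end].mean(dim=0)
        chunks_with_embeddings.append((chunk_text, chunk_embedding))
    return chunks_with_embeddings

In practice, chunk boundaries are usually aligned to sentences or paragraphs rather than fixed token windows; the fixed window above just keeps the sketch close to the conceptual code.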




Late chunking, in the context of embeddings for GenAI (Generative AI), is a strategy for processing large documents or datasets into vector embeddings, particularly in RAG (Retrieval-Augmented Generation) workflows.


Here’s a clear breakdown:


Definition


Late chunking means delaying the splitting of content into smaller pieces (chunks) until after the embedding model has already processed the larger unit of content.


Instead of splitting a large document into chunks before generating embeddings (early chunking), the system first generates embeddings for larger units (such as full documents or sections) and then splits or pools them further downstream only if needed, as the sketch below illustrates.
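
To make the contrast concrete, the sketch below shows the early-chunking baseline; it reuses the tokenizer and model loaded in the sketch above, and early_chunk_hf is again an illustrative name rather than a library function. Each chunk is embedded in isolation, so its embedding never sees text outside its own boundaries.

import torch

def early_chunk_hf(text: str, chunk_size: int = 512) -> list:
    """Split first, then embed each chunk independently (no cross-chunk context)."""
    input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks_with_embeddings = []
    for start in range(0, len(input_ids), chunk_size):
        chunk_text = tokenizer.decode(input_ids[start:start + chunk_size])
        piece = tokenizer(chunk_text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # The model only ever sees this one chunk of text
            token_embeddings = model(**piece).last_hidden_state[0]
        chunks_with_embeddings.append((chunk_text, token_embeddings.mean(dim=0)))
    return chunks_with_embeddings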


Why use Late Chunking?


Preserves context


Early chunking may break semantic context by splitting sentences or paragraphs arbitrarily.


Late chunking allows embeddings to capture larger context, improving similarity searches.


Efficient processing


You can generate embeddings for larger units first and only create smaller chunks if retrieval or indexing requires it, reducing unnecessary computations.


Dynamic retrieval granularity


Allows flexible adjustment of chunk size later depending on how the embeddings will be queried or used in the application.
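
One concrete way to get this flexibility, sketched under the same assumptions (and with the same illustrative model and tokenizer) as the earlier code: run the transformer over the document once, cache its token embeddings, and pool them into chunks of whatever size retrieval needs.

import torch

def document_token_embeddings(text: str):
    """Run the transformer once over the whole document and cache the result."""
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        return encoded["input_ids"][0], model(**encoded).last_hidden_state[0]

def pool_at(input_ids, token_embeddings, chunk_size: int) -> list:
    """Re-chunk at any granularity without re-running the model."""
    chunks = []
    for start in range(0, token_embeddings.shape[0], chunk_size):
        end = start + chunk_size
        chunk_text = tokenizer.decode(input_ids[start:end], skip_special_tokens=True)
        chunks.append((chunk_text, token_embeddings[start:end].mean(dim=0)))
    return chunks

# Usage: a single forward pass can back both a coarse and a fine index.
# ids, embs = document_token_embeddings(document)
# coarse_chunks = pool_at(ids, embs, chunk_size=1024)
# fine_chunks = pool_at(ids, embs, chunk_size=256)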


Comparison to Early Chunking

Feature | Early Chunking | Late Chunking
When text is split | Before embedding | After embedding or during retrieval
Context retention | Lower (may lose semantic links across chunks) | Higher (larger context retained)
Processing efficiency | May generate more embeddings unnecessarily | Can reduce embedding count
Use case | Simple search or small documents | Large documents, long-context GenAI applications


💡 Example Scenario:


A book with 1000 pages is to be used in a RAG application.


Early chunking: Split into 2-page chunks first → 500 embeddings.


Late chunking: Generate one embedding per chapter first (20 chapters → 20 embeddings), then split chapters into smaller chunks later only if needed.


This approach balances context preservation and computational efficiency.
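
A compact sketch of this scenario, reusing the model, tokenizer, and late_chunk_hf function from the sketches above; the chapter list and function names are illustrative assumptions, not part of any particular library.

import torch

def embed_unit(text: str) -> torch.Tensor:
    """One embedding per coarse unit, e.g. per chapter (truncated at the model's context limit)."""
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        return model(**encoded).last_hidden_state[0].mean(dim=0)

def build_chapter_index(chapters: list) -> list:
    """20 chapters -> 20 embeddings up front; no fine-grained chunks yet."""
    return [(chapter, embed_unit(chapter)) for chapter in chapters]

def refine_chapter(chapter_text: str, chunk_size: int = 512) -> list:
    """Split a retrieved chapter into smaller chunks only when it is actually needed."""
    return late_chunk_hf(chapter_text, chunk_size=chunk_size)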
