Engineering Notes

Thoughts and Ideas on AI by Muthukrishnan

Adding Context to Chunks for Better Retrieval with Contextual RAG

31 Dec 2025

Standard RAG has a fundamental problem: chunks lose their context. When you split a document into pieces, each chunk becomes isolated. A sentence like “The company increased revenue by 15%” is meaningless without knowing which company, which year, and what the previous revenue was.

Contextual RAG solves this by enriching each chunk with surrounding context before embedding. Anthropic’s research shows this technique, combined with BM25 and reranking, can reduce retrieval failures by up to 67%.

The Problem with Standard Chunking

Consider this document:

# Q3 2025 Financial Report

## Executive Summary
Revenue grew 15% year-over-year to $4.2 billion.

## Regional Performance
### North America
The region exceeded targets with $2.1 billion in sales.

### Europe
Growth slowed to 8% due to currency headwinds.

Standard chunking might create:

Chunk 1: "Revenue grew 15% year-over-year to $4.2 billion."
Chunk 2: "The region exceeded targets with $2.1 billion in sales."
Chunk 3: "Growth slowed to 8% due to currency headwinds."

Now if a user asks “How did the European region perform?”, Chunk 3 might not be retrieved because it doesn’t mention “Europe”—that context is in the header above.

Contextual RAG Solution

Before embedding, we prepend context to each chunk:

Chunk 1: "This is from the Executive Summary section of the Q3 2025 Financial Report. Revenue grew 15% year-over-year to $4.2 billion."

Chunk 2: "This is from the North America subsection of Regional Performance in the Q3 2025 Financial Report. The region exceeded targets with $2.1 billion in sales."

Chunk 3: "This is from the Europe subsection of Regional Performance in the Q3 2025 Financial Report. Growth slowed to 8% due to currency headwinds."

Now “Europe” appears in Chunk 3, and retrieval succeeds.

Implementation

Step 1: Generate Context with an LLM

Use a language model to generate a brief context description for each chunk:

from openai import OpenAI

client = OpenAI()

def generate_context(document: str, chunk: str) -> str:
    """Generate contextual description for a chunk."""
    prompt = f"""
    <document>
    {document[:3000]}  # First part of document for context
    </document>

    Here is a chunk from the document:
    <chunk>
    {chunk}
    </chunk>

    Provide a brief (1-2 sentence) description of what this chunk is about
    and where it fits in the document. This will be prepended to the chunk
    for better search retrieval.
    """

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )

    return response.choices[0].message.content.strip()
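
As a quick sanity check, you can call this on the sample report (the snippet below assumes it is saved as quarterly_report.md; the exact wording of the description will vary by model):

document = open("quarterly_report.md").read()
chunk = "Growth slowed to 8% due to currency headwinds."

# Returns a 1-2 sentence description, e.g. pointing at the Europe subsection of the report
print(generate_context(document, chunk))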

Step 2: Contextual Chunking Pipeline

from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_contextual_chunks(document: str, chunk_size: int = 500) -> list[dict]:
    """Split document and add context to each chunk."""

    # Standard chunking
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=50
    )
    raw_chunks = splitter.split_text(document)

    contextual_chunks = []
    for i, chunk in enumerate(raw_chunks):
        # Generate context
        context = generate_context(document, chunk)

        # Create contextualized chunk
        contextual_chunk = f"{context}\n\n{chunk}"

        contextual_chunks.append({
            "id": f"chunk_{i}",
            "original": chunk,
            "context": context,
            "contextual_text": contextual_chunk
        })

    return contextual_chunks

Step 3: Embed and Store

Embed the contextualized text, not the original:

import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chroma_client = chromadb.Client()
collection = chroma_client.create_collection("contextual_docs")

def index_contextual_chunks(chunks: list[dict]):
    """Index contextualized chunks in vector store."""

    for chunk in chunks:
        # Embed the contextual text
        embedding = embedder.encode(chunk["contextual_text"]).tolist()

        collection.add(
            ids=[chunk["id"]],
            embeddings=[embedding],
            documents=[chunk["original"]],  # Store original for display
            metadatas=[{"context": chunk["context"]}]
        )

Step 4: Retrieval

At query time, embed the query and search against the contextualized index. You can also optionally contextualize the query itself; a sketch of that follows the function below:

def contextual_search(query: str, n_results: int = 5) -> list[dict]:
    """Embed the query and retrieve the closest contextualized chunks."""

    query_embedding = embedder.encode(query).tolist()

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results
    )

    return [
        {
            "id": chunk_id,
            "text": doc,
            "context": meta["context"]
        }
        for chunk_id, doc, meta in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0]
        )
    ]
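
The optional query contextualization is not implemented above. A minimal sketch, assuming the same OpenAI client from Step 1 and a hypothetical rewrite prompt, could look like this:

def contextualize_query(query: str, corpus_description: str) -> str:
    """Hypothetical helper: ask the LLM to expand a terse query before embedding it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite this search query so it is self-contained and specific "
                f"to a corpus of {corpus_description}. Return only the rewritten query.\n\n"
                f"Query: {query}"
            )
        }],
        max_tokens=60
    )
    return response.choices[0].message.content.strip()

# Embed the rewritten query instead of the raw one
results = contextual_search(
    contextualize_query("European performance?", "quarterly financial reports")
)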

Optimization: Caching Context Generation

Generating context for every chunk is expensive. Cache the results:

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("./context_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_cached_context(document: str, chunk: str) -> str | None:
    """Check cache for existing context."""
    cache_key = hashlib.md5(f"{document[:500]}{chunk}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text())["context"]
    return None

def save_context_to_cache(document: str, chunk: str, context: str):
    """Save generated context to cache."""
    cache_key = hashlib.md5(f"{document[:500]}{chunk}".encode()).hexdigest()
    cache_file = CACHE_DIR / f"{cache_key}.json"

    cache_file.write_text(json.dumps({"context": context}))

def generate_context_with_cache(document: str, chunk: str) -> str:
    """Generate context with caching."""
    cached = get_cached_context(document, chunk)
    if cached:
        return cached

    context = generate_context(document, chunk)
    save_context_to_cache(document, chunk, context)
    return context
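
To plug the cache into the pipeline, swap the call inside create_contextual_chunks for the cached variant (a one-line change under the assumptions above):

# In create_contextual_chunks, replace
#     context = generate_context(document, chunk)
# with
context = generate_context_with_cache(document, chunk)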

Combining with BM25

For even better retrieval, combine contextual embeddings with BM25:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

class HybridRetriever:
    def __init__(self, chunks: list[dict]):
        self.chunks = chunks

        # BM25 on contextual text
        tokenized = [c["contextual_text"].split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

        # Vector store
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        self.embeddings = self.embedder.encode(
            [c["contextual_text"] for c in chunks]
        )

    def search(self, query: str, n: int = 5, alpha: float = 0.5) -> list[dict]:
        """Hybrid search with weighted combination."""
        # BM25 scores (guard against all-zero scores when no query terms match)
        bm25_scores = self.bm25.get_scores(query.split())
        if bm25_scores.max() > 0:
            bm25_scores = bm25_scores / bm25_scores.max()  # Normalize to [0, 1]

        # Vector scores
        query_emb = self.embedder.encode(query)
        vector_scores = np.dot(self.embeddings, query_emb)
        vector_scores = (vector_scores - vector_scores.min()) / (
            vector_scores.max() - vector_scores.min()
        )

        # Combine scores
        combined = alpha * vector_scores + (1 - alpha) * bm25_scores

        # Get top n
        top_indices = np.argsort(combined)[-n:][::-1]

        return [self.chunks[i] for i in top_indices]
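
A short usage sketch, assuming chunks came from create_contextual_chunks above:

retriever = HybridRetriever(chunks)

# alpha=0.5 weights vector and BM25 scores equally;
# raise alpha to favor semantic matches, lower it to favor exact term matches
for result in retriever.search("How did the European region perform?", n=3):
    print(result["context"])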

When to Use Contextual RAG

Use it when your documents are hierarchical (reports, manuals, anything with headers), chunks depend on their surrounding sections for meaning, or queries tend to reference context that lives outside the chunk text, and the one-time LLM indexing cost is acceptable for your corpus size.

Skip it when chunks are already self-contained (FAQs, short standalone articles), or when per-chunk LLM calls are prohibitive at your scale.

Cost Analysis

Contextual RAG adds LLM costs at indexing time:

Documents    Chunks     Context Calls    Cost (GPT-4o-mini)
100          1,000      1,000            ~$0.15
1,000        10,000     10,000           ~$1.50
10,000       100,000    100,000          ~$15.00

This is a one-time indexing cost. Query costs remain the same.
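
The numbers above are approximate; a back-of-envelope check, assuming roughly 1,000 prompt tokens and 50 output tokens per call at gpt-4o-mini's list prices (about $0.15 per million input tokens, $0.60 per million output):

calls = 1_000                      # one context call per chunk
input_tokens = calls * 1_000       # ~3,000-char document excerpt + chunk + instructions
output_tokens = calls * 50         # capped by max_tokens=100

cost = input_tokens / 1e6 * 0.15 + output_tokens / 1e6 * 0.60
print(f"~${cost:.2f} for {calls} chunks")   # ~$0.18, the same order of magnitude as the table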

Measuring Improvement

Track retrieval recall before and after adding context; the function below computes recall over the top 5 results for a set of labelled queries:

def evaluate_retrieval(queries: list[str], ground_truth: list[list[str]]):
    """Evaluate retrieval quality."""
    hits = 0
    total = 0

    for query, expected_chunks in zip(queries, ground_truth):
        results = contextual_search(query, n_results=5)
        retrieved_ids = [r["id"] for r in results]

        for expected in expected_chunks:
            total += 1
            if expected in retrieved_ids:
                hits += 1

    recall = hits / total
    return recall
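
A minimal call, using a hypothetical labelled query (the chunk id follows the "chunk_{i}" naming scheme from the chunking step):

queries = ["How did the European region perform?"]
ground_truth = [["chunk_2"]]   # hypothetical id of the Europe chunk

print(f"Recall@5: {evaluate_retrieval(queries, ground_truth):.2f}")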

Anthropic’s benchmarks show contextual embeddings alone cutting the top-20-chunk retrieval failure rate by 35%, contextual embeddings plus contextual BM25 cutting it by 49%, and adding a reranking step on top cutting it by 67%.

Complete Example

# Full pipeline
document = open("quarterly_report.md").read()

# 1. Create contextual chunks
chunks = create_contextual_chunks(document)

# 2. Index
index_contextual_chunks(chunks)

# 3. Query
results = contextual_search("What was the European performance?")

for r in results:
    print(f"Context: {r['context']}")
    print(f"Content: {r['text'][:200]}...")
    print("---")

What’s Next

Contextual RAG is one technique in the Advanced RAG toolkit; other improvements worth exploring include reranking retrieved results and contextualizing queries, both touched on above.

The key insight: RAG quality is mostly about retrieval quality. Invest in making your chunks as searchable as possible, and everything downstream improves.


Try It Yourself

Copy this prompt into your AI coding agent to build this project:

Build a Contextual RAG pipeline that improves retrieval by adding context:
1. A context generator that uses an LLM to describe each chunk's position
2. A chunking pipeline that prepends context to chunks before embedding
3. A ChromaDB vector store that indexes contextualized text
4. A hybrid retriever combining BM25 and vector search with alpha weighting

Include caching for context generation using MD5 hashes. Test with a
hierarchical document (with headers/sections) and show how contextual
chunks improve retrieval for queries referencing section context.