From RAG to Agentic Retrieval
Why Retrieval Still Matters
Large language models are great at reasoning, but terrible at remembering everything.
Retrieval-Augmented Generation (RAG) solves this by pulling in external knowledge — but the basic pipeline is just the beginning.
In this post, I’ll walk through:
- Vanilla RAG (dense embeddings + cosine similarity)
 - Advanced RAG (BM25, rerankers, HyDE)
 - Agentic Retrieval (reflection + tool use)
 - And how to measure all of this — with both traditional IR metrics and LLM-as-a-judge evaluations.
 
Vanilla RAG
At its core, RAG relies on dense embeddings:
- A bi-encoder model maps the query into a vector.
 - The same encoder maps each document into a vector.
 - You compare them — usually with cosine similarity (how close the two vectors are in high-dimensional space).
 - Retrieve the top-k most similar documents (through FAISS or other vector stores).
 - Feed them into the LLM to answer.
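Here's a minimal sketch of that loop, assuming `sentence-transformers` and `faiss-cpu` are installed (the model name, toy documents, and the `ask_llm` call at the end are placeholders, not a specific recommendation):

```python
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "The Eiffel Tower opened in March 1889.",
    "Paris is the capital of France.",
    "FAISS is a library for efficient similarity search.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any bi-encoder works here

# Encode the corpus once; normalized vectors make inner product == cosine similarity.
doc_vecs = encoder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# Encode the query with the same encoder and take the top-k nearest documents.
query = "When did the Eiffel Tower open?"
query_vec = encoder.encode([query], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vec, 2)

context = "\n".join(docs[i] for i in ids[0])
# answer = ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")  # hypothetical LLM call
```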
 

Dense embeddings are the workhorse of simple RAG today: cheap, effective, and supported out-of-the-box by OpenAI, Hugging Face, and others.
But — if the retriever misses, the LLM has no chance.
And even with huge context windows (Anthropic’s 1M tokens, OpenAI’s 200k), you don’t just dump the whole database into the prompt.
- See Lost in the Middle: How Language Models Use Long Contexts - Liu et al. (2023): LLMs often fail when key info is buried inside long contexts.
 - So retrieval still matters — it decides what the model sees.
 
Hybrid Retrieval & Reranking
Dense embeddings are powerful, but not the whole story. Advanced RAG combines multiple approaches:
BM25
- Built as an improvement over TF–IDF (term frequency–inverse document frequency).
 - Normalizes document length and adjusts saturation of term frequency.
 - For decades, BM25 was the gold standard of search engines — and still remains a strong baseline today.
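For reference, the textbook Okapi BM25 scoring function (the standard formulation, not something specific to this post) makes both of those ideas explicit:

$$\mathrm{score}(D, q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(t, D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, and avgdl is the average length in the corpus; $k_1$ controls term-frequency saturation and $b$ controls document-length normalization.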
 
Dense Embeddings
- As in vanilla RAG, encode query and doc separately, compare with cosine similarity.
 - Capture semantic similarity better than BM25 alone.
 
Rerankers (Cross-Encoders)
- Unlike bi-encoders, a cross-encoder takes query + doc together and makes a joint relevance judgment.
 - Much slower, but extremely accurate — perfect for reranking the top-50 candidates.
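A reranking sketch using the CrossEncoder class from `sentence-transformers` (the model name and toy candidates are just illustrative choices):

```python
from sentence_transformers import CrossEncoder

query = "When did the Eiffel Tower open?"
candidates = [  # in practice: the top-50 docs from BM25 + dense retrieval
    "The Eiffel Tower opened in March 1889.",
    "Paris is the capital of France.",
    "Gustave Eiffel's company built the tower for the 1889 World's Fair.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# The cross-encoder scores each (query, doc) pair jointly: slower, but far more precise.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```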
 
HyDE (Hypothetical Document Embeddings)
- Use an LLM to imagine a possible answer, embed it, and search with that instead of the raw query.
 - Helps when the query is underspecified.
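A HyDE sketch that reuses the `encoder`, `index`, and `docs` from the vanilla RAG snippet above; `generate` stands in for whatever LLM call you prefer (a hypothetical helper, not a real API):

```python
def hyde_search(question: str, k: int = 5) -> list[str]:
    # 1. Let the LLM hallucinate a plausible answer passage (it does not need to be correct).
    hypothetical_doc = generate(
        f"Write a short passage that answers the question:\n{question}"
    )

    # 2. Embed the hypothetical passage instead of the raw question...
    vec = encoder.encode([hypothetical_doc], normalize_embeddings=True).astype("float32")

    # 3. ...and search the same dense index with it.
    scores, ids = index.search(vec, k)
    return [docs[i] for i in ids[0]]
```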
 
Pipeline:
- Expand the query with HyDE (optional).
 - Retrieve with BM25 + dense embeddings.
 - Rerank with a cross-encoder.
 - Feed the best passages into the LLM.
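The list above doesn't say how the BM25 and dense result lists get merged before reranking; one common choice (my assumption here, not something prescribed by the post) is reciprocal rank fusion:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids; k=60 is the usual damping constant."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# candidates = reciprocal_rank_fusion([bm25_ids, dense_ids])[:50]  # then hand these to the cross-encoder
```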
 

Measuring Retrievers
When evaluating retrieval methods, two standard metrics are used in information retrieval:
Recall@k
- Definition: What fraction of questions had at least one gold (relevant) document appear in the top-k retrieved results?
 - Example: Recall@5 = 0.80 means in 80% of queries, at least one gold doc appeared among the top 5.
 - Simple, intuitive, and heavily used in RAG practice (since if the gold isn’t in the context, the LLM can’t answer).
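Computing it takes only a few lines once you have, per query, the retrieved doc ids and the gold doc ids (a minimal sketch with made-up variable names):

```python
def recall_at_k(retrieved_per_query: list[list[str]], gold_per_query: list[list[str]], k: int) -> float:
    """Fraction of queries where at least one gold doc appears in the top-k results."""
    hits = sum(
        1 for retrieved, gold in zip(retrieved_per_query, gold_per_query)
        if set(retrieved[:k]) & set(gold)
    )
    return hits / len(gold_per_query)

# recall_at_k(all_retrieved, all_gold, k=5)  # e.g. 0.80 means 80% of queries had a gold doc in the top 5
```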
 
nDCG@k
- Short for Normalized Discounted Cumulative Gain
 - Rewards systems not just for finding the right documents, but for ranking them higher.
- Formula:

  $$\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad \mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}$$

  where $rel_i$ is the relevance of the document at rank $i$ (binary or graded) and $\mathrm{IDCG@k}$ is the maximum possible DCG at rank $k$ (the ideal ranking).
- Range: 0 → 1.
 - Example: if a system retrieves the single gold doc at rank 1, nDCG@k = 1; if it only shows up at rank 10, the score is much lower.
 - Popular in academic IR benchmarks, but less critical in industry RAG where all top-k docs go into the prompt.
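A minimal single-query implementation of the formula above with binary relevance (gold docs count as 1, everything else as 0):

```python
import math

def ndcg_at_k(retrieved_ids: list[str], gold_ids: list[str], k: int) -> float:
    """nDCG@k with binary relevance."""
    gold = set(gold_ids)
    rels = [1 if doc_id in gold else 0 for doc_id in retrieved_ids[:k]]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))   # i is 0-based, so rank = i + 1
    ideal = [1] * min(len(gold), k) + [0] * (k - min(len(gold), k))   # best possible ordering
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# ndcg_at_k(["d7", "d3", "d9"], gold_ids=["d3"], k=3)  # gold at rank 2 -> ~0.63
```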
 
What Do These Metrics Look Like in Practice?
Here are published nDCG@10 results (from BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models - Thakur et al. (2021)) across several datasets:
| Dataset | BM25 | DPR (Dense) | Hybrid + Cross-Encoder | 
|---|---|---|---|
| HotpotQA | 0.60 | 0.391 | 0.707 | 
| NQ | 0.329 | 0.474 | 0.533 | 
| FiQA-2018 | 0.236 | 0.112 | 0.347 | 
| TREC-COVID | 0.656 | 0.332 | 0.757 | 
Table adapted from Thakur et al., "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (2021).
Notes:
- BM25 = classic keyword search baseline.
 - DPR = Dense Passage Retrieval (bi-encoder).
 - Hybrid + CE = BM25 candidates reranked with a cross-encoder (listed as BM25 + CE in the BEIR paper).
 - Scores are nDCG@10 and range 0 → 1. Higher = more relevant docs ranked higher.
 - Datasets: Natural Questions (Kwiatkowski et al., CC BY-SA 3.0), FiQA-2018 (Maia et al., 2018), TREC-COVID (Voorhees et al., 2020; CORD-19, CC BY 4.0), HotpotQA (Yang et al., CC BY-SA 4.0)
 
Agentic Retrieval: Beyond One-Shot Search
Even with rerankers, RAG is brittle: if it misses once, you lose.
Agentic retrieval makes the system search actively.
- Reflection: “Did I find enough info?”
 - Query reformulation: generate better or multiple sub-queries.
 - Tool use: swap retrievers, rerank, or retry.
 - Multi-hop reasoning: break a question into smaller steps.
 
This is where we move from a static “retrieval pipeline” to a dynamic retrieval agent.
At a high level, the loop looks like this:
```python
# search / assess_quality / reformulate / ask are placeholder helpers
iterations = 0
all_results = []
current_query = original_query
while iterations < max_iterations:
    results = search(current_query)            # any retriever: BM25, dense, hybrid...
    all_results.extend(results)                # keep everything gathered so far
    reflection = assess_quality(results)       # LLM self-check: "did I find enough info?"
    if reflection.needs_more_search:
        current_query = reformulate(original_query, reflection.feedback)
    else:
        break
    iterations += 1
answer = ask(original_query, all_results)      # final generation over all retrieved evidence
```
Benchmark Snapshot
I ran 100 questions from the HotpotQA dataset (Yang et al., CC BY-SA 4.0) through all three pipelines.
Evaluation: LLM-as-a-judge (factual correctness) — not strict EM/F1, but a semantic check that the generated answer matched the ground truth.
| Method | Judge-Correct (100 Qs) | Notes | 
|---|---|---|
| Vanilla RAG | 72% | Dense embeddings only | 
| Advanced RAG | 74% | BM25 + embeddings + cross-encoder | 
| Agentic RAG | 80% | Reflection & tool use (multi-step) | 
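The judge step is just another LLM call; here's a sketch of the kind of prompt involved (the wording and the `generate` helper are my own placeholders, not the exact setup used for the numbers above):

```python
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Ground-truth answer: {gold}
Model answer: {prediction}
Does the model answer state the same fact as the ground truth? Reply with only YES or NO."""

def judge_correct(question: str, gold: str, prediction: str) -> bool:
    verdict = generate(JUDGE_PROMPT.format(question=question, gold=gold, prediction=prediction))
    return verdict.strip().upper().startswith("YES")
```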
Takeaways
- Dense embeddings (cosine similarity) are the backbone of simple RAG.
 - BM25 still matters: decades old, still competitive.
 - Cross-encoders are perfect rerankers: slower, but more precise than bi-encoders.
 - Chunking changes everything: sentence vs 1000-char blocks can make or break recall.
 - Agentic retrieval is the big leap: reflection + tool use > any single retriever.
 
Next Up
We're done with the fundamentals. Next, let's talk about MCP (Model Context Protocol): how to wire up retrieval, agents, and infra in a real product.
👉 Full code for the agent is available in my GitHub Repo