From RAG to Agentic Retrieval
Why Retrieval Still Matters
Large language models are great at reasoning, but terrible at remembering everything.
Retrieval-Augmented Generation (RAG) solves this by pulling in external knowledge — but the basic pipeline is just the beginning.
In this post, I’ll walk through:
- Vanilla RAG (dense embeddings + cosine similarity)
 - Advanced RAG (BM25, rerankers, HyDE)
 - Agentic Retrieval (reflection + tool use)
 - And how to measure all of this — with both traditional IR metrics and LLM-as-a-judge evaluations.
 
Vanilla RAG
At its core, RAG relies on dense embeddings:
- A bi-encoder model maps the query into a vector.
 - The same encoder maps each document into a vector.
 - You compare them — usually with cosine similarity (how close the two vectors are in high-dimensional space).
 - Retrieve the top-k most similar documents (through FAISS or other vector stores).
 - Feed them into the LLM to answer.
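Here's a minimal sketch of that loop, assuming `sentence-transformers` and `faiss-cpu` are installed (the model name, toy documents, and the `ask_llm` call at the end are placeholders, not a specific recommendation):

```python
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "The Eiffel Tower opened in March 1889.",
    "Paris is the capital of France.",
    "FAISS is a library for efficient similarity search.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any bi-encoder works here

# Encode the corpus once; normalized vectors make inner product == cosine similarity.
doc_vecs = encoder.encode(docs, normalize_embeddings=True).astype("float32")
index = faiss.IndexFlatIP(doc_vecs.shape[1])
index.add(doc_vecs)

# Encode the query with the same encoder and take the top-k nearest documents.
query = "When did the Eiffel Tower open?"
query_vec = encoder.encode([query], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_vec, 2)

context = "\n".join(docs[i] for i in ids[0])
# answer = ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")  # hypothetical LLM call
```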
 

Dense embeddings are the workhorse of simple RAG today: cheap, effective, and supported out-of-the-box by OpenAI, Hugging Face, and others.
But — if the retriever misses, the LLM has no chance.
And even with huge context windows (Anthropic’s 1M tokens, OpenAI’s 200k), you don’t just dump the whole database into the prompt.
- See Lost in the Middle: How Language Models Use Long Contexts - Liu et al. (2023): LLMs often fail when key info is buried inside long contexts.
 - So retrieval still matters — it decides what the model sees.
 
Hybrid Retrieval & Reranking
Dense embeddings are powerful, but not the whole story. Advanced RAG combines multiple approaches:
BM25
- Built as an improvement over TF–IDF (term frequency–inverse document frequency).
 - Normalizes document length and adjusts saturation of term frequency.
 - For decades, BM25 was the gold standard of search engines — and still remains a strong baseline today.
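For reference, the textbook Okapi BM25 scoring function (the standard formulation, not something specific to this post) makes both of those ideas explicit:

$$\mathrm{score}(D, q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$

where $f(t, D)$ is the frequency of term $t$ in document $D$, $|D|$ is the document length, and avgdl is the average length in the corpus; $k_1$ controls term-frequency saturation and $b$ controls document-length normalization.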
 
Dense Embeddings
- As in vanilla RAG, encode query and doc separately, compare with cosine similarity.
 - Capture semantic similarity better than BM25 alone.
 
Rerankers (Cross-Encoders)
- Unlike bi-encoders, a cross-encoder takes query + doc together and makes a joint relevance judgment.
 - Much slower, but extremely accurate — perfect for reranking the top-50 candidates.
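A reranking sketch using the CrossEncoder class from `sentence-transformers` (the model name and toy candidates are just illustrative choices):

```python
from sentence_transformers import CrossEncoder

query = "When did the Eiffel Tower open?"
candidates = [  # in practice: the top-50 docs from BM25 + dense retrieval
    "The Eiffel Tower opened in March 1889.",
    "Paris is the capital of France.",
    "Gustave Eiffel's company built the tower for the 1889 World's Fair.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# The cross-encoder scores each (query, doc) pair jointly: slower, but far more precise.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```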
 
HyDE (Hypothetical Document Embeddings)
- Use an LLM to imagine a possible answer, embed it, and search with that instead of the raw query.
 - Helps when the query is underspecified.
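A HyDE sketch that reuses the `encoder`, `index`, and `docs` from the vanilla RAG snippet above; `generate` stands in for whatever LLM call you prefer (a hypothetical helper, not a real API):

```python
def hyde_search(question: str, k: int = 5) -> list[str]:
    # 1. Let the LLM hallucinate a plausible answer passage (it does not need to be correct).
    hypothetical_doc = generate(
        f"Write a short passage that answers the question:\n{question}"
    )

    # 2. Embed the hypothetical passage instead of the raw question...
    vec = encoder.encode([hypothetical_doc], normalize_embeddings=True).astype("float32")

    # 3. ...and search the same dense index with it.
    scores, ids = index.search(vec, k)
    return [docs[i] for i in ids[0]]
```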
 
Pipeline:
- Expand the query with HyDE (optional).
 - Retrieve with BM25 + dense embeddings.
 - Rerank with a cross-encoder.
 - Feed the best passages into the LLM.
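The list above doesn't say how the BM25 and dense result lists get merged before reranking; one common choice (my assumption here, not something prescribed by the post) is reciprocal rank fusion:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids; k=60 is the usual damping constant."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# candidates = reciprocal_rank_fusion([bm25_ids, dense_ids])[:50]  # then hand these to the cross-encoder
```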
 

Measuring Retrievers
When evaluating retrieval methods, two standard metrics are used in information retrieval:
Recall@k
- Definition: What fraction of questions had at least one gold (relevant) document appear in the top-k retrieved results?
 - Example: Recall@5 = 0.80 means in 80% of queries, at least one gold doc appeared among the top 5.
 - Simple, intuitive, and heavily used in RAG practice (since if the gold isn’t in the context, the LLM can’t answer).
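Computing it takes only a few lines once you have, per query, the retrieved doc ids and the gold doc ids (a minimal sketch with made-up variable names):

```python
def recall_at_k(retrieved_per_query: list[list[str]], gold_per_query: list[list[str]], k: int) -> float:
    """Fraction of queries where at least one gold doc appears in the top-k results."""
    hits = sum(
        1 for retrieved, gold in zip(retrieved_per_query, gold_per_query)
        if set(retrieved[:k]) & set(gold)
    )
    return hits / len(gold_per_query)

# recall_at_k(all_retrieved, all_gold, k=5)  # e.g. 0.80 means 80% of queries had a gold doc in the top 5
```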
 
nDCG@k
- Short for Normalized Discounted Cumulative Gain
 - Rewards systems not just for finding the right documents, but for ranking them higher.
- Formula:

  $$\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}, \qquad \mathrm{nDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}$$

  where $rel_i$ is the relevance of the document at rank $i$ (binary or graded) and $\mathrm{IDCG@k}$ is the maximum possible DCG at rank $k$ (the ideal ranking).
- Range: 0 → 1.
 - Example: if a system retrieves the single gold doc at rank 1, nDCG@k = 1; if it only shows up at rank 10, the score is much lower.
 - Popular in academic IR benchmarks, but less critical in industry RAG where all top-k docs go into the prompt.
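A minimal single-query implementation of the formula above with binary relevance (gold docs count as 1, everything else as 0):

```python
import math

def ndcg_at_k(retrieved_ids: list[str], gold_ids: list[str], k: int) -> float:
    """nDCG@k with binary relevance."""
    gold = set(gold_ids)
    rels = [1 if doc_id in gold else 0 for doc_id in retrieved_ids[:k]]
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))   # i is 0-based, so rank = i + 1
    ideal = [1] * min(len(gold), k) + [0] * (k - min(len(gold), k))   # best possible ordering
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# ndcg_at_k(["d7", "d3", "d9"], gold_ids=["d3"], k=3)  # gold at rank 2 -> ~0.63
```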
 
What Do These Metrics Look Like in Practice?
Here are published nDCG@10 results (from BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models - Thakur et al. (2021)) across several datasets:
| Dataset | BM25 | DPR (Dense) | Hybrid + Cross-Encoder | 
|---|---|---|---|
| HotpotQA | 0.60 | 0.391 | 0.707 | 
| NQ | 0.329 | 0.474 | 0.533 | 
| FiQA-2018 | 0.236 | 0.112 | 0.347 | 
| TREC-COVID | 0.656 | 0.332 | 0.757 | 
Table adapted from Thakur et al., "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (2021).
Notes:
- BM25 = classic keyword search baseline.
 - DPR = Dense Passage Retrieval (bi-encoder).
 - Hybrid + CE = BM25 candidates reranked with a cross-encoder (listed as BM25 + CE in the BEIR paper).
 - Scores are nDCG@10 and range 0 → 1. Higher = more relevant docs ranked higher.
 - Datasets: Natural Questions (Kwiatkowski et al., CC BY-SA 3.0), FiQA-2018 (Maia et al., 2018), TREC-COVID (Voorhees et al., 2020; CORD-19, CC BY 4.0), HotpotQA (Yang et al., CC BY-SA 4.0)
 
Agentic Retrieval: Beyond One-Shot Search
Even with rerankers, RAG is brittle: if it misses once, you lose.
Agentic retrieval makes the system search actively.
- Reflection: “Did I find enough info?”
 - Query reformulation: generate better or multiple sub-queries.
 - Tool use: swap retrievers, rerank, or retry.
 - Multi-hop reasoning: break a question into smaller steps.
 
This is where we move from a static “retrieval pipeline” to a dynamic retrieval agent.
At a high level, the loop looks like this:
```python
# search / assess_quality / reformulate / ask are placeholder helpers
iterations = 0
all_results = []
current_query = original_query
while iterations < max_iterations:
    results = search(current_query)            # any retriever: BM25, dense, hybrid...
    all_results.extend(results)                # keep everything gathered so far
    reflection = assess_quality(results)       # LLM self-check: "did I find enough info?"
    if reflection.needs_more_search:
        current_query = reformulate(original_query, reflection.feedback)
    else:
        break
    iterations += 1
answer = ask(original_query, all_results)      # final generation over all retrieved evidence
```
Benchmark Snapshot
I ran 100 questions from the HotpotQA dataset (Yang et al., CC BY-SA 4.0) through all three pipelines.
Evaluation: LLM-as-a-judge (factual correctness) — not strict EM/F1, but a semantic check that the generated answer matched the ground truth.
| Method | Judge-Correct (100 Qs) | Notes | 
|---|---|---|
| Vanilla RAG | 72% | Dense embeddings only | 
| Advanced RAG | 74% | BM25 + embeddings + cross-encoder | 
| Agentic RAG | 80% | Reflection & tool use (multi-step) | 
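The judge step is just another LLM call; here's a sketch of the kind of prompt involved (the wording and the `generate` helper are my own placeholders, not the exact setup used for the numbers above):

```python
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Ground-truth answer: {gold}
Model answer: {prediction}
Does the model answer state the same fact as the ground truth? Reply with only YES or NO."""

def judge_correct(question: str, gold: str, prediction: str) -> bool:
    verdict = generate(JUDGE_PROMPT.format(question=question, gold=gold, prediction=prediction))
    return verdict.strip().upper().startswith("YES")
```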
Takeaways
- Dense embeddings (cosine similarity) are the backbone of simple RAG.
 - BM25 still matters: decades old, still competitive.
 - Cross-encoders are perfect rerankers: slower, but more precise than bi-encoders.
 - Chunking changes everything: sentence vs 1000-char blocks can make or break recall.
 - Agentic retrieval is the big leap: reflection + tool use > any single retriever.
 
Next Up
We're done with the fundamentals. Next, let's talk about MCP (Model Context Protocol): how to wire up retrieval, agents, and infra in a real product.
👉 Full code for the agent is available in my GitHub Repo