A Real Search Engine: Information Retrieval Fundamentals from Scratch
You train a model, generate embeddings, wire up a RAG pipeline. The demo works. In production, the LLM starts hallucinating about things that are clearly in your documents. The problem isn’t the LLM — it’s search. The right documents simply aren’t making it into the context.
Information Retrieval (IR) is the field that studies exactly this: given a set of documents and a query, how do you find and rank the most relevant ones? It’s the silent foundation of every search system — from Google to OpenSearch to your RAG pipeline.
This guide is problem-driven. Each section opens with something that breaks — a real failure mode in a search system — and shows why it breaks before introducing the fix. The goal is intuition, not recipes.
Here’s the full map of what we’ll cover:
| Concept | What it is |
|---|---|
| Inverted index | The data structure that makes fast search possible — maps terms to the documents that contain them |
| Text analysis pipeline | How raw text is cleaned, tokenized, and normalized before indexing |
| TF-IDF | A ranking formula that weighs term frequency against how rare the term is across the corpus |
| BM25 | The standard lexical ranking algorithm — handles document length and term saturation better than TF-IDF |
| Evaluation metrics | P@k, Recall@k, MRR, NDCG — how to measure whether your search is actually good |
| Semantic search | Embedding-based search that finds documents by meaning, not just matching words |
| ANN (Approximate Nearest Neighbor) | Algorithms like HNSW that make vector search fast enough for millions of documents |
| Hybrid search + RRF | Combining lexical and semantic rankings so each method covers the other’s blind spots |
| HyDE + query expansion | LLM-assisted techniques that bridge the gap between short queries and long documents |
| Chunking | Splitting documents into indexable pieces — and why the size matters more than you’d expect |
| Two-stage retrieval + reranking | Using a fast retriever to get candidates, then a precise cross-encoder to reorder them |
| Metadata filtering | Hard filters (tenant, date, status, ACL) that prevent semantically right but operationally wrong results |
| RAG-specific metrics | Context Recall, Faithfulness, Answer Correctness — what to measure when the consumer is an LLM |
By the end, you won’t just know how to wire up a search pipeline — you’ll know where to look when it returns garbage, and why each layer of a production RAG system exists.
Note: The code examples are didactic, optimized for clarity. Real production systems use far more sophisticated implementations. I cover that in the last section. You can run the examples cell by cell in the Jupyter Notebook.
Problem 1: You have 10,000 documents. How do you find the right one?
Imagine a simple corpus: meeting notes, technical documentation, blog posts. The user types a query. You need to return the most relevant documents.
The most direct approach: iterate over all documents and check whether they contain the query words.
docs = [
"Neural networks learn hierarchical representations of data",
"Gradient descent minimizes the loss function iteratively",
"Transformers use attention to capture long-range dependencies",
"Random forests combine multiple decision trees",
# ... 9,996 more documents
]
def linear_search(query: str, docs: list[str]) -> list[str]:
query_lower = query.lower()
return [doc for doc in docs if query_lower in doc.lower()]
results = linear_search("neural networks", docs)
This works for 10 documents. For 10 million — the size of any serious corpus — scanning everything on every query is infeasible. With 10M documents of 1KB each, that’s 10GB of data to scan per search.
But the problem goes beyond performance. This search doesn’t know how relevant a document is — only whether it contains the query or not. There’s no ranking. No notion of which result is better.
What we need: a structure that lets us find documents containing a term without scanning the entire corpus, and a way to rank results by relevance.
Concept 1: The Inverted Index
A search engine doesn’t search documents — it searches an index. The index is built once (or updated incrementally) and enables fast query responses.
The central structure is the inverted index: a mapping of term → list of documents that contain it.
term → documents
───────────────────────────────
"neural" → [doc_0, doc_7, doc_42, ...]
"gradient" → [doc_1, doc_8, doc_19, ...]
"transformer" → [doc_2, doc_7, doc_55, ...]
"tree" → [doc_3, doc_99, ...]
It’s the inversion of the natural index (document → terms). Hence the name.
from collections import defaultdict
import re
def tokenize(text: str) -> list[str]:
# Lowercase + alphanumeric words only
return re.findall(r'\b[a-záéíóúàâêôãõüç]+\b', text.lower())
def build_index(docs: list[str]) -> dict[str, set[int]]:
index = defaultdict(set)
for doc_id, doc in enumerate(docs):
for term in tokenize(doc):
index[term].add(doc_id)
return dict(index)
def search(query: str, index: dict, docs: list[str]) -> list[str]:
terms = tokenize(query)
# Intersection: documents that contain ALL terms
if not terms:
return []
result = index.get(terms[0], set())
for term in terms[1:]:
result = result & index.get(term, set())
return [docs[i] for i in result]
docs = [
"Neural networks learn hierarchical representations",
"Gradient descent optimizes neural networks",
"Transformers are neural networks with attention",
"Random forests do not use gradients",
]
index = build_index(docs)
print(search("neural networks", index, docs))
# ['Neural networks learn...', 'Gradient descent optimizes...', 'Transformers are...']
Now search is O(1) per term (dictionary lookup) + O(k) to intersect posting lists, where k is the number of documents containing the term — much smaller than the entire corpus.
But a new problem emerged: are the three documents about neural networks equally relevant for the query “neural networks”? Intuitively, no — some are about the topic, others just mention it. We need ranking.
Problem 2: Not every occurrence counts the same
Consider two documents about “machine learning”:
- Doc A: “Machine learning is a subfield of artificial intelligence. Machine learning uses data to learn patterns. Machine learning algorithms include neural networks and decision trees.”
- Doc B: “Python is used in machine learning, but also in web development and automation.”
For the query “machine learning”, Doc A is clearly more relevant — the term appears 3 times and is the central theme. Doc B mentions it once, in passing.
Term Frequency (TF): how many times the term appears in the document.
But TF alone has a problem: the word “the” appears in almost every document. High frequency of “the” doesn’t indicate relevance.
Inverse Document Frequency (IDF): words that appear in few documents are more informative.
IDF(term) = log(N / df(term))
where:
N = total number of documents
df = number of documents that contain the term
“Machine learning” appears in 100 of 10,000 docs → IDF = log(100) ≈ 4.6
”the” appears in 9,999 of 10,000 docs → IDF ≈ 0.0001
Multiply the two: TF-IDF.
import math
from collections import Counter
def calculate_tf(doc_tokens: list[str]) -> dict[str, float]:
counts = Counter(doc_tokens)
total = len(doc_tokens)
return {term: freq / total for term, freq in counts.items()}
def calculate_idf(docs_tokens: list[list[str]]) -> dict[str, float]:
N = len(docs_tokens)
df = defaultdict(int)
for tokens in docs_tokens:
for term in set(tokens):
df[term] += 1
return {term: math.log(N / freq) for term, freq in df.items()}
def score_tfidf(query: str, doc: str, idf: dict) -> float:
query_tokens = tokenize(query)
doc_tokens = tokenize(doc)
tf = calculate_tf(doc_tokens)
return sum(tf.get(t, 0) * idf.get(t, 0) for t in query_tokens)
# Example
docs = [
"Machine learning is a subfield of AI. Machine learning uses data. Machine learning algorithms include neural networks.",
"Python is used in machine learning, but also in web and automation.",
"Deep learning is a subfield of machine learning with multiple layers.",
]
docs_tokens = [tokenize(d) for d in docs]
idf = calculate_idf(docs_tokens)
query = "machine learning"
scores = [(score_tfidf(query, doc, idf), doc[:50]) for doc in docs]
scores.sort(reverse=True)
for score, doc in scores:
print(f"{score:.4f} | {doc}")
# 0.0812 | Machine learning is a subfield of AI... ← more relevant
# 0.0271 | Deep learning is a subfield of machine...
# 0.0108 | Python is used in machine learning... ← less relevant
TF-IDF has been the standard search ranking for decades. But it has known limitations:
- Length normalization is incomplete: our TF formula divides by document length, but a long document that repeatedly mentions “ML” in passing can still outscore a short document where “ML” is the central theme. BM25 addresses this more robustly.
- TF grows linearly without saturation: 10 occurrences isn’t 10× more relevant than 1, but TF-IDF treats it that way.
Concept 2: BM25 — The standard ranking algorithm
BM25 (Best Matching 25) emerged in the 1990s and remains the standard lexical retrieval baseline — widely used in Lucene-based systems like OpenSearch and Elasticsearch.
BM25 addresses exactly the problems of TF-IDF with two parameters:
- k1 (typically 1.2–2.0): controls term frequency saturation. Higher values let repeated occurrences keep contributing; lower values make the score saturate faster. Unlike TF, the contribution never grows linearly without bound.
- b (typically 0.75): controls length normalization.
BM25(q, d) = Σ IDF(t) × [TF(t,d) × (k1 + 1)] / [TF(t,d) + k1 × (1 - b + b × |d|/avgdl)]
where:
|d| = document length
avgdl = average document length in the corpus
class BM25:
def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
self.k1 = k1
self.b = b
self.docs_tokens = [tokenize(d) for d in docs]
self.N = len(docs)
self.avgdl = sum(len(t) for t in self.docs_tokens) / self.N
# IDF for each term
df = defaultdict(int)
for tokens in self.docs_tokens:
for term in set(tokens):
df[term] += 1
self.idf = {
t: math.log((self.N - freq + 0.5) / (freq + 0.5) + 1)
for t, freq in df.items()
}
def score(self, query: str, doc_id: int) -> float:
query_tokens = tokenize(query)
doc_tokens = self.docs_tokens[doc_id]
dl = len(doc_tokens)
tf_map = Counter(doc_tokens)
total = 0.0
for term in query_tokens:
if term not in self.idf:
continue
tf = tf_map.get(term, 0)
idf = self.idf[term]
numerator = tf * (self.k1 + 1)
denominator = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
total += idf * (numerator / denominator)
return total
def search(self, query: str, docs: list[str], top_k: int = 5) -> list[tuple]:
# Scores all documents for clarity. A real BM25 engine would use the
# inverted index to score only documents containing query terms.
scores = [(self.score(query, i), docs[i]) for i in range(self.N)]
return sorted(scores, reverse=True)[:top_k]
The practical difference between TF-IDF and BM25 becomes clear with documents of varying lengths:
docs = [
# Short and focused
"BM25 is the standard ranking algorithm in modern search engines.",
# Long with many mentions, but diluted
"This is a very long document about many things. It talks about NLP, IR, machine learning, Python, statistics, linear algebra, and neural networks. BM25 appears here too, but it is a minor detail among all of that. We keep talking about other topics for many more paragraphs.",
# Relevant but without mentioning exactly "BM25"
"Okapi BM25 extends the probabilistic model by Robertson and Spärck Jones.",
]
bm25 = BM25(docs)
results = bm25.search("BM25 ranking", docs)
for score, doc in results:
print(f"{score:.4f} | {doc[:60]}...")
# The short, focused document ranks above the long one with many mentions
Problem 3: How do we know if our search is good?
We built a system. But how do we know it works well? “Looks right” isn’t a metric. Before adding more complexity — embeddings, hybrid search, reranking — we need a way to measure whether each new layer actually improves retrieval.
IR has a well-established set of evaluation metrics, all based on a simple concept: judged relevance.
To evaluate, we need a test set: a set of queries where humans have judged which documents are relevant. With that, we compare what our system returned with what it should have returned.
Precision and Recall
Precision@k = (relevant documents in top-k) / k
Recall@k = (relevant documents in top-k) / (total relevant documents)
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
top_k = retrieved[:k]
return len(set(top_k) & relevant) / k
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
top_k = retrieved[:k]
return len(set(top_k) & relevant) / len(relevant)
# Example
retrieved = ["doc_2", "doc_5", "doc_1", "doc_8", "doc_3"]
relevant = {"doc_1", "doc_2", "doc_4", "doc_7"}
print(f"P@3 = {precision_at_k(retrieved, relevant, 3):.2f}") # 2/3 = 0.67
print(f"R@3 = {recall_at_k(retrieved, relevant, 3):.2f}") # 2/4 = 0.50
print(f"P@5 = {precision_at_k(retrieved, relevant, 5):.2f}") # 2/5 = 0.40
print(f"R@5 = {recall_at_k(retrieved, relevant, 5):.2f}") # 2/4 = 0.50
There’s a natural trade-off: increasing k increases recall (more relevant documents captured), but usually decreases precision (more irrelevant ones included).
Mean Reciprocal Rank (MRR)
For cases where there’s only one correct document (e.g., factual answer search), we use MRR: what position is the first relevant result in?
def reciprocal_rank(retrieved: list, relevant: set) -> float:
for i, doc in enumerate(retrieved, 1):
if doc in relevant:
return 1 / i
return 0.0
def mean_reciprocal_rank(results: list[tuple[list, set]]) -> float:
return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)
query_results = [
(["doc_3", "doc_1", "doc_2"], {"doc_1"}), # RR = 1/2
(["doc_5", "doc_4", "doc_2"], {"doc_4"}), # RR = 1/2
(["doc_1", "doc_2", "doc_3"], {"doc_1"}), # RR = 1/1
]
print(f"MRR = {mean_reciprocal_rank(query_results):.3f}") # (0.5 + 0.5 + 1.0) / 3 ≈ 0.667
NDCG (Normalized Discounted Cumulative Gain)
MRR and Precision treat relevance as binary (relevant/irrelevant). NDCG supports degrees of relevance (e.g., highly relevant = 3, relevant = 2, marginal = 1, irrelevant = 0).
The idea: a highly relevant document in position 1 is worth more than in position 5.
def dcg_at_k(scores: list[int], k: int) -> float:
"""scores[i] = relevance grade of the i-th result"""
return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))
def ndcg_at_k(retrieved_scores: list[int], ideal_scores: list[int], k: int) -> float:
dcg = dcg_at_k(retrieved_scores, k)
idcg = dcg_at_k(sorted(ideal_scores, reverse=True), k)
return dcg / idcg if idcg > 0 else 0.0
# retrieved: relevance scores of the returned documents in order
retrieved_scores = [3, 1, 2, 0, 3]
ideal_scores = [3, 3, 2, 1, 0] # best possible ordering
print(f"NDCG@5 = {ndcg_at_k(retrieved_scores, ideal_scores, 5):.3f}")
For RAG specifically: Recall@k is usually the first metric to optimize — the answer cannot be grounded if the evidence never reaches the LLM. But precision also matters: irrelevant or conflicting chunks can dilute the context and lead to wrong answers, even when the right chunk is present.
Problem 4: “Car” and “automobile” — lexical search doesn’t understand synonyms
BM25 is powerful, but has a fundamental limitation: it operates on the textual surface. It doesn’t know that “car” and “automobile” mean the same thing. It doesn’t know that “ML engineer” and “machine learning engineer” are equivalent.
docs = [
"How to rent an automobile in Brazil",
"Maintenance guide for your vehicle",
"The best electric cars of 2024",
]
bm25 = BM25(docs)
results = bm25.search("rent car", docs)
# "How to rent an automobile" ranks LOW — contains "rent" but not "car"
# "The best electric cars" ranks HIGH — contains "cars" but not "rent"
This problem is called vocabulary mismatch — the query vocabulary doesn’t match the document vocabulary.
The solution: instead of comparing words, compare meanings — represented as vectors in a semantic space.
Concept 3: Semantic Search with Embeddings
An embedding is a vector of real numbers representing the meaning of a text. Models like sentence-transformers are trained so that semantically similar texts produce nearby vectors in vector space.
Note on embedding models:
all-MiniLM-L6-v2is used here for simplicity. In production, prefer models trained specifically for retrieval — such as E5, BGE, or BGE-M3 — which use contrastive training on query-document pairs and often perform significantly better on IR benchmarks like BEIR and MTEB.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
docs = [
"How to rent an automobile in Brazil",
"Maintenance guide for your vehicle",
"The best electric cars of 2024",
"Transformers revolutionized natural language processing",
]
# Indexing: generate embeddings for all docs once
doc_embeddings = model.encode(docs) # shape: (4, 384)
def semantic_search(query: str, top_k: int = 3) -> list[tuple]:
query_embedding = model.encode([query])[0]
# Cosine similarity. For L2-normalized vectors this is equivalent to dot
# product — which is why production systems (FAISS, OpenSearch) normalize
# vectors at index time and use inner product for fast ANN search.
similarities = np.dot(doc_embeddings, query_embedding) / (
np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
indices = np.argsort(similarities)[::-1][:top_k]
return [(similarities[i], docs[i]) for i in indices]
results = semantic_search("rent car")
# Now "How to rent an automobile" appears at the top
# even without containing the word "car"
Semantic search captures intent and synonyms. But it has its own weaknesses:
When lexical search is better:
- Queries with exact technical terms:
NullPointerException,CVE-2024-1234,config.yaml - Proper names and acronyms:
GPT-4,BERT,Fernando Wittmann - Product/code-specific searches:
SELECT * FROM users WHERE id = 42
When semantic search is better:
- Natural language queries: “how do I…?”
- Synonyms and paraphrases
- Multilingual queries (question in PT, doc in EN)
- Intent without specific keywords
Concept 4: Hybrid Search — the best of both worlds
In production, we rarely use just one method. The modern approach combines BM25 + semantic search via rank fusion.
The simplest and most effective method is Reciprocal Rank Fusion (RRF):
RRF_score(d) = Σ 1 / (k + rank_i(d))
where rank_i(d) is the position of document d in ranking i
and k is a constant (typically 60)
def reciprocal_rank_fusion(
rankings: list[list[int]],
k: int = 60
) -> list[tuple[int, float]]:
"""
rankings: list of doc_id lists, in relevance order
returns: list of (doc_id, score) ordered by score
"""
scores = defaultdict(float)
for ranking in rankings:
for position, doc_id in enumerate(ranking, 1):
scores[doc_id] += 1.0 / (k + position)
return sorted(scores.items(), key=lambda x: x[1], reverse=True)
# Example: combine BM25 and semantic search
def hybrid_search(query: str, docs: list[str], top_k: int = 5):
# BM25 ranking
bm25 = BM25(docs)
bm25_scores = [(bm25.score(query, i), i) for i in range(len(docs))]
bm25_ranking = [i for _, i in sorted(bm25_scores, reverse=True)]
# Semantic ranking (simplified — production uses ANN)
query_emb = model.encode([query])[0]
doc_embs = model.encode(docs)
sims = np.dot(doc_embs, query_emb) / (
np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
)
semantic_ranking = list(np.argsort(sims)[::-1])
# RRF fusion
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
return [(docs[doc_id], score) for doc_id, score in fused[:top_k]]
Empirical research consistently shows that hybrid search outperforms any single method on IR benchmarks. The intuition: the two methods fail in different cases, and fusion cancels out the errors. That said, hybrid search doesn’t rescue a bad embedding model, noisy OCR text, or poor chunking — it amplifies whatever quality each branch already has.
The Engine Underneath: Approximate Nearest Neighbor (ANN)
There’s a practical problem hidden in semantic search: to find the k vectors most similar to a query, we’d need to compute similarity with all N documents — O(N × d), where d is the embedding dimension (typically 768–1536).
For 10M documents with 768-dimensional embeddings, that’s ~7.68 billion operations per query. Infeasible.
The solution: Approximate Nearest Neighbor (ANN) — algorithms that find the k nearest neighbors approximately, with a small precision sacrifice in exchange for absurd speed.
The most widely used:
- HNSW (Hierarchical Navigable Small World): hierarchical navigation graph. Standard in modern systems. Behaves sublinearly in practice — often close to O(log N) — though actual latency and memory depend heavily on index parameters, vector dimension, and recall target.
- IVF (Inverted File Index): divides vector space into clusters, searches only in the nearest clusters.
- Product Quantization: compresses vectors to reduce memory.
Brute force: O(N × d) — infeasible for N > 100k
HNSW: ~O(log N) — fast in practice; billions of vectors typically
require distributed infrastructure or quantization
Libraries like faiss (Meta) and hnswlib implement these algorithms. OpenSearch and Elasticsearch have native support via k-NN plugins.
import faiss
import numpy as np
d = 128 # embedding dimension
N = 100_000
# Generate example embeddings
embeddings = np.random.rand(N, d).astype('float32')
faiss.normalize_L2(embeddings) # normalize for cosine similarity
# Build HNSW index
index = faiss.IndexHNSWFlat(d, 32) # 32 = number of connections per node
index.add(embeddings)
# Search: find the 10 most similar vectors to the query
query = np.random.rand(1, d).astype('float32')
faiss.normalize_L2(query)
distances, indices = index.search(query, k=10)
# Milliseconds, not seconds
Concept 5: How documents are processed before indexing
Before indexing, every search engine applies a text analysis pipeline that transforms raw documents into indexable tokens. Each step has a purpose:
raw text → tokenization → normalization → stopwords → stemming/lemmatization → index
import re
from typing import Optional
STOPWORDS_EN = {
'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from',
'in', 'is', 'it', 'of', 'on', 'or', 'that', 'the', 'to', 'was',
'were', 'with'
}
def analyze_text(text: str, remove_stopwords: bool = True) -> list[str]:
# 1. Lowercase
text = text.lower()
# 2. Normalize accents (simplified)
accent_map = str.maketrans('áéíóúàâêôãõüç', 'aeiouaaeoaouc')
text = text.translate(accent_map)
# 3. Tokenize
tokens = re.findall(r'\b\w+\b', text)
# 4. Remove stopwords
if remove_stopwords:
tokens = [t for t in tokens if t not in STOPWORDS_EN]
# 5. Simple stemming (common English suffixes)
stems = []
for token in tokens:
if token.endswith('ing'):
token = token[:-3] # "learning" → "learn"
elif token.endswith('tion'):
token = token[:-4] + 't'
stems.append(token)
return stems
print(analyze_text("Neural networks are learning representations from data"))
# ['neural', 'networks', 'learn', 'representations', 'data']
Stemming vs Lemmatization:
- Stemming: mechanically strips suffixes. Fast, but imprecise (
"studies"→"studi"). - Lemmatization: uses a dictionary to find the base form. Accurate, but slower (
"studies"→"study").
In production, the choice depends on language and the performance/quality trade-off.
Tying It All Together: IR in RAG and LLM Agents
Now that you understand the fundamentals, let’s see how this connects to modern LLM systems.
RAG (Retrieval-Augmented Generation)
RAG is essentially IR + text generation:
user query
↓
[RETRIEVAL] → top-k relevant documents
↓
[AUGMENTATION] → query + documents build the context
↓
[GENERATION] → LLM generates an answer based on context
The bottleneck is almost always retrieval. If the right documents don’t reach the LLM, generation can’t be correct — and the LLM may hallucinate or simply answer “I don’t know.”
def simple_rag_pipeline(
query: str,
corpus: list[str],
llm_client,
top_k: int = 5
) -> str:
# 1. Retrieval: find relevant documents
bm25 = BM25(corpus)
scores = [(bm25.score(query, i), i) for i in range(len(corpus))]
top_indices = [i for _, i in sorted(scores, reverse=True)[:top_k]]
documents = [corpus[i] for i in top_indices]
# 2. Augmentation: build the context
context = "\n\n".join(f"[{i+1}] {doc}" for i, doc in enumerate(documents))
prompt = f"""Based on the documents below, answer: {query}
Documents:
{context}
Answer:"""
# 3. Generation
return llm_client.generate(prompt)
Query Rewriting and HyDE
LLMs can improve retrieval before searching:
Query Expansion: expand the query with synonyms and related terms.
def expand_query(query: str, llm_client) -> str:
prompt = f"""Given the search query: "{query}"
Generate an expanded version with synonyms and related terms.
Return only the expanded query, with no explanations.
Example: "car" → "car automobile vehicle transportation"
"""
return llm_client.generate(prompt)
HyDE (Hypothetical Document Embeddings): instead of embedding the query directly, ask the LLM to generate a hypothetical document that would answer the query, and embed that document. It works because user queries are short, informal, and fragmentary, while corpus documents are longer and written in a different register. Generating a hypothetical answer bridges that distributional gap — the resulting embedding lands closer to real documents in vector space.
def hyde(query: str, llm_client, embedding_model) -> np.ndarray:
# Generate a hypothetical document
prompt = f"Write a paragraph that directly answers: {query}"
hypothetical_doc = llm_client.generate(prompt)
# Embed the hypothetical document, not the query
return embedding_model.encode([hypothetical_doc])[0]
Agents with search tools
In agent systems, the search engine becomes a tool. The agent decides when and how to search:
tools = [
{
"name": "search_documents",
"description": "Searches relevant documents in the knowledge base",
"parameters": {
"query": "string — what you want to find",
"top_k": "int — how many documents to return (default: 5)",
"filters": "dict — optional metadata filters (date, author, type)"
}
},
{
"name": "search_code",
"description": "Searches code snippets in the repository",
"parameters": {
"query": "string — description of the code or function being searched for"
}
}
]
In this context, retrieval quality directly affects the agent’s ability to reason and act correctly. An agent that searches poorly, reasons poorly.
Chunking: splitting documents for indexing
Long documents need to be split into chunks before indexing. This deeply affects retrieval quality:
def chunk_by_sentence(
text: str,
max_tokens: int = 512,
overlap: int = 50
) -> list[str]:
"""
Split text into chunks while respecting sentences,
with overlap to avoid losing context at the edges.
"""
sentences = text.split('. ')
chunks = []
current_chunk = []
current_tokens = 0
for sentence in sentences:
sentence_tokens = len(sentence.split())
if current_tokens + sentence_tokens > max_tokens and current_chunk:
chunks.append('. '.join(current_chunk) + '.')
# Overlap: keep the last sentences
overlap_words = ' '.join(current_chunk[-2:]).split()
current_chunk = [' '.join(overlap_words[-overlap:])]
current_tokens = overlap
current_chunk.append(sentence)
current_tokens += sentence_tokens
if current_chunk:
chunks.append('. '.join(current_chunk))
return chunks
Chunk size vs quality:
- Chunks too small (< 100 tokens): lose context, make generation harder
- Chunks too large (> 1000 tokens): dilute the relevance signal
- Starting point: 256–512 tokens, with 10–20% overlap
The right size depends on the embedding model’s context window, document structure, and query type. Treat it as a starting point and validate with retrieval metrics — don’t cargo-cult the number.
Re-ranking: refining after retrieval
A common production pattern is two-stage retrieval:
- Candidate retrieval: fast search (BM25 or ANN) returns top-100 candidates
- Re-ranking: a more precise model (cross-encoder) reorders the top-100
query → [BM25/ANN] → top-100 candidates → [Cross-Encoder] → reranked top-10
Why two stages? Cross-encoders analyze query and document together — full attention over both. They’re much more accurate than bi-encoders (separate embeddings), but also much slower. Using them on 100 candidates, not 10 million, is feasible.
from sentence_transformers import CrossEncoder
# Cross-encoder for re-ranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def two_stage_retrieval(query: str, candidates: list[str], top_k: int = 10):
# Stage 1: candidates already come from BM25/ANN
# Stage 2: re-ranking with cross-encoder
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)
result = sorted(zip(scores, candidates), reverse=True)
return [doc for _, doc in result[:top_k]]
Problem 5: Semantically right, operationally wrong
Imagine a RAG system over company documentation. A user asks about the vacation policy. Semantic search returns several highly relevant chunks — but they’re from the 2021 policy, since the 2024 version uses slightly different wording and lives in a different folder.
The retrieval was semantically correct. The answer is wrong.
This is a class of failure that ranking alone can’t fix. The solution is metadata filtering: attach structured attributes to every document at index time, and apply hard filters before or after ranking.
from dataclasses import dataclass
from datetime import date
@dataclass
class Document:
id: str
content: str
metadata: dict # source, date, tenant, doc_type, version, status, ...
corpus = [
Document("p1", "Employees get 15 days of vacation per year.", {
"doc_type": "policy", "status": "active", "version": "2024", "tenant": "acme"
}),
Document("p2", "Employees get 10 days of vacation per year.", {
"doc_type": "policy", "status": "deprecated", "version": "2021", "tenant": "acme"
}),
Document("p3", "Vacation policy for contractors.", {
"doc_type": "policy", "status": "active", "version": "2024", "tenant": "globex"
}),
]
def filtered_search(query: str, docs: list[Document], filters: dict, top_k: int = 5):
candidates = [
d for d in docs
if all(d.metadata.get(k) == v for k, v in filters.items())
]
bm25 = BM25([d.content for d in candidates])
scores = [(bm25.score(query, i), candidates[i]) for i in range(len(candidates))]
return [doc for _, doc in sorted(scores, reverse=True)[:top_k]]
results = filtered_search(
"vacation days",
corpus,
filters={"tenant": "acme", "status": "active"}
)
# Returns only the 2024 active policy for the correct tenant
Common metadata fields in production RAG: tenant, user_permissions, doc_type, status (active/deprecated), date, source, language, product_version. Without these, you can retrieve semantically relevant but operationally wrong documents — the wrong client’s data, deprecated policies, or content the user doesn’t have permission to see.
RAG-specific evaluation metrics
The generic IR metrics (P@k, Recall@k, MRR, NDCG) measure whether relevant documents were retrieved. In RAG, we also need to measure whether the retrieved chunks actually support correct generation.
| Metric | What it measures |
|---|---|
| Context Recall | Did we retrieve the chunks that contain the answer evidence? |
| Context Precision | Are the retrieved chunks actually useful for the answer? |
| Faithfulness | Is the generated answer supported by the retrieved context? |
| Answer Correctness | Is the final answer factually correct? |
| Citation Accuracy | Do the citations point to the chunks that actually support the claim? |
Frameworks like RAGAS automate most of these evaluations using an LLM judge. The key distinction from classic IR: a document can be relevant but still not contain enough evidence. A chunk that mentions “vacation policy” in passing doesn’t help the LLM answer “how many days do I get.” Context Recall specifically measures whether the evidence-bearing chunk made it into the top-k — which is why it’s often the first metric to optimize in a RAG system.
What separates the didactic from the real
The examples in this article are intentionally simplified. Real production systems — OpenSearch, Elasticsearch, Weaviate, Qdrant — solve additional problems that would be articles on their own:
| Aspect | Didactic | Production |
|---|---|---|
| Scale | In-memory, MB | Distributed, TB+ |
| Indexing | Synchronous, full rebuild | Incremental, online |
| ANN | FAISS/HNSW in-memory | FAISS/Qdrant/Weaviate/OpenSearch with tuning, persistence, sharding, filters |
| Text | Simple tokenizer | Language-specific analyzers, stemmer, synonyms |
| Ranking | Standard BM25 | BM25 + learning-to-rank |
| Resilience | No fault tolerance | Replication, sharding |
| Security | No authentication | RBAC, field-level security |
| Monitoring | None | p99 latency, relevance drift |
When you use OpenSearch or Elasticsearch, BM25 is there — just implemented in Java with decades of optimizations. When you use Weaviate or Qdrant, HNSW is there — implemented in Go/Rust with managed memory.
The point of understanding the fundamentals: when your RAG system returns garbage, you know where to look. When precision@k metrics drop, you know how to investigate. When the user says “search can’t find X”, you know whether it’s a vocabulary problem (BM25), an embedding problem (semantic), chunking, or text analysis.
Summary: the IR trail
Problem → Introduced solution
────────────────────────────────────────────────────────
Slow O(N) search → Inverted index
No ranking → TF-IDF
Doc length/saturation → BM25
How to measure quality? → P@k, Recall@k, MRR, NDCG
Vocabulary mismatch → Embeddings + cosine similarity
ANN too slow at scale → HNSW / FAISS
Lexical vs semantic → Hybrid search + RRF
Short queries vs docs → HyDE, query expansion
Top-100 → Top-10 precision → Two-stage + cross-encoder
Wrong doc despite relevance → Metadata filters (tenant, status, date, ACL)
Did retrieval help the LLM? → RAG-specific metrics (Context Recall, Faithfulness)
The field of Information Retrieval has 60+ years of research. LLMs are new; search isn’t. Understanding IR well is what separates RAG pipelines that work in demos from those that work in production.
Retrieval debugging checklist
When retrieval breaks in production, most failures fall into one of these patterns:
| Symptom | Likely cause | What to inspect |
|---|---|---|
| Exact term not found | Analyzer/tokenization issue | Token stream, stemming, stopword list |
| Synonyms not found | Vocabulary mismatch | Embeddings, hybrid search, synonym expansion |
| Old or deprecated docs retrieved | Missing freshness/status filter | Metadata filters, status field |
| Wrong tenant/client docs returned | Permission or filter bug | ACL filters, tenant isolation |
| Related chunks retrieved but answer is wrong | Chunk doesn’t contain the evidence | Chunk boundaries, chunk size, Context Recall metric |
| Top result looks right but LLM answer is wrong | Generation or context issue | Prompt, faithfulness metric, conflicting chunks |
| Good results in testing, bad in production | Distribution shift | Query logs, embedding model mismatch, index staleness |
Going further
- Executable notebook: full implementation of the concepts above, available as a Jupyter Notebook.
- BEIR benchmark: IR evaluation suite with 18 heterogeneous datasets — the standard for comparing retrieval systems.
- BM25 paper: Robertson & Zaragoza (2009), “The Probabilistic Relevance Framework: BM25 and Beyond.”
- MTEB: Massive Text Embedding Benchmark — evaluates embedding models on retrieval tasks.
- ColBERT: late interaction — an efficient alternative to cross-encoders that maintains precision with reasonable latency.
- SPLADE and learned sparse models: neural models that produce sparse representations with vocabulary expansion — combining lexical interpretability with semantic coverage. A middle ground between BM25 and dense embeddings worth exploring for domain-specific corpora.