{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Search Engines Under the Hood: Executable Notebook\n",
    "\n",
    "This notebook is generated from the blog post `src/content/blog/information-retrieval-search-engines.md`.\n",
    "\n",
    "Most examples are pure Python. A few production-style examples normally require optional packages such as `sentence-transformers`, `numpy`, `faiss`, or a real LLM client. Those cells include lightweight fallbacks so you can still run the notebook end to end before installing heavier dependencies."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You train a model, generate embeddings, wire up a RAG pipeline. The demo works. In production, the LLM starts hallucinating about things that are clearly in your documents. The problem isn't the LLM — it's search. The right documents simply aren't making it into the context.\n",
    "\n",
    "Information Retrieval (IR) is the field that studies exactly this: given a set of documents and a query, how do you find and rank the most relevant ones? It's the silent foundation of every search system — from Google to OpenSearch to your RAG pipeline.\n",
    "\n",
    "This guide builds intuition from the ground up. Each section starts from a concrete problem, shows why the current solution fails, and introduces the next concept. By the end, you'll understand not just *what* to do, but *why* it works — and when it doesn't.\n",
    "\n",
    "> **Note:** The code examples are didactic, optimized for clarity. Real production systems use far more sophisticated implementations. I cover that in the last section.\n",
    "\n",
    "---\n",
    "\n",
    "## Problem 1: You have 10,000 documents. How do you find the right one?\n",
    "\n",
    "Imagine a simple corpus: meeting notes, technical documentation, blog posts. The user types a query. You need to return the most relevant documents.\n",
    "\n",
    "The most direct approach: iterate over all documents and check whether they contain the query words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = [\n",
    "    \"Neural networks learn hierarchical representations of data\",\n",
    "    \"Gradient descent minimizes the loss function iteratively\",\n",
    "    \"Transformers use attention to capture long-range dependencies\",\n",
    "    \"Random forests combine multiple decision trees\",\n",
    "    # ... 9,996 more documents\n",
    "]\n",
    "\n",
    "def linear_search(query: str, docs: list[str]) -> list[str]:\n",
    "    query_lower = query.lower()\n",
    "    return [doc for doc in docs if query_lower in doc.lower()]\n",
    "\n",
    "results = linear_search(\"neural networks\", docs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works for 10 documents. For 10 million — the size of any serious corpus — scanning everything on every query is infeasible. With 10M documents of 1KB each, that's 10GB of data to scan per search.\n",
    "\n",
    "But the problem goes beyond performance. This search doesn't know *how relevant* a document is — only whether it contains the query or not. There's no ranking. No notion of which result is better.\n",
    "\n",
    "**What we need:** a structure that lets us find documents containing a term without scanning the entire corpus, and a way to rank results by relevance.\n",
    "\n",
    "---\n",
    "\n",
    "## Concept 1: The Inverted Index\n",
    "\n",
    "A search engine doesn't search documents — it searches an **index**. The index is built once (or updated incrementally) and enables fast query responses.\n",
    "\n",
    "The central structure is the **inverted index**: a mapping of term → list of documents that contain it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "term            → documents\n",
    "───────────────────────────────\n",
    "\"neural\"        → [doc_0, doc_7, doc_42, ...]\n",
    "\"gradient\"      → [doc_1, doc_8, doc_19, ...]\n",
    "\"transformer\"   → [doc_2, doc_7, doc_55, ...]\n",
    "\"tree\"          → [doc_3, doc_99, ...]\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's the inversion of the natural index (document → terms). Hence the name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "import re\n",
    "\n",
    "def tokenize(text: str) -> list[str]:\n",
    "    # Lowercase + alphanumeric words only\n",
    "    return re.findall(r'\\b[a-záéíóúàâêôãõüç]+\\b', text.lower())\n",
    "\n",
    "def build_index(docs: list[str]) -> dict[str, set[int]]:\n",
    "    index = defaultdict(set)\n",
    "    for doc_id, doc in enumerate(docs):\n",
    "        for term in tokenize(doc):\n",
    "            index[term].add(doc_id)\n",
    "    return dict(index)\n",
    "\n",
    "def search(query: str, index: dict, docs: list[str]) -> list[str]:\n",
    "    terms = tokenize(query)\n",
    "    # Intersection: documents that contain ALL terms\n",
    "    if not terms:\n",
    "        return []\n",
    "    result = index.get(terms[0], set())\n",
    "    for term in terms[1:]:\n",
    "        result = result & index.get(term, set())\n",
    "    return [docs[i] for i in result]\n",
    "\n",
    "docs = [\n",
    "    \"Neural networks learn hierarchical representations\",\n",
    "    \"Gradient descent optimizes neural networks\",\n",
    "    \"Transformers are neural networks with attention\",\n",
    "    \"Random forests do not use gradients\",\n",
    "]\n",
    "\n",
    "index = build_index(docs)\n",
    "print(search(\"neural networks\", index, docs))\n",
    "# ['Neural networks learn...', 'Gradient descent optimizes...', 'Transformers are...']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now search is O(1) per term (dictionary lookup) + O(k) to intersect posting lists, where k is the number of documents containing the term — much smaller than the entire corpus.\n",
    "\n",
    "**But a new problem emerged:** are the three documents about neural networks equally relevant for the query \"neural networks\"? Intuitively, no — some are *about* the topic, others just mention it. We need ranking.\n",
    "\n",
    "---\n",
    "\n",
    "## Problem 2: Not every occurrence counts the same\n",
    "\n",
    "Consider two documents about \"machine learning\":\n",
    "\n",
    "- **Doc A:** \"Machine learning is a subfield of artificial intelligence. Machine learning uses data to learn patterns. Machine learning algorithms include neural networks and decision trees.\"\n",
    "- **Doc B:** \"Python is used in machine learning, but also in web development and automation.\"\n",
    "\n",
    "For the query \"machine learning\", Doc A is clearly more relevant — the term appears 3 times and is the central theme. Doc B mentions it once, in passing.\n",
    "\n",
    "**Term Frequency (TF):** how many times the term appears in the document.\n",
    "\n",
    "But TF alone has a problem: the word \"the\" appears in almost every document. High frequency of \"the\" doesn't indicate relevance.\n",
    "\n",
    "**Inverse Document Frequency (IDF):** words that appear in few documents are more informative."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "IDF(term) = log(N / df(term))\n",
    "\n",
    "where:\n",
    "  N = total number of documents\n",
    "  df = number of documents that contain the term\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Machine learning\" appears in 100 of 10,000 docs → IDF = log(100) ≈ 4.6  \n",
    "\"the\" appears in 9,999 of 10,000 docs → IDF ≈ 0.0001\n",
    "\n",
    "Multiply the two: **TF-IDF**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "from collections import Counter\n",
    "\n",
    "def calculate_tf(doc_tokens: list[str]) -> dict[str, float]:\n",
    "    counts = Counter(doc_tokens)\n",
    "    total = len(doc_tokens)\n",
    "    return {term: freq / total for term, freq in counts.items()}\n",
    "\n",
    "def calculate_idf(docs_tokens: list[list[str]]) -> dict[str, float]:\n",
    "    N = len(docs_tokens)\n",
    "    df = defaultdict(int)\n",
    "    for tokens in docs_tokens:\n",
    "        for term in set(tokens):\n",
    "            df[term] += 1\n",
    "    return {term: math.log(N / freq) for term, freq in df.items()}\n",
    "\n",
    "def score_tfidf(query: str, doc: str, idf: dict) -> float:\n",
    "    query_tokens = tokenize(query)\n",
    "    doc_tokens = tokenize(doc)\n",
    "    tf = calculate_tf(doc_tokens)\n",
    "    return sum(tf.get(t, 0) * idf.get(t, 0) for t in query_tokens)\n",
    "\n",
    "# Example\n",
    "docs = [\n",
    "    \"Machine learning is a subfield of AI. Machine learning uses data. Machine learning algorithms include neural networks.\",\n",
    "    \"Python is used in machine learning, but also in web and automation.\",\n",
    "    \"Deep learning is a subfield of machine learning with multiple layers.\",\n",
    "]\n",
    "\n",
    "docs_tokens = [tokenize(d) for d in docs]\n",
    "idf = calculate_idf(docs_tokens)\n",
    "\n",
    "query = \"machine learning\"\n",
    "scores = [(score_tfidf(query, doc, idf), doc[:50]) for doc in docs]\n",
    "scores.sort(reverse=True)\n",
    "for score, doc in scores:\n",
    "    print(f\"{score:.4f} | {doc}\")\n",
    "\n",
    "# 0.0812 | Machine learning is a subfield of AI...   ← more relevant\n",
    "# 0.0271 | Deep learning is a subfield of machine...\n",
    "# 0.0108 | Python is used in machine learning...     ← less relevant"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "TF-IDF has been the standard search ranking for decades. But it has known limitations:\n\n1. **Length normalization is incomplete:** our TF formula divides by document length, but a long document that repeatedly mentions \"ML\" in passing can still outscore a short document where \"ML\" is the central theme. BM25 addresses this more robustly.\n2. **TF grows linearly without saturation:** 10 occurrences isn't 10× more relevant than 1, but TF-IDF treats it that way.\n\n---\n\n## Concept 2: BM25 — The standard ranking algorithm\n\n**BM25** (Best Matching 25) emerged in the 1990s and remains the standard lexical retrieval baseline — widely used in Lucene-based systems like OpenSearch and Elasticsearch.\n\nBM25 addresses exactly the problems of TF-IDF with two parameters:\n\n- **k1** (typically 1.2–2.0): controls term frequency saturation. Higher values let repeated occurrences keep contributing; lower values make the score saturate faster. Unlike TF, the contribution never grows linearly without bound.\n- **b** (typically 0.75): controls length normalization."
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "BM25(q, d) = Σ IDF(t) × [TF(t,d) × (k1 + 1)] / [TF(t,d) + k1 × (1 - b + b × |d|/avgdl)]\n",
    "\n",
    "where:\n",
    "  |d|    = document length\n",
    "  avgdl  = average document length in the corpus\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "class BM25:\n    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):\n        self.k1 = k1\n        self.b = b\n        self.docs_tokens = [tokenize(d) for d in docs]\n        self.N = len(docs)\n        self.avgdl = sum(len(t) for t in self.docs_tokens) / self.N\n\n        # IDF for each term\n        df = defaultdict(int)\n        for tokens in self.docs_tokens:\n            for term in set(tokens):\n                df[term] += 1\n        self.idf = {\n            t: math.log((self.N - freq + 0.5) / (freq + 0.5) + 1)\n            for t, freq in df.items()\n        }\n\n    def score(self, query: str, doc_id: int) -> float:\n        query_tokens = tokenize(query)\n        doc_tokens = self.docs_tokens[doc_id]\n        dl = len(doc_tokens)\n        tf_map = Counter(doc_tokens)\n\n        total = 0.0\n        for term in query_tokens:\n            if term not in self.idf:\n                continue\n            tf = tf_map.get(term, 0)\n            idf = self.idf[term]\n            numerator = tf * (self.k1 + 1)\n            denominator = tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl)\n            total += idf * (numerator / denominator)\n        return total\n\n    def search(self, query: str, docs: list[str], top_k: int = 5) -> list[tuple]:\n        # Scores all documents for clarity. A real BM25 engine would use the\n        # inverted index to score only documents containing query terms.\n        scores = [(self.score(query, i), docs[i]) for i in range(self.N)]\n        return sorted(scores, reverse=True)[:top_k]"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The practical difference between TF-IDF and BM25 becomes clear with documents of varying lengths:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = [\n",
    "    # Short and focused\n",
    "    \"BM25 is the standard ranking algorithm in modern search engines.\",\n",
    "    # Long with many mentions, but diluted\n",
    "    \"This is a very long document about many things. It talks about NLP, IR, machine learning, Python, statistics, linear algebra, and neural networks. BM25 appears here too, but it is a minor detail among all of that. We keep talking about other topics for many more paragraphs.\",\n",
    "    # Relevant but without mentioning exactly \"BM25\"\n",
    "    \"Okapi BM25 extends the probabilistic model by Robertson and Spärck Jones.\",\n",
    "]\n",
    "\n",
    "bm25 = BM25(docs)\n",
    "results = bm25.search(\"BM25 ranking\", docs)\n",
    "for score, doc in results:\n",
    "    print(f\"{score:.4f} | {doc[:60]}...\")\n",
    "# The short, focused document ranks above the long one with many mentions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "---\n\n## Problem 3: How do we know if our search is good?\n\nWe built a system. But how do we know it works well? \"Looks right\" isn't a metric. Before adding more complexity — embeddings, hybrid search, reranking — we need a way to measure whether each new layer actually improves retrieval.\n\nIR has a well-established set of evaluation metrics, all based on a simple concept: **judged relevance**.\n\nTo evaluate, we need a **test set**: a set of queries where humans have judged which documents are relevant. With that, we compare what our system returned with what it should have returned.\n\n### Precision and Recall"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "Precision@k = (relevant documents in top-k) / k\n",
    "Recall@k    = (relevant documents in top-k) / (total relevant documents)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def precision_at_k(retrieved: list, relevant: set, k: int) -> float:\n",
    "    top_k = retrieved[:k]\n",
    "    return len(set(top_k) & relevant) / k\n",
    "\n",
    "def recall_at_k(retrieved: list, relevant: set, k: int) -> float:\n",
    "    top_k = retrieved[:k]\n",
    "    return len(set(top_k) & relevant) / len(relevant)\n",
    "\n",
    "# Example\n",
    "retrieved = [\"doc_2\", \"doc_5\", \"doc_1\", \"doc_8\", \"doc_3\"]\n",
    "relevant   = {\"doc_1\", \"doc_2\", \"doc_4\", \"doc_7\"}\n",
    "\n",
    "print(f\"P@3 = {precision_at_k(retrieved, relevant, 3):.2f}\")  # 2/3 = 0.67\n",
    "print(f\"R@3 = {recall_at_k(retrieved, relevant, 3):.2f}\")     # 2/4 = 0.50\n",
    "print(f\"P@5 = {precision_at_k(retrieved, relevant, 5):.2f}\")  # 2/5 = 0.40\n",
    "print(f\"R@5 = {recall_at_k(retrieved, relevant, 5):.2f}\")     # 2/4 = 0.50"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's a natural trade-off: increasing k increases recall (more relevant documents captured), but usually decreases precision (more irrelevant ones included).\n",
    "\n",
    "### Mean Reciprocal Rank (MRR)\n",
    "\n",
    "For cases where there's only *one* correct document (e.g., factual answer search), we use MRR: what position is the first relevant result in?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def reciprocal_rank(retrieved: list, relevant: set) -> float:\n",
    "    for i, doc in enumerate(retrieved, 1):\n",
    "        if doc in relevant:\n",
    "            return 1 / i\n",
    "    return 0.0\n",
    "\n",
    "def mean_reciprocal_rank(results: list[tuple[list, set]]) -> float:\n",
    "    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)\n",
    "\n",
    "query_results = [\n",
    "    ([\"doc_3\", \"doc_1\", \"doc_2\"], {\"doc_1\"}),  # RR = 1/2\n",
    "    ([\"doc_5\", \"doc_4\", \"doc_2\"], {\"doc_4\"}),  # RR = 1/2\n",
    "    ([\"doc_1\", \"doc_2\", \"doc_3\"], {\"doc_1\"}),  # RR = 1/1\n",
    "]\n",
    "print(f\"MRR = {mean_reciprocal_rank(query_results):.3f}\")  # (0.5 + 0.5 + 1.0) / 3 ≈ 0.667"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### NDCG (Normalized Discounted Cumulative Gain)\n",
    "\n",
    "MRR and Precision treat relevance as binary (relevant/irrelevant). NDCG supports **degrees of relevance** (e.g., highly relevant = 3, relevant = 2, marginal = 1, irrelevant = 0).\n",
    "\n",
    "The idea: a highly relevant document in position 1 is worth more than in position 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dcg_at_k(scores: list[int], k: int) -> float:\n",
    "    \"\"\"scores[i] = relevance grade of the i-th result\"\"\"\n",
    "    return sum(s / math.log2(i + 2) for i, s in enumerate(scores[:k]))\n",
    "\n",
    "def ndcg_at_k(retrieved_scores: list[int], ideal_scores: list[int], k: int) -> float:\n",
    "    dcg = dcg_at_k(retrieved_scores, k)\n",
    "    idcg = dcg_at_k(sorted(ideal_scores, reverse=True), k)\n",
    "    return dcg / idcg if idcg > 0 else 0.0\n",
    "\n",
    "# retrieved: relevance scores of the returned documents in order\n",
    "retrieved_scores = [3, 1, 2, 0, 3]\n",
    "ideal_scores     = [3, 3, 2, 1, 0]  # best possible ordering\n",
    "\n",
    "print(f\"NDCG@5 = {ndcg_at_k(retrieved_scores, ideal_scores, 5):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "**For RAG specifically:** Recall@k is usually the first metric to optimize — the answer cannot be grounded if the evidence never reaches the LLM. But precision also matters: irrelevant or conflicting chunks can dilute the context and lead to wrong answers, even when the right chunk is present.\n\n---\n\n## Problem 4: \"Car\" and \"automobile\" — lexical search doesn't understand synonyms\n\nBM25 is powerful, but has a fundamental limitation: it operates on the textual surface. It doesn't know that \"car\" and \"automobile\" mean the same thing. It doesn't know that \"ML engineer\" and \"machine learning engineer\" are equivalent."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "docs = [\n",
    "    \"How to rent an automobile in Brazil\",\n",
    "    \"Maintenance guide for your vehicle\",\n",
    "    \"The best electric cars of 2024\",\n",
    "]\n",
    "\n",
    "bm25 = BM25(docs)\n",
    "results = bm25.search(\"rent car\", docs)\n",
    "# \"How to rent an automobile\" ranks LOW — contains \"rent\" but not \"car\"\n",
    "# \"The best electric cars\" ranks HIGH — contains \"cars\" but not \"rent\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "This problem is called **vocabulary mismatch** — the query vocabulary doesn't match the document vocabulary.\n\nThe solution: instead of comparing words, compare **meanings** — represented as vectors in a semantic space.\n\n---\n\n## Concept 3: Semantic Search with Embeddings\n\nAn **embedding** is a vector of real numbers representing the meaning of a text. Models like `sentence-transformers` are trained so that semantically similar texts produce nearby vectors in vector space.\n\n> **Note on embedding models:** `all-MiniLM-L6-v2` is used here for simplicity. In production, prefer models trained specifically for retrieval — such as E5, BGE, or BGE-M3 — which use contrastive training on query-document pairs and often perform significantly better on IR benchmarks like BEIR and MTEB."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "# Semantic search example. Uses sentence-transformers when available; otherwise\n# falls back to a tiny deterministic concept embedding for demonstration.\ntry:\n    from sentence_transformers import SentenceTransformer\n    import numpy as np\n    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')\n    using_real_embeddings = True\nexcept Exception as exc:\n    import math\n    np = None\n    model = None\n    using_real_embeddings = False\n    print(f\"Using fallback embeddings instead of sentence-transformers: {type(exc).__name__}\")\n\ndocs = [\n    \"How to rent an automobile in Brazil\",\n    \"Maintenance guide for your vehicle\",\n    \"The best electric cars of 2024\",\n    \"Transformers revolutionized natural language processing\",\n]\n\ndef fallback_embedding(text: str) -> list[float]:\n    tokens = set(tokenize(text))\n    concepts = [\n        {\"rent\", \"car\", \"cars\", \"automobile\", \"vehicle\", \"transportation\"},\n        {\"maintenance\", \"guide\", \"repair\", \"vehicle\"},\n        {\"electric\", \"cars\", \"ev\", \"battery\"},\n        {\"transformers\", \"language\", \"processing\", \"nlp\"},\n    ]\n    return [float(len(tokens & concept)) for concept in concepts]\n\ndef cosine(a, b) -> float:\n    numerator = sum(x * y for x, y in zip(a, b))\n    denominator = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))\n    return numerator / denominator if denominator else 0.0\n\nif using_real_embeddings:\n    doc_embeddings = model.encode(docs)\nelse:\n    doc_embeddings = [fallback_embedding(doc) for doc in docs]\n\ndef semantic_search(query: str, top_k: int = 3) -> list[tuple]:\n    if using_real_embeddings:\n        query_embedding = model.encode([query])[0]\n        # Cosine similarity. For L2-normalized vectors this is equivalent to dot\n        # product — which is why production systems (FAISS, OpenSearch) normalize\n        # vectors at index time and use inner product for fast ANN search.\n        similarities = np.dot(doc_embeddings, query_embedding) / (\n            np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)\n        )\n        indices = np.argsort(similarities)[::-1][:top_k]\n        return [(float(similarities[i]), docs[i]) for i in indices]\n\n    query_embedding = fallback_embedding(query)\n    similarities = [cosine(doc_embedding, query_embedding) for doc_embedding in doc_embeddings]\n    indices = sorted(range(len(docs)), key=lambda i: similarities[i], reverse=True)[:top_k]\n    return [(similarities[i], docs[i]) for i in indices]\n\nresults = semantic_search(\"rent car\")\nfor score, doc in results:\n    print(f\"{score:.3f} | {doc}\")"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "Semantic search captures intent and synonyms. But it has its own weaknesses:\n\n**When lexical search is better:**\n- Queries with exact technical terms: `NullPointerException`, `CVE-2024-1234`, `config.yaml`\n- Proper names and acronyms: `GPT-4`, `BERT`, `Fernando Wittmann`\n- Product/code-specific searches: `SELECT * FROM users WHERE id = 42`\n\n**When semantic search is better:**\n- Natural language queries: \"how do I...?\"\n- Synonyms and paraphrases\n- Multilingual queries (question in PT, doc in EN)\n- Intent without specific keywords\n\n---\n\n## Concept 4: Hybrid Search — the best of both worlds\n\nIn production, we rarely use just one method. The modern approach combines **BM25 + semantic search** via rank fusion.\n\nThe simplest and most effective method is **Reciprocal Rank Fusion (RRF)**:"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "RRF_score(d) = Σ 1 / (k + rank_i(d))\n",
    "\n",
    "where rank_i(d) is the position of document d in ranking i\n",
    "and k is a constant (typically 60)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "def reciprocal_rank_fusion(\n    rankings: list[list[int]],\n    k: int = 60\n) -> list[tuple[int, float]]:\n    \"\"\"\n    rankings: list of doc_id lists, in relevance order\n    returns: list of (doc_id, score) ordered by score\n    \"\"\"\n    scores = defaultdict(float)\n    for ranking in rankings:\n        for position, doc_id in enumerate(ranking, 1):\n            scores[doc_id] += 1.0 / (k + position)\n    return sorted(scores.items(), key=lambda x: x[1], reverse=True)\n\ndef semantic_ranking(query: str, docs: list[str]) -> list[int]:\n    if using_real_embeddings:\n        query_emb = model.encode([query])[0]\n        doc_embs = model.encode(docs)\n        sims = np.dot(doc_embs, query_emb) / (\n            np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)\n        )\n        return list(np.argsort(sims)[::-1])\n\n    query_embedding = fallback_embedding(query)\n    sims = [cosine(fallback_embedding(doc), query_embedding) for doc in docs]\n    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)\n\ndef hybrid_search(query: str, docs: list[str], top_k: int = 5):\n    # BM25 ranking\n    bm25 = BM25(docs)\n    bm25_scores = [(bm25.score(query, i), i) for i in range(len(docs))]\n    bm25_ranking = [i for _, i in sorted(bm25_scores, reverse=True)]\n\n    # Semantic ranking\n    semantic = semantic_ranking(query, docs)\n\n    # RRF fusion\n    fused = reciprocal_rank_fusion([bm25_ranking, semantic])\n    return [(docs[doc_id], score) for doc_id, score in fused[:top_k]]\n\nfor doc, score in hybrid_search(\"rent car\", docs):\n    print(f\"{score:.4f} | {doc}\")"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "Empirical research consistently shows that hybrid search outperforms any single method on IR benchmarks. The intuition: the two methods fail in different cases, and fusion cancels out the errors. That said, hybrid search doesn't rescue a bad embedding model, noisy OCR text, or poor chunking — it amplifies whatever quality each branch already has.\n\n---\n\n## The Engine Underneath: Approximate Nearest Neighbor (ANN)\n\nThere's a practical problem hidden in semantic search: to find the k vectors most similar to a query, we'd need to compute similarity with all N documents — O(N × d), where d is the embedding dimension (typically 768–1536).\n\nFor 10M documents with 768-dimensional embeddings, that's ~7.68 billion operations per query. Infeasible.\n\nThe solution: **Approximate Nearest Neighbor (ANN)** — algorithms that find the k nearest neighbors *approximately*, with a small precision sacrifice in exchange for absurd speed.\n\nThe most widely used:\n\n- **HNSW** (Hierarchical Navigable Small World): hierarchical navigation graph. Standard in modern systems. Behaves sublinearly in practice — often close to O(log N) — though actual latency and memory depend heavily on index parameters, vector dimension, and recall target.\n- **IVF** (Inverted File Index): divides vector space into clusters, searches only in the nearest clusters.\n- **Product Quantization**: compresses vectors to reduce memory."
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "```\nBrute force:  O(N × d)   — infeasible for N > 100k\nHNSW:         ~O(log N)  — fast in practice; billions of vectors typically\n                           require distributed infrastructure or quantization\n```"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Libraries like `faiss` (Meta) and `hnswlib` implement these algorithms. OpenSearch and Elasticsearch have native support via k-NN plugins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ANN example. Uses FAISS when available; otherwise demonstrates a tiny\n",
    "# brute-force nearest-neighbor search with only the Python standard library.\n",
    "try:\n",
    "    import faiss\n",
    "    import numpy as np\n",
    "\n",
    "    d = 128  # embedding dimension\n",
    "    N = 100_000\n",
    "    embeddings = np.random.rand(N, d).astype('float32')\n",
    "    faiss.normalize_L2(embeddings)\n",
    "\n",
    "    index = faiss.IndexHNSWFlat(d, 32)  # 32 = number of connections per node\n",
    "    index.add(embeddings)\n",
    "\n",
    "    query = np.random.rand(1, d).astype('float32')\n",
    "    faiss.normalize_L2(query)\n",
    "\n",
    "    distances, indices = index.search(query, k=10)\n",
    "    print(indices[0][:5])\n",
    "except Exception as exc:\n",
    "    import math\n",
    "    import random\n",
    "\n",
    "    print(f\"FAISS not available, using pure-Python brute-force demo: {type(exc).__name__}\")\n",
    "    random.seed(42)\n",
    "    d = 16\n",
    "    N = 1_000\n",
    "\n",
    "    def normalize(vector):\n",
    "        length = math.sqrt(sum(value * value for value in vector))\n",
    "        return [value / length for value in vector] if length else vector\n",
    "\n",
    "    embeddings = [normalize([random.random() for _ in range(d)]) for _ in range(N)]\n",
    "    query = normalize([random.random() for _ in range(d)])\n",
    "    scores = [(sum(a * b for a, b in zip(vector, query)), idx) for idx, vector in enumerate(embeddings)]\n",
    "    top = sorted(scores, reverse=True)[:10]\n",
    "    print([idx for _, idx in top[:5]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Concept 5: How documents are processed before indexing\n",
    "\n",
    "Before indexing, every search engine applies a **text analysis** pipeline that transforms raw documents into indexable tokens. Each step has a purpose:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "raw text → tokenization → normalization → stopwords → stemming/lemmatization → index\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "from typing import Optional\n",
    "\n",
    "STOPWORDS_EN = {\n",
    "    'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from',\n",
    "    'in', 'is', 'it', 'of', 'on', 'or', 'that', 'the', 'to', 'was',\n",
    "    'were', 'with'\n",
    "}\n",
    "\n",
    "def analyze_text(text: str, remove_stopwords: bool = True) -> list[str]:\n",
    "    # 1. Lowercase\n",
    "    text = text.lower()\n",
    "\n",
    "    # 2. Normalize accents (simplified)\n",
    "    accent_map = str.maketrans('áéíóúàâêôãõüç', 'aeiouaaeoaouc')\n",
    "    text = text.translate(accent_map)\n",
    "\n",
    "    # 3. Tokenize\n",
    "    tokens = re.findall(r'\\b\\w+\\b', text)\n",
    "\n",
    "    # 4. Remove stopwords\n",
    "    if remove_stopwords:\n",
    "        tokens = [t for t in tokens if t not in STOPWORDS_EN]\n",
    "\n",
    "    # 5. Simple stemming (common English suffixes)\n",
    "    stems = []\n",
    "    for token in tokens:\n",
    "        if token.endswith('ing'):\n",
    "            token = token[:-3]  # \"learning\" → \"learn\"\n",
    "        elif token.endswith('tion'):\n",
    "            token = token[:-4] + 't'\n",
    "        stems.append(token)\n",
    "\n",
    "    return stems\n",
    "\n",
    "print(analyze_text(\"Neural networks are learning representations from data\"))\n",
    "# ['neural', 'networks', 'learn', 'representations', 'data']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Stemming vs Lemmatization:**\n",
    "- **Stemming:** mechanically strips suffixes. Fast, but imprecise (`\"studies\"` → `\"studi\"`).\n",
    "- **Lemmatization:** uses a dictionary to find the base form. Accurate, but slower (`\"studies\"` → `\"study\"`).\n",
    "\n",
    "In production, the choice depends on language and the performance/quality trade-off.\n",
    "\n",
    "---\n",
    "\n",
    "## Tying It All Together: IR in RAG and LLM Agents\n",
    "\n",
    "Now that you understand the fundamentals, let's see how this connects to modern LLM systems.\n",
    "\n",
    "### RAG (Retrieval-Augmented Generation)\n",
    "\n",
    "RAG is essentially IR + text generation:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "user query\n",
    "       ↓\n",
    "[RETRIEVAL] → top-k relevant documents\n",
    "       ↓\n",
    "[AUGMENTATION] → query + documents build the context\n",
    "       ↓\n",
    "[GENERATION] → LLM generates an answer based on context\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**The bottleneck is almost always retrieval.** If the right documents don't reach the LLM, generation can't be correct — and the LLM may hallucinate or simply answer \"I don't know.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def simple_rag_pipeline(\n",
    "    query: str,\n",
    "    corpus: list[str],\n",
    "    llm_client,\n",
    "    top_k: int = 5\n",
    ") -> str:\n",
    "    # 1. Retrieval: find relevant documents\n",
    "    bm25 = BM25(corpus)\n",
    "    scores = [(bm25.score(query, i), i) for i in range(len(corpus))]\n",
    "    top_indices = [i for _, i in sorted(scores, reverse=True)[:top_k]]\n",
    "    documents = [corpus[i] for i in top_indices]\n",
    "\n",
    "    # 2. Augmentation: build the context\n",
    "    context = \"\\n\\n\".join(f\"[{i+1}] {doc}\" for i, doc in enumerate(documents))\n",
    "    prompt = f\"\"\"Based on the documents below, answer: {query}\n",
    "\n",
    "Documents:\n",
    "{context}\n",
    "\n",
    "Answer:\"\"\"\n",
    "\n",
    "    # 3. Generation\n",
    "    return llm_client.generate(prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Query Rewriting and HyDE\n",
    "\n",
    "LLMs can improve retrieval *before* searching:\n",
    "\n",
    "**Query Expansion:** expand the query with synonyms and related terms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def expand_query(query: str, llm_client) -> str:\n",
    "    prompt = f\"\"\"Given the search query: \"{query}\"\n",
    "Generate an expanded version with synonyms and related terms.\n",
    "Return only the expanded query, with no explanations.\n",
    "Example: \"car\" → \"car automobile vehicle transportation\"\n",
    "\"\"\"\n",
    "    return llm_client.generate(prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "**HyDE (Hypothetical Document Embeddings):** instead of embedding the query directly, ask the LLM to generate a hypothetical document that would answer the query, and embed that document. It works because user queries are short, informal, and fragmentary, while corpus documents are longer and written in a different register. Generating a hypothetical answer bridges that distributional gap — the resulting embedding lands closer to real documents in vector space."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def hyde(query: str, llm_client, embedding_model):\n",
    "    # Generate a hypothetical document\n",
    "    prompt = f\"Write a paragraph that directly answers: {query}\"\n",
    "    hypothetical_doc = llm_client.generate(prompt)\n",
    "\n",
    "    # Embed the hypothetical document, not the query\n",
    "    return embedding_model.encode([hypothetical_doc])[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Agents with search tools\n",
    "\n",
    "In agent systems, the search engine becomes a **tool**. The agent decides when and how to search:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tools = [\n",
    "    {\n",
    "        \"name\": \"search_documents\",\n",
    "        \"description\": \"Searches relevant documents in the knowledge base\",\n",
    "        \"parameters\": {\n",
    "            \"query\": \"string — what you want to find\",\n",
    "            \"top_k\": \"int — how many documents to return (default: 5)\",\n",
    "            \"filters\": \"dict — optional metadata filters (date, author, type)\"\n",
    "        }\n",
    "    },\n",
    "    {\n",
    "        \"name\": \"search_code\",\n",
    "        \"description\": \"Searches code snippets in the repository\",\n",
    "        \"parameters\": {\n",
    "            \"query\": \"string — description of the code or function being searched for\"\n",
    "        }\n",
    "    }\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this context, retrieval quality directly affects the agent's ability to reason and act correctly. An agent that searches poorly, reasons poorly.\n",
    "\n",
    "### Chunking: splitting documents for indexing\n",
    "\n",
    "Long documents need to be split into chunks before indexing. This deeply affects retrieval quality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def chunk_by_sentence(\n",
    "    text: str,\n",
    "    max_tokens: int = 512,\n",
    "    overlap: int = 50\n",
    ") -> list[str]:\n",
    "    \"\"\"\n",
    "    Split text into chunks while respecting sentences,\n",
    "    with overlap to avoid losing context at the edges.\n",
    "    \"\"\"\n",
    "    sentences = text.split('. ')\n",
    "    chunks = []\n",
    "    current_chunk = []\n",
    "    current_tokens = 0\n",
    "\n",
    "    for sentence in sentences:\n",
    "        sentence_tokens = len(sentence.split())\n",
    "        if current_tokens + sentence_tokens > max_tokens and current_chunk:\n",
    "            chunks.append('. '.join(current_chunk) + '.')\n",
    "            # Overlap: keep the last sentences\n",
    "            overlap_words = ' '.join(current_chunk[-2:]).split()\n",
    "            current_chunk = [' '.join(overlap_words[-overlap:])]\n",
    "            current_tokens = overlap\n",
    "        current_chunk.append(sentence)\n",
    "        current_tokens += sentence_tokens\n",
    "\n",
    "    if current_chunk:\n",
    "        chunks.append('. '.join(current_chunk))\n",
    "\n",
    "    return chunks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "**Chunk size vs quality:**\n- Chunks too small (< 100 tokens): lose context, make generation harder\n- Chunks too large (> 1000 tokens): dilute the relevance signal\n- Starting point: 256–512 tokens, with 10–20% overlap\n\nThe right size depends on the embedding model's context window, document structure, and query type. Treat it as a starting point and validate with retrieval metrics — don't cargo-cult the number.\n\n---\n\n## Re-ranking: refining after retrieval\n\nA common production pattern is **two-stage retrieval**:\n\n1. **Candidate retrieval:** fast search (BM25 or ANN) returns top-100 candidates\n2. **Re-ranking:** a more precise model (cross-encoder) reorders the top-100"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "query → [BM25/ANN] → top-100 candidates → [Cross-Encoder] → reranked top-10\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Why two stages?** Cross-encoders analyze query and document *together* — full attention over both. They're much more accurate than bi-encoders (separate embeddings), but also much slower. Using them on 100 candidates, not 10 million, is feasible."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cross-encoder example. Uses sentence-transformers when available; otherwise\n",
    "# falls back to lexical overlap so the cell remains executable.\n",
    "try:\n",
    "    from sentence_transformers import CrossEncoder\n",
    "    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')\n",
    "\n",
    "    def two_stage_retrieval(query: str, candidates: list[str], top_k: int = 10):\n",
    "        pairs = [[query, doc] for doc in candidates]\n",
    "        scores = reranker.predict(pairs)\n",
    "        result = sorted(zip(scores, candidates), reverse=True)\n",
    "        return [doc for _, doc in result[:top_k]]\n",
    "except Exception as exc:\n",
    "    print(f\"CrossEncoder not available, using lexical-overlap fallback: {type(exc).__name__}\")\n",
    "\n",
    "    def two_stage_retrieval(query: str, candidates: list[str], top_k: int = 10):\n",
    "        query_terms = set(tokenize(query))\n",
    "        scored = []\n",
    "        for doc in candidates:\n",
    "            score = len(query_terms & set(tokenize(doc)))\n",
    "            scored.append((score, doc))\n",
    "        result = sorted(scored, reverse=True)\n",
    "        return [doc for _, doc in result[:top_k]]\n",
    "\n",
    "print(two_stage_retrieval(\"rent car\", docs, top_k=3))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8877d466",
   "source": "---\n\n## What separates the didactic from the real\n\nThe examples in this article are intentionally simplified. Real production systems — OpenSearch, Elasticsearch, Weaviate, Qdrant — solve additional problems that would be articles on their own:\n\n| Aspect | Didactic | Production |\n|---|---|---|\n| **Scale** | In-memory, MB | Distributed, TB+ |\n| **Indexing** | Synchronous, full rebuild | Incremental, online |\n| **ANN** | FAISS/HNSW in-memory | FAISS/Qdrant/Weaviate/OpenSearch with tuning, persistence, sharding, filters |\n| **Text** | Simple tokenizer | Language-specific analyzers, stemmer, synonyms |\n| **Ranking** | Standard BM25 | BM25 + learning-to-rank |\n| **Resilience** | No fault tolerance | Replication, sharding |\n| **Security** | No authentication | RBAC, field-level security |\n| **Monitoring** | None | p99 latency, relevance drift |\n\nWhen you use OpenSearch or Elasticsearch, BM25 is there — just implemented in Java with decades of optimizations. When you use Weaviate or Qdrant, HNSW is there — implemented in Go/Rust with managed memory.\n\n**The point of understanding the fundamentals:** when your RAG system returns garbage, you know where to look. When precision@k metrics drop, you know how to investigate. When the user says \"search can't find X\", you know whether it's a vocabulary problem (BM25), an embedding problem (semantic), chunking, or text analysis.\n\n---\n\n## Summary: the IR trail",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "id": "18df1ffa",
   "source": "```\nProblem                     →  Introduced solution\n────────────────────────────────────────────────────────\nSlow O(N) search            →  Inverted index\nNo ranking                  →  TF-IDF\nDoc length/saturation       →  BM25\nHow to measure quality?     →  P@k, Recall@k, MRR, NDCG\nVocabulary mismatch         →  Embeddings + cosine similarity\nANN too slow at scale       →  HNSW / FAISS\nLexical vs semantic         →  Hybrid search + RRF\nShort queries vs docs       →  HyDE, query expansion\nTop-100 → Top-10 precision  →  Two-stage + cross-encoder\nWrong doc despite relevance →  Metadata filters (tenant, status, date, ACL)\nDid retrieval help the LLM? →  RAG-specific metrics (Context Recall, Faithfulness)\n```",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "b0d2a71d",
   "source": "The field of Information Retrieval has 60+ years of research. LLMs are new; search isn't. Understanding IR well is what separates RAG pipelines that work in demos from those that work in production.\n\n---\n\n## Retrieval debugging checklist\n\nWhen retrieval breaks in production, most failures fall into one of these patterns:\n\n| Symptom | Likely cause | What to inspect |\n|---|---|---|\n| Exact term not found | Analyzer/tokenization issue | Token stream, stemming, stopword list |\n| Synonyms not found | Vocabulary mismatch | Embeddings, hybrid search, synonym expansion |\n| Old or deprecated docs retrieved | Missing freshness/status filter | Metadata filters, `status` field |\n| Wrong tenant/client docs returned | Permission or filter bug | ACL filters, tenant isolation |\n| Related chunks retrieved but answer is wrong | Chunk doesn't contain the evidence | Chunk boundaries, chunk size, Context Recall metric |\n| Top result looks right but LLM answer is wrong | Generation or context issue | Prompt, faithfulness metric, conflicting chunks |\n| Good results in testing, bad in production | Distribution shift | Query logs, embedding model mismatch, index staleness |\n\n---\n\n## Going further\n\n- **Executable notebook:** full implementation of the concepts above, available [in this repository](#).\n- **BEIR benchmark:** IR evaluation suite with 18 heterogeneous datasets — the standard for comparing retrieval systems.\n- **BM25 paper:** Robertson & Zaragoza (2009), \"The Probabilistic Relevance Framework: BM25 and Beyond.\"\n- **MTEB:** Massive Text Embedding Benchmark — evaluates embedding models on retrieval tasks.\n- **ColBERT:** late interaction — an efficient alternative to cross-encoders that maintains precision with reasonable latency.\n- **SPLADE and learned sparse models:** neural models that produce sparse representations with vocabulary expansion — combining lexical interpretability with semantic coverage. A middle ground worth exploring for domain-specific corpora.",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## What separates the didactic from the real\n",
    "\n",
    "The examples in this article are intentionally simplified. Real production systems — OpenSearch, Elasticsearch, Weaviate, Qdrant — solve additional problems that would be articles on their own:\n",
    "\n",
    "| Aspect | Didactic | Production |\n",
    "|---|---|---|\n",
    "| **Scale** | In-memory, MB | Distributed, TB+ |\n",
    "| **Indexing** | Synchronous, full rebuild | Incremental, online |\n",
    "| **ANN** | Brute force | HNSW with tuned parameters |\n",
    "| **Text** | Simple tokenizer | Language-specific analyzers, stemmer, synonyms |\n",
    "| **Ranking** | Standard BM25 | BM25 + learning-to-rank |\n",
    "| **Resilience** | No fault tolerance | Replication, sharding |\n",
    "| **Security** | No authentication | RBAC, field-level security |\n",
    "| **Monitoring** | None | p99 latency, relevance drift |\n",
    "\n",
    "When you use OpenSearch or Elasticsearch, BM25 is there — just implemented in Java with decades of optimizations. When you use Weaviate or Qdrant, HNSW is there — implemented in Go/Rust with managed memory.\n",
    "\n",
    "**The point of understanding the fundamentals:** when your RAG system returns garbage, you know where to look. When precision@k metrics drop, you know how to investigate. When the user says \"search can't find X\", you know whether it's a vocabulary problem (BM25), an embedding problem (semantic), chunking, or text analysis.\n",
    "\n",
    "---\n",
    "\n",
    "## Summary: the IR trail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "Problem                     →  Introduced solution\n",
    "────────────────────────────────────────────────────────\n",
    "Slow O(N) search            →  Inverted index\n",
    "No ranking                  →  TF-IDF\n",
    "Doc length/saturation       →  BM25\n",
    "How to measure quality?     →  P@k, Recall@k, MRR, NDCG\n",
    "Vocabulary mismatch         →  Embeddings + cosine similarity\n",
    "ANN too slow at scale       →  HNSW / FAISS\n",
    "Lexical vs semantic         →  Hybrid search + RRF\n",
    "Short queries vs docs       →  HyDE, query expansion\n",
    "Top-100 → Top-10 precision  →  Two-stage + cross-encoder\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The field of Information Retrieval has 60+ years of research. LLMs are new; search isn't. Understanding IR well is what separates RAG pipelines that work in demos from those that work in production.\n",
    "\n",
    "---\n",
    "\n",
    "## Going further\n",
    "\n",
    "- **Executable notebook:** full implementation of the concepts above, available [in this repository](#).\n",
    "- **BEIR benchmark:** IR evaluation suite with 18 heterogeneous datasets — the standard for comparing retrieval systems.\n",
    "- **BM25 paper:** Robertson & Zaragoza (2009), \"The Probabilistic Relevance Framework: BM25 and Beyond.\"\n",
    "- **MTEB:** Massive Text Embedding Benchmark — evaluates embedding models on retrieval tasks.\n",
    "- **ColBERT:** late interaction — an efficient alternative to cross-encoders that maintains precision with reasonable latency."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12",
   "mimetype": "text/x-python",
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "pygments_lexer": "ipython3",
   "nbconvert_exporter": "python",
   "file_extension": ".py"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}