RAG for LLMs
Tricky and effective
⌛ ~50 min 🤓 Intermediate
20.02.2025
#151

This post is a part of the AI engineering educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order in which they appear in Research can be arbitrary.

I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!


Retrieval-Augmented Generation (RAG) has emerged as one of the most compelling strategies for enhancing the factual grounding and contextual relevance of large language models (LLMs). The rapid growth in the capabilities of LLMs—such as GPT-based models, BERT derivatives, and other transformer-based architectures—has spurred research into strategies that leverage external knowledge sources. RAG is at the forefront of these efforts. By combining information retrieval with generative modeling, RAG can draw upon an external corpus (e.g., a vector database, knowledge graph, or curated document store) to supplement an LLM's internal learned representation. Hence, RAG helps LLMs produce answers that are not only fluent, coherent, and contextually holistic, but also significantly more factual and grounded in the latest knowledge.

In this article, I will explore the theoretical foundations of RAG, dive into the architectural components that constitute RAG-based pipelines, demonstrate step-by-step implementations (including various advanced techniques), discuss relevant open-source frameworks and state-of-the-art research, and compare RAG to other common approaches such as fine-tuning or knowledge distillation. My goal is to give you an in-depth, PhD-level understanding of RAG, covering everything from embedding-based retrieval algorithms to orchestrating multi-step interactions with large language models.

The core principle behind RAG is straightforward in theory: a large language model directly leverages external documents or data for context, instead of relying solely on the capacity of its internal parameters. But the actual implementation details can be quite intricate and require a deep understanding of vector databases, indexing, approximate nearest neighbor (ANN) search, chunking or segmentation of documents, and real-time orchestration with generative models.

Throughout this article, I will approach RAG from both theoretical and practical angles. On the theoretical side, I will examine how similarity measures in embedding spaces connect to the idea of retrieving semantically relevant pieces of information for the generative model. On the practical side, I will show typical code snippets in Python, referencing popular libraries and frameworks that implement RAG pipelines. I will also introduce advanced strategies like multi-query retrieval, memory augmentation, and specialized re-ranking methods, as well as discuss the potential pitfalls (e.g., hallucinations, mismatch in domain-specific embeddings, privacy or latency constraints) and how to mitigate them.

Background And Context

Before diving deeper, let me contextualize RAG's origins. Patrick Lewis et al. (2020) introduced the concept of retrieval-augmented generation to tackle knowledge-intensive NLP tasks. Their paper demonstrated that bridging retrieval techniques with generative models can outperform purely parametric approaches, including fully fine-tuned BERT and GPT variants, when the tasks demand factual accuracy and context. Since then, many follow-up works have expanded on RAG, exploring topics such as knowledge-grounded question answering, open-domain dialogue generation, and multi-turn reasoning.

The principle of RAG can be summarized as follows:

  1. User issues a query or prompt.
  2. The system converts this query into an embedding using a dedicated or pretrained encoder.
  3. A retrieval component (often an approximate nearest neighbor system) searches for relevant documents, text chunks, or knowledge items based on similarity to the query embedding.
  4. The top-k retrieved items are appended (or fed in as separate structured context) to the prompt or model input.
  5. The generative language model draws on both the provided context and its learned knowledge to generate a coherent answer.

By reusing or updating the external knowledge source, the system retains continuous access to new or changing information, which significantly reduces the need for frequent re-training or fine-tuning. This property is immensely beneficial in dynamic domains—like finance, e-commerce, news monitoring, or corporate knowledge bases—where the underlying information can change rapidly.
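
To make the five-step loop above concrete at a glance, here is a minimal sketch in Python. The callables embed, vector_search, and llm_generate are hypothetical stand-ins for the embedding model, the ANN index, and the generative model discussed throughout this article.

def answer_with_rag(query, embed, vector_search, llm_generate, top_k=5):
    """Minimal RAG loop: embed the query, retrieve top-k chunks, generate."""
    query_vector = embed(query)                    # step 2: encode the query
    chunks = vector_search(query_vector, k=top_k)  # step 3: ANN retrieval
    context = "\n\n".join(chunks)                  # step 4: assemble retrieved context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm_generate(prompt)                    # step 5: grounded generation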

Theoretical Foundation Of Retrieval-Augmented Generation

Linking Retrieval And Generation

RAG's theoretical structure hinges on the composition of two principal modules: a retriever $R$ and a generator $G$. Formally, let $q$ be the user query. The retriever $R(q)$ produces a set of relevant documents or passages $\{d_1, d_2, \ldots, d_k\}$. The generator $G(q, \{d_i\})$ is then tasked with producing a response $a$. Thus, we can define the process as:

$$a = \arg\max_{a} p_G(a \mid q, d_1, d_2, \ldots, d_k)$$

Here, $p_G$ denotes the probability distribution induced by the generative model.

The retrieved documents $\{d_i\}$ constitute external knowledge that augments the internal representation of the language model's parameters. Conceptually, the best results arise when the retrieval subsystem is tightly coupled to the generative subsystem, such that the retrieved knowledge directly supports the generation process (Lewis et al., 2020).

Embedding Space And Similarity Metrics

Key to RAG is the idea that both queries and documents live in a (typically high-dimensional) embedding space where dot product, cosine similarity, or other distance metrics reflect semantic closeness. Let $x$ be a text fragment (which could be a user query or a chunk of a document). An embedding model $E(\cdot)$ maps $x$ into a vector $\mathbf{v} \in \mathbb{R}^n$. For example,

$$\mathbf{v} = E(x),$$

where $n$ could be on the order of hundreds or thousands, depending on the embedding model.

The retrieval step typically relies on searching among these vectors for the $k$ closest neighbors to the query's embedding $\mathbf{v}_q = E(q)$. If $\mathbf{v}_d$ is the embedding for a document chunk $d$, the similarity might be measured by the cosine similarity $\cos(\mathbf{v}_q, \mathbf{v}_d)$ or the inner product $\mathbf{v}_q^\top \mathbf{v}_d$. Across large corpora (potentially billions of documents), approximate nearest neighbor (ANN) search algorithms (like Hierarchical Navigable Small World graphs or product quantization methods) are vital for making retrieval computationally tractable at scale.
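
To make the similarity computation concrete, here is a small NumPy sketch. The three-dimensional vectors are toy values; a real embedding model would produce vectors with hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(v_q, v_d):
    """Cosine similarity between a query vector and a document vector."""
    return float(np.dot(v_q, v_d) / (np.linalg.norm(v_q) * np.linalg.norm(v_d)))

# Toy example: the query embedding is closer to the first document chunk.
v_q  = np.array([0.9, 0.1, 0.3])
v_d1 = np.array([0.8, 0.2, 0.4])
v_d2 = np.array([0.1, 0.9, 0.2])
print(cosine_similarity(v_q, v_d1))  # higher score -> more semantically similar
print(cosine_similarity(v_q, v_d2))  # lower score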

Probabilistic Modeling

From a probabilistic standpoint, one might consider $p(d \mid q)$ as the probability that a document $d$ is relevant to query $q$. In many RAG systems, $p(d \mid q)$ is approximated by a function of the vector similarity $\mathrm{sim}(E(q), E(d))$. Then, the final generation is shaped by:

$$p(a \mid q) = \sum_{d \in \mathcal{D}} p_G(a \mid q, d) \, p(d \mid q)$$

where $\mathcal{D}$ is the entire document corpus. Implementing this sum explicitly is infeasible for large corpora, but approximate top-k retrieval picks out the most probable (or relevant) documents to reduce the search space.
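
A rough numerical sketch of this approximation: treat the similarity scores of the top-k retrieved chunks as unnormalized log-weights, normalize them with a softmax (one common modeling choice, not the only one), and mix the per-document generation probabilities. All numbers below are purely illustrative.

import numpy as np

# Similarity scores sim(E(q), E(d)) for the top-k retrieved chunks (illustrative).
similarities = np.array([0.82, 0.74, 0.51])

# Approximate p(d | q) over the retrieved set with a softmax of the similarities.
p_d_given_q = np.exp(similarities) / np.exp(similarities).sum()

# Hypothetical p_G(a | q, d) for one candidate answer under each retrieved chunk.
p_a_given_qd = np.array([0.60, 0.55, 0.20])

# p(a | q) is approximated by the sum over retrieved d of p_G(a | q, d) * p(d | q).
p_a_given_q = float((p_a_given_qd * p_d_given_q).sum())
print(p_a_given_q)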

Key Components Of A RAG Pipeline

Retriever

At the heart of RAG resides the retriever, which surfaces the most relevant documents from a large corpus given a query. Typically, a retriever is built on an embedding model plus a vector database that indexes these embeddings. Some well-known vector databases include:

  • FAISS (Facebook AI Similarity Search)
  • ScaNN (Scalable Nearest Neighbors by Google)
  • Annoy (Approximate Nearest Neighbors Oh Yeah)
  • Milvus
  • Pinecone
  • Chroma

Each of these solutions provides different trade-offs in terms of CPU/GPU usage, indexing speed, memory requirements, and query latency.

Because the retriever is critical for final performance, one often invests in specialized training or fine-tuning for the retrieval module. For instance, models like DPR (Karpukhin et al., 2020) or Contriever can yield strong retrieval performance when dealing with domain-specific corpora.
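
To make the retriever concrete, here is a minimal FAISS sketch with an exact inner-product index. The random vectors are stand-ins for real chunk embeddings, and at larger scale you would switch to an approximate index (e.g., IVF-PQ or HNSW) rather than a flat one.

import numpy as np
import faiss  # pip install faiss-cpu

dim = 384          # embedding dimensionality (depends on the embedding model)
n_chunks = 10_000  # number of indexed chunk embeddings

# Stand-in embeddings; in practice these come from your embedding model.
chunk_vectors = np.random.rand(n_chunks, dim).astype("float32")
faiss.normalize_L2(chunk_vectors)   # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(dim)      # exact inner-product search
index.add(chunk_vectors)

query_vector = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query_vector)

scores, ids = index.search(query_vector, 5)  # top-5 most similar chunks
print(ids[0], scores[0])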

Generator

The generator is a large language model—for instance, a GPT-based architecture or T5—that takes not only the user's query but also the retrieved text chunks as context to produce a response. The generator typically has a limited context window (e.g., a few thousand tokens in GPT-style models), so careful control over how the retrieved documents are appended, summarized, or re-encoded is crucial.

The generator might also rely on specialized input formatting. For instance, a prompt could look like:

"User query: [Q]
Context: [D1] [D2] [D3]
Answer: …"

Advanced frameworks like LangChain or LlamaIndex handle this prompt concatenation automatically, but if you are implementing RAG from scratch, you must be strategic about how you pass context to the model to avoid exceeding token limits or losing important details.
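
If you assemble the prompt yourself, a small helper like the one below keeps the formatting consistent and caps the amount of context. This is a hand-rolled sketch, not the API of any particular framework; the character budget is a crude stand-in for proper token counting.

def build_prompt(query, retrieved_chunks, max_context_chars=6000):
    """Concatenate retrieved chunks into a prompt, truncating if needed."""
    context_parts, used = [], 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        if used + len(chunk) > max_context_chars:
            break  # crude guard against exceeding the context window
        context_parts.append(f"[D{i}] {chunk}")
        used += len(chunk)
    context = "\n".join(context_parts)
    return f"User query: {query}\nContext:\n{context}\nAnswer:"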

Chunking (Document Splitting)

Because documents can be very large and exceed typical context windows, the pipeline usually splits each document into smaller chunks of text. For instance, each chunk might be 200–500 words or tokens. Each chunk is then embedded independently, so that retrieval can be more fine-grained.

Chunking strategies vary. One might use:

  • Simple fixed-size segments (e.g., 256-token windows).
  • Semantic segmentation based on headings or paragraphs.
  • Recursive character/paragraph splitters that break text at logical boundaries.

The chunk size profoundly impacts retrieval performance. Overly large chunks might reduce the precision of retrieval, while overly small chunks could lose context.
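
For instance, a slightly more careful splitter than plain fixed-size windows adds a small overlap between consecutive chunks so that context is not cut mid-thought. The sketch below is one simple way to do it, using word counts as a rough proxy for tokens.

def split_with_overlap(text, chunk_size=300, overlap=50):
    """Split text into ~chunk_size-word chunks, with `overlap` words shared between neighbors."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap  # step back so adjacent chunks share some context
    return chunks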

Vector Database

A vector database stores all the chunk embeddings and allows fast approximate nearest neighbor queries. Internally, it may employ indexing structures such as an inverted file (IVF) index, k-means-based product quantization, or HNSW-style graphs to achieve sub-linear search times.

When building a vector storage, the general steps are:

  1. Ingest documents.
  2. Split them into chunks.
  3. Embed each chunk.
  4. Insert these embeddings into a vector database, typically with metadata (e.g., chunk ID, source document, page number).

At query time:

  1. The query is embedded.
  2. The database returns the top-k most similar chunks.
  3. Those chunks are fed into the generator model.

Orchestration: Encapsulation And Workflow

In a complete pipeline, the RAG steps need to be orchestrated. This can be done manually (by chaining together embedding, vector search, and generation calls in your code) or by using frameworks like:

  • LangChain
  • LlamaIndex
  • FastRAG
  • Haystack

These frameworks integrate data ingestion, chunking, embedding, retrieval, and generation steps under a uniform API, helping you quickly stand up RAG-based applications. They also offer convenient modules for memory (capturing conversation history), caching, tool usage (e.g. calling external APIs before generation), and advanced QA chaining.

Implementation Details

Building A Minimal RAG Pipeline

To illustrate the general structure of a RAG pipeline, I will now provide an example snippet in Python. This example uses a hypothetical embedding model (like OpenAI's embeddings API) and a vector database interface (like FAISS or Pinecone).


import os
import openai
import numpy as np

# Hypothetical vector DB client, e.g. Pinecone
# NOTE: this snippet follows the legacy openai (<1.0) and pinecone-client (<3.0) interfaces
import pinecone

# Step 1: Chunking 
def split_document_into_chunks(document, chunk_size=300):
    words = document.split()
    chunks = []
    current_chunk = []
    for word in words:
        current_chunk.append(word)
        if len(current_chunk) >= chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

# Step 2: Generating embeddings
# We'll use OpenAI's embedding endpoint for demonstration
def get_embedding(text):
    # This call requires your OpenAI API key to be set in openai.api_key
    # e.g. openai.api_key = "YOUR_KEY"
    response = openai.Embedding.create(
        input=[text],
        model="text-embedding-ada-002"
    )
    vector = response['data'][0]['embedding']
    return vector

# Step 3: Indexing chunks in a vector store
def index_in_pinecone(chunks, index_name="my_index"):
    # Initialize Pinecone
    pinecone.init(api_key="YOUR_API_KEY", environment="us-east1-gcp")
    
    # Create index if it doesn't exist
    if index_name not in pinecone.list_indexes():
        pinecone.create_index(index_name, dimension=1536)
    
    index = pinecone.Index(index_name)
    
    upserts = []
    for i, chunk in enumerate(chunks):
        chunk_vector = get_embedding(chunk)
        upserts.append((str(i), chunk_vector, {"text": chunk}))
    
    index.upsert(vectors=upserts)

# Step 4: Retrieval
def retrieve_chunks_from_pinecone(query, index_name="my_index", top_k=3):
    index = pinecone.Index(index_name)
    query_vector = get_embedding(query)
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    return [match["metadata"]["text"] for match in results["matches"]]

# Step 5: Generation with retrieved context
def generate_answer(query):
    # 1. Retrieve
    relevant_chunks = retrieve_chunks_from_pinecone(query)
    # 2. Form prompt (join the retrieved chunks into one context block)
    context = "\n".join(relevant_chunks)
    prompt = f"User query: {query}\nContext: {context}\nAnswer:"
    
    # 3. Use GPT for generation
    completion = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=150
    )
    return completion.choices[0].text.strip()

# Putting it all together:
if __name__ == "__main__":
    sample_document = "Here is a long text about advanced machine learning, ...
                       We also discuss concepts like RAG, MLOps, and so forth."
    chunks = split_document_into_chunks(sample_document)
    index_in_pinecone(chunks)
    
    user_query = "What is RAG in the context of LLMs?"
    answer = generate_answer(user_query)
    print(answer)

In this dummy example, I have illustrated how you might chunk a document, embed the chunks, store them in Pinecone, retrieve the top few matches for a query, and pass them into a GPT-based model. In a real production environment, you would likely refine each step, such as:

  • Using more sophisticated chunk splitting (by sentence or headings).
  • Caching embeddings so that you don't re-encode the same text repeatedly.
  • Performing additional logic to format or re-rank retrieved chunks.

Nevertheless, this general pattern is representative of many RAG systems.

Multi-Hop Retrieval And Re-Ranking

An advanced technique called multi-hop retrieval can address queries that require multiple reasoning steps or combining information from multiple chunks. In multi-hop retrieval, the system iteratively refines the query or expands the set of candidate documents. The newly retrieved documents at each step are used to formulate a subsequent query.
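
A bare-bones sketch of this iterative loop is shown below. Both retrieve and rewrite_query_with_context are hypothetical callables: the former wraps whatever vector search you use, and the latter would ask an LLM to reformulate the query given the evidence gathered so far.

def multi_hop_retrieve(query, retrieve, rewrite_query_with_context, hops=2, top_k=3):
    """Iteratively retrieve, then refine the query using what was found."""
    collected = []
    current_query = query
    for _ in range(hops):
        chunks = retrieve(current_query, top_k=top_k)
        collected.extend(chunks)
        # Ask an LLM to reformulate the query in light of the retrieved evidence.
        current_query = rewrite_query_with_context(query, collected)
    # De-duplicate while preserving retrieval order.
    seen, unique = set(), []
    for chunk in collected:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique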

You can also incorporate re-ranking steps (similar to how cross-encoders function) to reorder the retrieved documents based on deeper semantic checks. Approaches like ColBERT or re-rankers fine-tuned on question-answer pairs might significantly improve retrieval precision.
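
For example, a cross-encoder from the sentence-transformers library can score each (query, chunk) pair jointly and reorder the candidates. The model name below is one commonly used MS MARCO re-ranker and is given only as an example; a domain-specific cross-encoder may work better for your corpus.

from sentence_transformers import CrossEncoder  # pip install sentence-transformers

# Load the re-ranker once and reuse it across queries.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_n=3):
    """Score each (query, chunk) pair jointly and keep the top_n best chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]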

Orchestration Frameworks

LangChain

LangChain is one of the most popular frameworks for building end-to-end RAG pipelines. It allows you to define “chains” of prompts and connect them with broader retrieval or question-answer modules. It also integrates conversation “memory,” tool usage (including external APIs), and advanced prompting techniques.

LangChain's advantage lies in packaging many best practices for LLM usage into a single cohesive library. For instance, you can define a chain that first rewrites the user query to enhance retrieval, fetches top-k documents, and calls a second chain for summarization or final answer generation.
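
As a rough sketch of what this looks like in code, here is a retrieval-QA chain following the older (pre-0.1) LangChain module layout; newer releases split these imports across langchain_community and langchain_openai, so treat this as illustrative rather than definitive.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# Build a tiny in-memory vector store over a couple of example texts.
texts = [
    "RAG combines a retriever over external documents with a generative LLM.",
    "Vector databases index chunk embeddings for fast similarity search.",
]
vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())

# Retrieve top-k chunks, "stuff" them into one prompt, and generate the answer.
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
)
print(qa_chain.run("What does RAG combine?"))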

LlamaIndex

LlamaIndex (formerly GPT Index) is similarly oriented toward retrieval-augmented tasks, but focuses heavily on indexing and building hierarchical or graph-based structures on top of your data. It can be used with a variety of LLMs and vector databases. LlamaIndex covers chunking, embedding, retrieval, and generation while still allowing you to customize each step.

FastRAG

FastRAG is an emerging library from Intel Labs that emphasizes optimizing the retrieval-augmented pipeline for low-latency response times, employing advanced caching and model acceleration.

RAG Versus Fine-Tuning

RAG is often contrasted with the more traditional approach of fine-tuning, in which a language model is updated (via gradient-based training) on a domain-specific corpus or a given dataset. The difference can be summarized as follows:

  • Fine-Tuning:
    1. You effectively bake domain knowledge into the model's parameters.
    2. The approach can yield excellent domain-specific results but tends to be static—once trained, the knowledge is frozen until a new fine-tuning round.
    3. Can be expensive or infeasible for extremely large LLMs.
  • RAG:
    1. You keep the model's parameters fixed, but attach an external knowledge base or vector database.
    2. Ensures up-to-date knowledge is always available, as you can update the external data store regularly without retraining the model.
    3. May require well-engineered retrieval index structures to keep latency manageable.

In many real-world scenarios, RAG is a more flexible approach: if your knowledge base changes frequently or must incorporate multiple data sources, it's usually more practical to retrieve from an updatable store than to re-train or fine-tune a large model from scratch.

Use Cases And Applications

  1. Open-Domain Question Answering: RAG enables robust QA in scenarios where the answer to a question may lie in a large text corpus or website. As changes occur in the corpus, the system remains accurate without retraining.

  2. Customer Support Chatbots: A RAG-based system can retrieve relevant knowledge base content (FAQs, policy documents, troubleshooting guides) and base its answers on up-to-date references, drastically reducing the risk of providing outdated information.

  3. Enterprise Knowledge Management: In an enterprise setting, RAG can serve as a dynamic interface to large volumes of documents—memos, wikis, policy docs—without requiring elaborate data wrangling each time.

  4. Scientific Literature Search: Researchers can query a database of academic papers by embedding user queries and retrieving relevant sections, prompting the language model to summarize or highlight key points.

  5. News And Trend Monitoring: Journalists or data analysts can retrieve the most relevant news fragments to unify them into a coherent storyline for real-time analysis.

  6. Educational Applications: RAG-based tutoring systems can retrieve relevant textbooks or reference materials in real time, augmenting the knowledge of a base language model.

Evaluating RAG-Based Systems

Retrieval Metrics

One part of evaluation focuses on retrieval quality. Common retrieval metrics include:

  • Recall@k: fraction of queries for which a relevant document is among the top-k retrieved results.
  • MRR (Mean Reciprocal Rank): measures how high in the ranking the first relevant document appears.
  • nDCG (Normalized Discounted Cumulative Gain): accounts for multiple relevance levels in ranking.

For advanced domain-specific tasks, a manual annotation or gold-labeled set might be needed to measure how well retrieval is performing.
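
Given ranked retrieval results and a set of relevant document IDs per query, these metrics are straightforward to compute. A minimal sketch for Recall@k and MRR, using made-up document IDs:

def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top-k results, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Averaging over a set of queries gives Recall@k and MRR for the retriever.
queries = [
    (["d3", "d7", "d1"], {"d1"}),   # relevant doc retrieved at rank 3
    (["d2", "d9", "d4"], {"d5"}),   # relevant doc not retrieved at all
]
print(sum(recall_at_k(r, rel, 3) for r, rel in queries) / len(queries))   # Recall@3
print(sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries))  # MRR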

Generation Metrics

Once relevant documents are retrieved, the language model's generation is evaluated with metrics like:

  • Perplexity: how well the model predicts the observed text, though less common for open-ended tasks.
  • ROUGE/BLEU: measure textual overlap with a reference answer (used in summarization or QA); see the snippet after this list.
  • Factual accuracy: specialized to check correctness of the produced statements (can be done partially with retrieval-based cross-checking).
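
As a quick illustration of the overlap-based metrics, ROUGE scores between a generated answer and a reference can be computed with Google's rouge-score package; the strings below are toy examples.

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "RAG retrieves external documents and feeds them to the language model."
generated = "RAG feeds retrieved external documents into the language model."
scores = scorer.score(reference, generated)  # score(target, prediction)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)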

In knowledge-intensive tasks, human evaluation or specialized QA metrics often remain the gold standard to measure the “usefulness” and correctness of generated answers.

Holistic End-To-End Evaluation

It is often practical to adopt pipeline-level metrics. For instance, a question-answering system can be scored on whether the final answer is correct, ignoring the intermediate question of which documents were retrieved. Tools like Ragas or DeepEval allow direct end-to-end QA evaluation and help diagnose where errors occur (retriever or generator).

Potential Pitfalls And Future Directions

  1. Hallucinations: Even if relevant documents are retrieved, LLMs sometimes hallucinate or fabricate details. Careful prompt engineering and chain-of-thought checking can reduce but not eliminate this issue.

  2. Domain-Specific Embeddings: If your corpus is domain-specific (e.g., legal texts, chemical patents), pretrained generalist embeddings may fail to accurately capture domain concepts. Fine-tuning or specialized embedding models can improve retrieval performance.

  3. Latency And Scalability: Large corpora plus big LLMs can cause response delays. Strategies such as quantization, distillation, caching, and approximate nearest neighbor indexing are crucial for real-world viability.

  4. Security And Privacy: Many RAG pipelines rely on external APIs for embedding or generation. Sensitive data might need to remain on-premises, prompting the search for private embedding models or self-hosted solutions.

  5. Multilingual Retrieval: Substantial progress is still needed on multilingual RAG, where queries and documents may appear in multiple languages. Cross-lingual embedding approaches, such as LaBSE or multilingual MiniLM, can help unify the retrieval space.

  6. Knowledge Graphs Integration: Some pipelines integrate knowledge graphs or relational data with embeddings for schema-aware retrieval. This approach can provide structured knowledge and improve interpretability, but requires more sophisticated indexing and retrieval logic.

  7. Advanced Re-Ranking And Fusion Techniques: Future research is exploring how an LLM can dynamically re-rank or fuse multiple retrieved pieces of text, especially for multi-hop reasoning.

Example Code Snippets For Advanced Features

Multi-Query Retrieval

In multi-query retrieval, the system might reformulate the user's original query multiple times to capture different facets of the question. Below is a simplified demonstration:


def multi_query_retrieval(query, times=3):
    # Step 1: Generate expansions or reformulations
    # For domain-specific tasks, you might use a specialized LLM or rules
    expansions = []
    for i in range(times):
        expansion_prompt = f"Rephrase the query in a different way:
Original query: {query}
Alternative version #{i+1}:"
        completion = openai.Completion.create(engine="text-davinci-003", prompt=expansion_prompt, max_tokens=50)
        expansions.append(completion.choices[0].text.strip())
    
    # Step 2: Retrieve for each expansion
    all_retrieved_chunks = []
    for eq in expansions:
        eq_chunks = retrieve_chunks_from_pinecone(eq, top_k=2)
        all_retrieved_chunks.extend(eq_chunks)
    
    # De-duplicate or re-rank final chunks
    unique_chunks = list(set(all_retrieved_chunks))
    # Optionally run a re-ranking step
    # ...
    return unique_chunks

Here, I generate multiple expansions of the query. Each expansion is used to retrieve top-k results, and then all the retrieved chunks are merged and re-ranked. This approach sometimes unearths relevant documents that would be missed by a single retrieval query.

Integrating Summaries Or Distillation

Instead of passing raw retrieved text to the generator, you can compress or summarize each chunk before final usage, especially when chunk sizes are large.


def summarize_chunk(chunk):
    prompt = f"Summarize this text in a concise paragraph:
{chunk}
Summary:"
    summary = openai.Completion.create(engine="text-davinci-003", prompt=prompt, max_tokens=80)
    return summary.choices[0].text.strip()

def retrieve_and_summarize(query, top_k=3):
    chunks = retrieve_chunks_from_pinecone(query, top_k=top_k)
    summaries = [summarize_chunk(ch) for ch in chunks]
    return summaries

This ensures your final prompt to the LLM has more relevant coverage of multiple retrieved chunks while staying within the model's context window. Summarization can be performed through smaller or specialized language models to reduce cost and latency.

Conclusion

Retrieval-Augmented Generation is an exciting, powerful paradigm for bridging the gap between massive language models and real-world knowledge. By harnessing vector embeddings, sophisticated indexing structures, and generative AI, RAG can provide accurate, context-aware, and up-to-date responses in domains where knowledge changes frequently. The synergy of retrieval and generation reduces the need for repeated fine-tuning, offers dynamic knowledge updates, and can significantly improve the reliability and factual grounding of LLM outputs.

From a theoretical perspective, RAG thrives on well-structured retrieval probabilities, advanced embedding models, and carefully orchestrated multi-step generation. In practical terms, developers face a suite of engineering challenges regarding text chunking, metadata management, latency, cost optimization, and data governance. Nonetheless, the ecosystem supporting RAG—from open-source frameworks like LangChain and LlamaIndex to commercial vector databases and HPC-optimized pipelines—is rapidly maturing.

Whether you are building enterprise chatbots, knowledge-driven question-answering systems, scientific literature discovery tools, or real-time data analysis platforms, RAG can be a cornerstone of a robust, future-proof solution. By leveraging RAG, I believe you can design LLM-powered services that truly reflect the latest information and deliver domain-specific insights with precision, clarity, and trustworthiness.

If you are keen to expand these ideas further, consider exploring next-generation retrieval systems (e.g., dense passage retrieval with domain adaptation, knowledge graphs, or retrieval with advanced re-ranking), investigating advanced multi-hop or multi-turn retrieval strategies, or experimenting with specialized hardware acceleration for large-scale deployments. RAG stands at the intersection of cutting-edge NLP, IR (Information Retrieval), and knowledge management—a nexus that I expect will continue evolving swiftly in the coming years.

I encourage you to experiment with the code snippets, adapt them to your domain, and keep a close eye on new developments in the broader IR and generative AI research communities. Bringing retrieval augmentation fully into the LLM workflow can unlock unprecedented potential for real-time knowledge assimilation, bridging the gap between static parametric knowledge and the ever-changing world of information.
