

🎓 159/167
This post is a part of the AI engineering educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page; the order in Research can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!
While retrieval-augmented generation has already gained prominence through frameworks that combine vector databases with generative models, real-world production systems rarely hinge on a single strategy. Instead, they integrate multiple data sources, exploit specialized structured queries, and sometimes incorporate sophisticated tool usage to better address the user's needs. All these refinements belong under the rubric of advanced RAG. Throughout this piece, I'll dive into specialized query-construction techniques, dynamic tool usage by agentic frameworks, advanced post-processing such as re-ranking or classification, and the new wave of programmatic approaches to refining prompts and model usage. To ground these discussions, I'll also offer references to notable research from leading conferences like NeurIPS and ICML, as well as code snippets to illustrate practical implementations.
Before we dive into the heart of advanced RAG, let me briefly situate it within the broader context of the course. We've already discussed the fundamentals of data science, including how to handle the complexities of various data modalities (like images, text, and audio) and how to set up robust machine learning pipelines. We have also touched on the basics of building retrieval-augmented systems in simpler forms, such as using a local vector store with approximate nearest neighbor (ANN) indexing for knowledge lookup. In the advanced RAG context, we will see how those essential building blocks—like indexing, embedding, and initial prompt engineering—morph into large-scale, multi-stage pipelines that handle complex real-world tasks.
It's essential to state that advanced RAG, although it heavily relies on the synergy between external data sources and large language models, is also about orchestrating a wide variety of tools and specialized sub-modules. You may discover that a single step of retrieving relevant documents is insufficient if you also need, for instance, to gather up-to-date data from a web API, retrieve structured data from a relational database, and then unify all the information in a consistent response. Agents that can dynamically choose the appropriate tool or method, craft queries in different languages (SQL, Cypher, etc.), and interpret results are becoming more critical to advanced RAG pipelines. This piece will guide you through precisely how such systems can be designed, optimized, and robustly implemented.
In sum, this article will progressively transition from the underlying principles of advanced RAG to a more practical vantage point, offering code and scenario-based examples. The final chapters will provide a glimpse into newly emerging research directions in RAG. Let's begin with a quick recap of RAG fundamentals before proceeding to sophisticated pipeline enhancements.
rag fundamentals revisited
Before plunging into advanced RAG, it's worthwhile to restate the major components of a standard retrieval-augmented generation system and highlight the motivations that led to such designs in the first place. The classic formulation—popularized by works such as Lewis et al. (2020) in open-domain question answering—posits that an LLM alone, however powerful, remains limited by its training cutoff date or by the inherent risk of generating hallucinations. Hence, we provide the LLM with external knowledge in the form of retrieved documents, enabling more grounded and accurate responses.
standard pipeline
In a typical RAG pipeline, you have a user query \(q\), which gets transformed into a vector representation \(\mathbf{q} = E(q)\) via an embedding function \(E\). That vector is used to query a vector database (often an ANN index like FAISS or ScaNN), retrieving a set of candidate documents \(D_q\). These documents—often selected based on top-k similarity—are fed into a generative model, which merges the user query and the retrieved text to produce a final response. The generative process can be schematically described as:
\( p(y \mid q) \approx \sum_{d \in D_q} p(d \mid q)\, p(y \mid q, d) \)
where \(D\) is the entire corpus, and \(D_q \subseteq D\) is the subset retrieved from that corpus. At inference time, the model effectively conditions on the relevant context. The hallmark advantage is that the model's final answer is more likely to remain faithful to these external documents, mitigating one dimension of hallucination.
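To make the standard pipeline concrete, here is a minimal sketch under stated assumptions: embed is a placeholder embedding function, index is a FAISS index already populated with document vectors (in the same order as doc_texts), and llm is a callable that takes a prompt string and returns text.
import numpy as np
import faiss

def naive_rag_answer(query, embed, index, doc_texts, llm, top_k=5):
    """Basic RAG: embed the query, retrieve top-k documents, generate an answer."""
    # 1. Embed the user query (embed is a hypothetical embedding function)
    q_vec = np.asarray(embed(query), dtype="float32").reshape(1, -1)
    # 2. Retrieve the nearest documents from the ANN index
    _, indices = index.search(q_vec, top_k)
    retrieved = [doc_texts[i] for i in indices[0]]
    # 3. Assemble the prompt and generate a grounded answer
    context = "\n\n".join(retrieved)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm(prompt)
Everything that follows in this article elaborates on one or more of these steps.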
motivations for advanced approaches
Though this pipeline works well in many open-domain QA settings, it often proves insufficient for more complex tasks. You might need, for instance, to handle:
- Structured data: A query might best be served by a table lookup in SQL or a subgraph query in a graph database.
- Multiple data modalities: You might require images, videos, or domain-specific data that cannot be processed solely via embeddings.
- Real-time data: Some contexts require APIs that yield fresh information, such as weather data or stock prices.
- Tool usage: For tasks that require calculations or transformations, you may integrate a Python interpreter or specialized libraries.
- Post-processing: Re-ranking and filtering can further refine the set of documents or data that the model eventually consumes.
- Programmability: Systems that systematically adapt prompts or manipulate their internal states programmatically based on feedback or evaluation metrics.
These intricacies define the advanced RAG topic. Next, I'll dive into the main conceptual expansions: advanced query construction, specialized agents and tool usage, multi-stage post-processing, and frameworks for programmatic LLM usage.
advanced pipeline architecture
Let's open up the advanced pipeline. You can visualize it as a multi-step, branching structure that starts with the user's raw query, proceeds through possible transformations or expansions of that query, consults an arsenal of retrieval tools or external services, re-ranks or fuses retrieved results, and, finally, channels all that curated context into a generative model. By the time the pipeline has concluded, you ideally have a robust response that merges the best of an LLM's generative powers with up-to-date, relevant, and multimodal knowledge.
[Figure: A schematic view of an advanced RAG pipeline with multiple retrieval branches and post-processing.]
For clarity, let's define the pipeline's main stages:
- Query understanding: Attempt to interpret user intent and possibly reformulate or expand the query.
- Structured query construction: If the data is in a structured source (like SQL or a graph database), construct the appropriate query language.
- Agent-based tool usage: If the question demands a specialized action (like calling a Python function), an agent can orchestrate these steps.
- Retrieval from multiple sources: Retrieve relevant documents from an array of different indices, each specialized in a certain domain or data modality.
- Post-processing: Once the relevant data is collected, re-rank or unify these results, possibly using classification, clustering, or other advanced heuristics.
- Prompt assembly: Insert all curated data into the generative model's prompt.
- Response generation: The model produces an answer, which can optionally be refined or re-checked in a final validation step.
bridging structured and unstructured knowledge
An essential challenge in advanced RAG arises when you have to combine structured data, perhaps stored in relational or graph databases, with unstructured knowledge from corpora. An advanced pipeline might simultaneously run a vector retrieval query on an entire text corpus while also generating an SQL query to fetch relevant rows from a table. The pipeline merges these results, feeding them to the language model for final synthesis.
concurrency vs. sequential flow
One subtlety is whether you want these retrieval steps to happen concurrently or sequentially. A concurrency-based approach might spawn multiple retrieval tasks (e.g., vector search, SQL query, web API call) in parallel, aggregate the results at the end, and feed them all to the model. A sequential approach might first retrieve text documents, analyze them, and then decide if additional queries or expansions are necessary. Both designs are valid; concurrency often yields faster results, but a carefully orchestrated sequential approach can produce more refined queries at each step.
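As a rough sketch of the concurrency-based design, the snippet below fans out three retrieval tasks in parallel using Python's standard library; vector_search, run_sql_query, and call_weather_api are hypothetical stand-ins for whatever retrievers your pipeline actually exposes.
from concurrent.futures import ThreadPoolExecutor

def concurrent_retrieval(user_query):
    """Run independent retrieval tasks in parallel and gather their results."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Each submitted function is a placeholder for a real retriever
        futures = {
            "docs": pool.submit(vector_search, user_query),
            "rows": pool.submit(run_sql_query, user_query),
            "weather": pool.submit(call_weather_api, user_query),
        }
        # Aggregate everything before prompt assembly
        return {name: future.result() for name, future in futures.items()}
A sequential variant would simply call these retrievers one at a time, inspecting each result before deciding whether the next call is needed.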
query construction
Query construction is integral when retrieving from structured data. If a user wants data about, say, daily sales for a specific product from a SQL-based data warehouse, an LLM that only indexes textual knowledge might fail. We can attempt to parse the user's text query and produce a structured query in SQL. This technique is sometimes referred to as query translation or text-to-SQL. The same principle applies to graph databases with query languages like Cypher or SPARQL.
text-to-sql
Translating natural language to SQL has been a research problem for years, with numerous neural semantic parsing approaches. One of the more direct methods in advanced RAG is to treat the LLM as a translator:
- Prompt: Provide the LLM a prompt that includes examples mapping English queries to SQL.
- Fine-tuning: Optionally fine-tune the model on a text-to-SQL dataset.
- Execution: The model outputs the SQL query string, which can be executed against the database.
Once you have the results, you feed them back into the pipeline, merging them with your unstructured retrieval results. This synergy can significantly improve coverage for queries that rely on precise, tabular data. However, you might also want to ensure your system verifies or at least sanity-checks the LLM-generated SQL. Failing to do so can lead to queries that are invalid or possibly malicious (e.g., inadvertent table drops).
import sqlite3

def generate_sql_query(llm, user_query, db_schema):
    """
    Generate a SQL query from a user_query.
    db_schema is a textual description of the database schema that can
    guide the LLM on table names and structures.
    """
    prompt = f"""
You are translating the user's request into a valid SQL query.
Here is the schema:
{db_schema}
User wants: {user_query}
Produce only a valid SQL statement without explanation.
"""
    # Hypothetical call to the language model; llm is assumed to be a callable
    # that takes a prompt string and returns generated text
    sql_query = llm(prompt)
    return sql_query

# Example usage
conn = sqlite3.connect('sales.db')
user_request = "Show me the total revenue for product X in the last week."
schema = "Tables: [sales(date, product, amount), products(id, name)]"
sql = generate_sql_query(llm, user_request, schema)
result = conn.execute(sql).fetchall()
print("SQL query:", sql)
print("Query result:", result)
In the snippet above, the function generate_sql_query demonstrates a naive approach. In production, you'd refine this pattern with more robust error handling, partial query checks, and possibly a subsequent step for verifying the correctness of the returned SQL.
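To make the sanity-check idea tangible, here is one lightweight option, sketched under the assumption that you only ever want single read-only SELECT statements: reject anything else and dry-run the query plan with SQLite's EXPLAIN before real execution. This illustrates the pattern; it is not a complete defense against injection or misuse.
import sqlite3

def sanity_check_sql(conn, sql_query):
    """Return True if the generated SQL looks like a single, valid SELECT."""
    stripped = sql_query.strip().rstrip(";")
    # Allow only one read-only statement
    if ";" in stripped or not stripped.lower().startswith("select"):
        return False
    # Dry-run the query plan without touching the data
    try:
        conn.execute(f"EXPLAIN {stripped}")
    except sqlite3.Error:
        return False
    return True
In the earlier snippet, you could call sanity_check_sql(conn, sql) right before executing the query and fall back to a clarification step whenever it returns False.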
graph queries
Similarly, query construction can extend to graph databases, where the data is stored as nodes and edges, and queries can leverage graph structures in languages like Cypher or Gremlin. This can be critical when dealing with knowledge graphs or social networks. The principle is the same: parse user intent, produce a well-formed query, retrieve the relevant subgraph, and unify that data with other retrieved contexts.
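The translator pattern from the SQL example carries over almost unchanged; the sketch below produces Cypher instead, reusing the same hypothetical llm callable and leaving execution to whichever graph client you use (the neo4j driver, for instance). The schema string is illustrative.
def generate_cypher_query(llm, user_query, graph_schema):
    """Translate a natural-language request into a Cypher query string."""
    prompt = f"""
You are translating the user's request into a valid Cypher query.
Graph schema (node labels, relationship types, properties):
{graph_schema}
User wants: {user_query}
Produce only a valid Cypher statement without explanation.
"""
    # Same hypothetical llm callable as in the SQL example
    return llm(prompt)

# Example usage; execution is left to your graph client
graph_schema = "(:Person {name})-[:WORKS_ON]->(:Project {name}), (:Person)-[:REPORTS_TO]->(:Person)"
cypher = generate_cypher_query(llm, "Who works on Project Apollo?", graph_schema)
print(cypher)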
agents and tool usage
One of the more exciting developments in advanced RAG is the notion of agents that can seamlessly integrate multiple tools to accomplish a task. While a single vector store is a specialized retrieval tool, real-world questions often require additional steps, such as:
- Using a Python interpreter to perform calculations.
- Consulting an external service like Google or Wikipedia for new information.
- Querying an internal ticketing system (like Jira) to fetch the status of a user request.
- Accessing domain-specific APIs (e.g., weather or finance data).
agent frameworks
A suite of frameworks has emerged that let you treat LLMs as orchestrators. For instance, the library LangChain popularized a design pattern where the model can be given instructions about how to call certain tools with arguments, interpret the returned values, and continue the conversation. Another approach, DSPy (Khattab et al., 2023, developed at Stanford), aims to blend code-like structures with language model calls, letting you script complex interactions with an LLM that can be tested and refined.
The agent concept often uses the ReAct (Reason + Act) pattern or chain-of-thought prompting to break down a user's request. Instead of handling the entire query in a single step, the LLM is allowed to reason about the instructions, decide on the next action, and so forth, effectively making a plan. This yields powerful capabilities such as reading partial results from a database, deciding more data is needed, calling another tool, and eventually synthesizing the final response.
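A stripped-down version of that reason-and-act loop might look like the sketch below. The Action:/Final Answer: text format, the tools registry, and the llm callable are all assumptions made for illustration; frameworks like LangChain implement far more robust variants of this loop.
def react_agent(llm, user_query, tools, max_steps=5):
    """Minimal ReAct-style loop: the model alternates reasoning and tool calls."""
    transcript = f"Question: {user_query}\n"
    for _ in range(max_steps):
        # Ask the model to think, then either act or answer
        step = llm(transcript + "\nThought:")
        transcript += f"\nThought: {step}"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step:
            # Expected convention: "Action: tool_name | tool_input"
            action_line = step.split("Action:", 1)[1].strip().splitlines()[0]
            tool_name, _, tool_input = action_line.partition("|")
            observation = tools[tool_name.strip()](tool_input.strip())
            transcript += f"\nObservation: {observation}"
    return "Could not complete the request within the step budget."
Here, tools is a plain dictionary mapping tool names to Python callables, which keeps each tool independently testable.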
pros and cons of an agentic approach
Pros:
- Modularity: Tools can be independently developed and tested.
- Domain coverage: Since multiple domain-specific or specialized APIs can be plugged in, you can handle a wide variety of tasks.
- Better interpretability: The chain-of-thought or reasoning steps can be logged, although one must be cautious about data privacy.
Cons:
- Complexity: The pipeline's complexity can balloon as more tools are introduced.
- Latency: Sequential tool usage can slow responses.
- Security: If the agent can call arbitrary tools, you need robust sandboxing to prevent malicious or erroneous calls.

[Figure: Agent-based RAG example. An example flow where the LLM uses multiple tools sequentially to refine the answer.]
post-processing
Even after retrieving documents or calling upon external tools, you might want to filter, rank, or classify the results to ensure the most relevant context is provided to the language model. Post-processing steps can significantly enhance performance, especially when you have large sets of candidate documents.
re-ranking
Re-ranking is a technique where you apply a secondary model to the retrieved documents, each time adjusting the relevance score. For instance, you might retrieve 50 documents from a vector store but pass them through a BERT-based cross-encoder to compute a more precise relevance measure. Only the top 5 from this refined list are ultimately used for generation. This approach can be formulated as:
\( \hat{d}_i = \underset{d \in D_q}{\operatorname{argmax}} \; f_{\text{rerank}}(q, d) \)
where \(f_{\text{rerank}}\) is a re-ranking function, typically a neural model that takes both the query and the candidate document as input, computing a similarity or relevance score. Re-rankers are often more expensive computationally than the initial retriever but yield higher precision.
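As one concrete instantiation of this re-ranking step, the sketch below uses the CrossEncoder class from the sentence-transformers library; the model name is simply a commonly used public checkpoint and can be swapped for any cross-encoder you prefer, and in production you would load the model once rather than per call.
from sentence_transformers import CrossEncoder

def cross_encoder_rerank(query, docs, top_k=5,
                         model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Score (query, document) pairs with a cross-encoder and keep the best ones."""
    model = CrossEncoder(model_name)
    scores = model.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]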
classification or clustering
In some advanced pipelines, retrieved documents are further categorized or clustered into distinct topics before prompt assembly. This can help in multi-topic queries or tasks that benefit from structured organization of results. Suppose the user's request is broad, covering multiple sub-questions; the pipeline can cluster the retrieved passages according to their themes, feeding each cluster in a structured manner to the generative model.
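For broad requests like that, one quick way to organize the retrieved passages is to cluster their embeddings, as sketched below with scikit-learn's KMeans; embed is again a placeholder for whatever embedding model your retriever already uses, and the number of clusters is an arbitrary choice.
import numpy as np
from sklearn.cluster import KMeans

def cluster_passages(passages, embed, n_clusters=3):
    """Group retrieved passages into topical clusters before prompt assembly."""
    vectors = np.asarray([embed(p) for p in passages])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    clusters = {}
    for passage, label in zip(passages, labels):
        clusters.setdefault(int(label), []).append(passage)
    return clusters  # e.g. {0: [...], 1: [...], 2: [...]}
Each cluster can then be presented to the model under its own sub-heading, which tends to produce better-structured answers for multi-part questions.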
rag-fusion
A specialized technique called RAG-fusion has gained attention in some open-source communities (Raudaschl, 2021). The idea is to fuse multiple retrieved documents at the token-level or representation-level in the language model's hidden states, often requiring a custom architecture or specialized re-ranking approach. The result is a more fine-grained integration of multiple text segments, purportedly leading to better integrative answers. While not mainstream yet, RAG-fusion represents an intriguing direction for advanced RAG.
programmatic large language model usage
Beyond the typical notion of calling an LLM in a single prompt-response fashion, advanced RAG can involve programmatic approaches to building prompts, evaluating outputs, and iterating. Tools like DSPy allow you to define workflows that evaluate the correctness or coherence of an LLM's answer, systematically refine the prompt, and re-run. This process can be repeated until the system meets certain criteria—like factual correctness or a minimal perplexity threshold.
dspy and beyond
DSPy (Khattab et al., 2023) is an experimental framework that treats the language model as a programmable function. You can define typed transformations, constraints, and validation rules, all of which the model is guided to respect. In advanced RAG, you might automatically refine a retrieval query or re-invoke a tool if the LLM's partial answer appears to be missing key data. This kind of self-healing pipeline helps reduce errors and fosters robust performance.
cost and evaluation
However, repeatedly calling an LLM can be expensive, especially if you're dealing with large-scale production systems. Hence, advanced RAG pipelines often incorporate metrics-based heuristics: if the first retrieval + generation pass already meets certain confidence thresholds, you might skip further iterative refinements. On the other hand, if the pipeline detects contradictions or incomplete data, it triggers additional retrieval or re-ranking. Balancing cost, latency, and accuracy is a central concern in advanced RAG design.
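One simple way to encode that balance is a confidence gate: accept the first pass when a confidence estimate clears a threshold, and otherwise trigger one more retrieval-and-generation round. The retrieve, generate, and estimate_confidence helpers and the 0.8 threshold below are hypothetical placeholders meant only to show the control flow.
def answer_with_budget(query, retrieve, generate, estimate_confidence,
                       threshold=0.8, max_passes=2):
    """Skip extra refinement passes when the first answer already looks good."""
    context = retrieve(query)
    answer = generate(query, context)
    for _ in range(max_passes - 1):
        if estimate_confidence(query, context, answer) >= threshold:
            break  # good enough: avoid extra retrieval and LLM cost
        # Naive query expansion: reuse the draft answer to widen retrieval
        context = retrieve(f"{query}\n{answer}")
        answer = generate(query, context)
    return answer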
advanced rag in practice
Let's consider a more concrete scenario that marries all these elements: suppose you're building a help-desk assistant for an enterprise software company. Users might ask:
- Product usage: "How do I deploy the server on a Docker container?"
- Runtime errors: "Why am I getting error code 504 when I query the data pipeline?"
- SQL data: "What was our average page load time on the analytics dashboard last week?"
- Jira tasks: "What is the status of bug ticket DS-1234?"
A single pipeline needs to decide how to handle each request. It might do the following:
- Classify or parse the user query to see if it is a standard knowledge base question, a troubleshooting question, or a data query.
- If it's a knowledge base question, run it through vector retrieval in the unstructured documentation.
- If it's an SQL question, automatically generate the relevant SQL query to retrieve real metrics from a database.
- If it's about a Jira ticket, call an agent that has an integration with the Jira API.
- Aggregate or unify the results from these calls, feed them to the LLM, and produce a coherent answer.
- If needed, re-check the correctness via a quick, domain-specific summarization or classification model, potentially requesting a second pass from the LLM if the answer is flagged as incomplete.
Below is a pseudocode snippet that sketches a possible advanced RAG pipeline with multiple steps:
def advanced_rag_pipeline(user_query, llm, vector_db, sql_db_conn, jira_api):
    """
    Orchestrates an advanced RAG pipeline:
    - Classifies user query
    - Retrieves or calls tools
    - Re-ranks or merges results
    - Generates final answer
    """
    # 1. Classify or parse query
    category = classify_query(user_query)
    context_passages = []
    agent_responses = []

    # 2. If it's likely a knowledge base question, retrieve from vector DB
    if category in ["kb_query", "how_to", "documentation"]:
        docs = vector_db.similarity_search(user_query, top_k=15)
        # Re-rank
        top_docs = re_rank(llm, user_query, docs, top_k=5)
        context_passages.extend(top_docs)

    # 3. If it's an SQL query, let LLM produce a SQL statement
    if category == "data_query":
        sql_query = generate_sql_query(llm, user_query, get_db_schema(sql_db_conn))
        sql_result = sql_db_conn.execute(sql_query).fetchall()
        agent_responses.append(("sql_result", sql_result))

    # 4. If it's a Jira question, call Jira API
    if category == "jira_query":
        ticket_id = extract_ticket_id(user_query)
        ticket_status = jira_api.get_ticket_status(ticket_id)
        agent_responses.append(("ticket_status", ticket_status))

    # 5. Merge or unify contexts
    final_context = unify_contexts(context_passages, agent_responses)

    # 6. Generate final answer
    final_answer = llm.generate_answer(user_query, final_context)

    # 7. Validate and, if needed, refine the answer
    if not is_valid(final_answer):
        final_answer = refine_answer(llm, user_query, final_context)

    return final_answer
In the code above, the system conditionally handles queries based on classification. It calls a vector DB for textual docs, generates SQL queries for data retrieval, and queries the Jira API for ticket details. The results are merged and fed back to the LLM to generate a coherent final answer. Real-world production solutions can be far more elaborate, with additional fallback or error handling layers.
research highlights and cutting-edge directions
Advanced RAG is a highly active research area, with many novel ideas and prototypes emerging:
memory-based rag
Some researchers are exploring ways to maintain a long-term memory for an agent—using not just a static corpus but also a continuously updated event log. Over multiple interactions, the agent can recall past user requests or relevant system changes. This has been studied in contexts like multi-turn retrieval dialogues (Zhang et al., EMNLP 2021), where the system learns to track context across conversation turns.
knowledge distillation and hybrid indexing
Instead of indexing massive external corpora for every query, some are exploring partial knowledge distillation from these corpora into specialized neural networks or embeddings. This approach reduces retrieval latency for extremely large datasets. Hybrid indexing—combining traditional inverted indexes (like BM25) with neural embeddings—still has promise for ensuring coverage of rare terms.
dynamic retrieval augmentation
Work is ongoing to allow the model itself to request more data dynamically. That is, the generation step can produce tokens that specify additional retrieval calls mid-generation (Lao et al., ICML 2022). This approach merges retrieval and generation in real time, letting the LLM refine the retrieval criteria as it composes the final text.
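A toy emulation of this idea is possible even with an ordinary prompt-based LLM: ask the model to emit a marker such as [RETRIEVE: ...] whenever it needs more data, pause, run the retrieval, and resume with the results appended. The marker convention and the llm/search callables below are illustrative assumptions, not the mechanism from the cited work.
import re

def generate_with_dynamic_retrieval(llm, search, query, max_rounds=3):
    """Let the model request extra retrievals mid-generation via a text marker."""
    prompt = (
        "Answer the question. If you need more information, output a single "
        "line of the form [RETRIEVE: <search query>] and stop.\n"
        f"Question: {query}\n"
    )
    for _ in range(max_rounds):
        output = llm(prompt)
        match = re.search(r"\[RETRIEVE:(.*?)\]", output)
        if not match:
            return output  # the model answered directly
        # Run the requested retrieval and append the results to the prompt
        requested = match.group(1).strip()
        prompt += f"\nRetrieved for '{requested}':\n{search(requested)}\n"
    return llm(prompt + "\nNow answer with the information available.")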
single-step vs. multi-step rag
Research is being done on the trade-offs between single-step retrieval (fetch all relevant data at once) and multi-step retrieval (iteratively refine retrieval queries). Multi-step retrieval can significantly improve the final answer's precision, but it can also be slower or more complicated to orchestrate. For certain tasks, though, the iterative approach yields a dramatic performance improvement.
implementation best practices
Putting advanced RAG into production requires balancing performance, cost, correctness, and user experience. I recommend the following best practices:
- Schema design: For structured data, ensure your schemas are well-documented and that the LLM has sufficient context about table or graph structures.
- Tool sandboxing: If you allow agentic frameworks to call external tools, sandbox them carefully to avoid security vulnerabilities.
- Prompt versioning: Maintain versioned prompts for your LLM calls to track which prompt changes lead to performance gains or regressions.
- Batch processing: If dealing with large volumes of queries, consider batching retrieval requests or re-ranking steps to leverage parallelization.
- Caching: Store partial results (e.g., top-50 retrieved documents before re-ranking) to expedite repeated queries or near-duplicate queries; see the sketch after this list.
- Monitoring and feedback: Continuously track metrics such as response latency, user satisfaction, and error rates. Use them to refine your approach.
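Picking up the caching point from the list above, a minimal in-process cache keyed on the normalized query already absorbs a lot of repeated traffic. The retrieve_top_k function below is a hypothetical first-stage retriever, and this sketch deliberately ignores invalidation, TTLs, and distributed caches.
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_retrieval(normalized_query, top_k=50):
    """Cache the expensive first-stage retrieval; re-ranking stays per-request."""
    return tuple(retrieve_top_k(normalized_query, top_k))

def retrieve_with_cache(user_query, top_k=50):
    # Light normalization so near-identical queries hit the same cache entry
    return list(cached_retrieval(user_query.strip().lower(), top_k))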
illustrative formula
To consolidate the advanced retrieval-augmented generation concept from a probabilistic standpoint, consider that we want to compute the posterior distribution for the final answer \(y\) given the user query \(q\) and the entire knowledge base \(\mathcal{K}\) (both structured and unstructured). We can express an approximate factorization as:
\( p(y \mid q, \mathcal{K}) \approx \sum_{r} p(r \mid q, \mathcal{K})\, p_\theta(y \mid q, r) \)
where \(r\) denotes a particular retrieval or set of retrieved items. The advanced pipeline tries to refine \(p(r \mid q, \mathcal{K})\) by including not just text documents but also database records, graph substructures, or external API calls. Once \(r\) is determined, the LLM produces \(y\) via:
\( p_\theta(y \mid q, r) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, q, r) \)
where \(p_\theta\) is parameterized by the large language model's weights (denoted \(\theta\)). By refining the retrieval distribution \(p(r \mid q, \mathcal{K})\) with re-ranking or agent-based expansions, you effectively sharpen the distribution, focusing on the most relevant contexts. The final generation step can be further regulated by techniques such as temperature scaling, top-k sampling, or nucleus sampling.
building robust advanced rag systems
Let's piece everything together as a concluding note on how you might build and deploy an advanced RAG system in a real production environment. Typically, you'd start with:
- Data ingestion: Gather your textual documents, structured data, logs, etc. Build or maintain indexes for each data type, such as vector embeddings for text, indexes for SQL tables, or specialized data lakes for big data contexts.
- LLM selection and alignment: Choose an LLM with the right size, capabilities, and alignment strategy for your domain. Some organizations rely on an external API like OpenAI or Anthropic, while others host their own models (like LLaMA variants) for privacy and cost reasons.
- Tool integration: Decide which tools your agent might need—these might be local Python scripts, web search, or enterprise systems. Provide the agent with descriptive usage instructions.
- Pipeline design: Implement your advanced pipeline logic, possibly using an existing agent framework or building your own custom orchestrator.
- Post-processing: Don't forget the final re-ranking, classification, or response validation steps. You can tailor them to your domain's compliance or correctness needs.
- Monitoring and iteration: Gather telemetry data on real user interactions, watch for failure modes, adjust your pipeline's thresholds, and refine your prompts or re-ranking approach as needed.
additional resources
For readers interested in pushing the boundaries further, I recommend the following references:
- Li et al. (NeurIPS 2021): introduces a multi-step retrieval approach for knowledge-intensive tasks.
- Khattab et al. (2023): DSPy, detailing how to write LLM-based pipelines with typed transformations.
- Raudaschl (2021): the RAG-fusion open-source repository, exploring token-level fusion of retrieved documents.
- Zhang et al. (EMNLP 2021): multi-turn retrieval for extended dialogue contexts.
final reflections
Advanced RAG is, in my view, an evolving ecosystem rather than a single technique. It merges sophisticated retrieval strategies—spanning unstructured, structured, and real-time data sources—with the generative prowess of large language models. By orchestrating an array of tools, from Python interpreters to SQL query constructors, you can handle complex user demands that transcend simple question-answering. Post-processing, dynamic re-ranking, multi-step reasoning, and iterative prompt refinement are all indispensable components in the quest for truly robust, production-grade retrieval-augmented systems.
I hope this discussion has helped clarify not only the conceptual underpinnings of advanced RAG but also the practical considerations that guide its real-world deployment. You can craft your own pipeline by mixing and matching components from the toolkit we've covered: agents, specialized queries, concurrency vs. sequential retrieval, re-ranking, classification, programmatic LLM usage, and more. As research continues to accelerate in this space, you'll find new frameworks, better re-ranking methods, and more powerful agentic solutions. My advice is to keep an eye on emerging trends from top AI conferences and to remain flexible in how you design your advanced RAG pipelines. Through continuous experimentation and iteration, you'll be well on your way to building state-of-the-art systems that unify the best of retrieval and generation.