Modern RAG - Beyond Vector Search for Effective Information Retrieval
Published on June 8, 2026
Large Language Models are impressive in their ability to generate text that can be convincing and appear to be true. However, on their own, they do not have access to your private documents, product catalogs, machine manuals, project specifications, or internal knowledge bases.
That is why many real-world AI systems use Retrieval-Augmented Generation, usually called RAG.
The basic idea is simple:
Before the LLM answers, we first search for relevant information. Then we give that information to the model as context. The model answers based on that retrieved context.
The difficult part is not the generation but the retrieval. Finding the right information is much harder than it looks on the surface, and retrieval is often the bottleneck of the entire system and the main source of errors.
A simple RAG system usually works well for simple queries like:
servo alarm F217
But it starts to struggle with more complex questions like:
The machine stops with servo alarm F217 after the gripper module was replaced. What is causing it, and which machine variants are affected?
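This second question is not just a keyword search. It contains multiple information needs and requires multiple steps of reasoning and retrieval. The system needs to identify alarm F217, connect it to the gripper module replacement, find the likely cause, and determine which machine variants are affected.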
A naive “top-K similar chunks” retrieval may find one document about F217, but it may miss the machine variant compatibility information. Or it may find a service ticket but miss the official maintenance manual.
This is why a modern RAG system is not just vector search, and why modern retrieval techniques become crucial.
RAG means that the model does not answer only from its internal training data. Instead, it can access external information sources to find relevant context before generating an answer. This allows the model to provide more accurate and up-to-date responses, especially for queries that require specific knowledge that may not be present in the training data.
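The basic RAG pipeline goes like this: the user asks a question, the system retrieves relevant documents, the retrieved text is added to the prompt as context, and the LLM generates the answer.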

Obviously, it's not enough to just produce a fluent answer. The goal is to produce an answer that is grounded in the retrieved information. This means that the model should not hallucinate or make up information, but rather use the retrieved documents as evidence to support its answer.
Any real-world document system may need to search through a variety of sources, such as manuals, tickets, service reports, and meeting notes, and the list goes on.
This knowledge is distributed across documents, emails, Confluence pages, PDFs, and other data sources. The retrieval system needs to be able to search through all of these sources effectively to find the relevant information.
Let's go step by step through the different retrieval techniques that are commonly used in modern RAG systems.
Keyword search is the traditional form of search. It relies on matching the exact words in the query with the words in the documents. Documents are stored in an inverted index: a sorted list of terms in which each term points to the documents that contain it. When a query is made, the search engine looks up the query terms in this index and returns the documents that contain them.

When the user searches for "F217 gripper replacement", the engine can quickly find documents that contain these terms. This is especially useful in technical systems, where exact identifiers are critical. Keyword search is strong for queries that contain such specific terms: if the search misses them, it will not find the relevant documents and the answer may be wrong.
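To make the inverted index idea concrete, here is a minimal sketch in Python. The tokenizer, the function names, and the three sample documents are assumptions for illustration and do not come from any specific search engine.

```python
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; real engines use richer analyzers."""
    return text.lower().split()


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_all_terms(index: dict[str, set[str]], query: str) -> set[str]:
    """Return IDs of documents that contain every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents, for illustration only.
docs = {
    "ticket-1": "servo alarm F217 after gripper replacement",
    "manual-7": "recalibrate servo end positions after replacement",
    "ticket-2": "servo alarm F271 on drive module",
}
index = build_inverted_index(docs)
hits = search_all_terms(index, "F217 gripper replacement")
```

Only `ticket-1` contains all three query terms, so it is the only hit. A query for `F271` would return only `ticket-2`, which is exactly the kind of distinction that matters for identifiers.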
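BM25 is one of the most common ranking algorithms for full-text search. It scores how relevant a document is for a query. The formula is usually written like this:
$$
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
$$
Here $q_1, \dots, q_n$ are the query terms, $f(q_i, D)$ is how often the term $q_i$ occurs in document $D$, $\mathrm{IDF}(q_i)$ is the inverse document frequency weight of that term, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are tuning parameters.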
BM25 is based on the following three important ideas:
Term frequency with saturation. If a query term appears in a document, the document is probably relevant. If it appears several times, the document may be more relevant. But BM25 does not reward repetition endlessly. A document that repeats “servo” 100 times is not automatically 100 times better.
Inverse document frequency. Rare words are more important than common words. For example, the word “machine” may appear in almost every document and it is not very specific. On the other hand, “F217” may appear in only a few documents. That makes it highly informative. BM25 gives rare terms more weight.
Document length normalization. Long documents naturally contain more words. Without correction, long documents would often rank too high. BM25 normalizes by document length, so a short but precise service note can beat a long generic manual.
For all these reasons, BM25 is a powerful algorithm for keyword search. It is simple, efficient, and effective for many types of queries, especially those that contain specific terms or identifiers. It can be used as a first step in a RAG system to quickly narrow down the set of relevant documents before applying more complex retrieval techniques.
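The three ideas can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes whitespace tokenization, uses the IDF variant with `+ 1` inside the logarithm (one of several variants in use), and the defaults `k1=1.5` and `b=0.75` as well as the sample documents are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger weight.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Repetition saturates, and longer documents are normalized via b.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical documents, for illustration only.
docs = [
    "servo alarm f217 after gripper replacement",
    "general machine manual servo servo servo servo maintenance overview",
    "machine overview",
]
scores = bm25_scores("servo F217", docs)
```

The short note that contains the rare term `f217` scores highest. The longer document repeats `servo` four times but does not overtake it, and the third document contains no query term and scores zero.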
Vector search works differently. Instead of matching exact words, it matches "meaning". The system uses an embedding model to convert text into a vector.
For example, the query:
query = "machine stops after gripper replacement"
becomes something like:
query_vector = [0.21, 0.54, 0.10, ...]
Documents are also converted into vectors. The system then calculates the similarity between the query vector and the document vectors, often using cosine similarity. The documents with the highest similarity scores are considered the most relevant.
Example:
doc1 = "The machine stops with servo alarm F217 after the gripper module was replaced."
doc2 = "After replacing the handling unit, recalibrate the servo end positions."
Keyword search for the gripper may miss doc2 because “gripper” and “handling unit” are different words. Vector search may find it because the meaning is similar.
The illustration below shows the process of converting text to vectors and calculating similarity:

One way to measure similarity is simply to compare the vectors using mathematical operations. For example, we can calculate the dot product of the query vector and a document vector. For vectors of comparable length, a higher dot product indicates that the vectors are more similar, which suggests that the document is more relevant to the query. Another common similarity measure is cosine similarity:
$$
\cos(\theta) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
$$
which measures the angle $\theta$ between the query vector $\mathbf{q}$ and the document vector $\mathbf{d}$. A cosine similarity of 1 means the vectors point in the same direction, while a cosine similarity of 0 means they are orthogonal (unrelated). This allows the system to find documents that are semantically similar to the query, even if they do not contain the exact same words.
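As a small illustration of the two measures, here is a sketch in plain Python. The two-dimensional vectors are made up for readability; real embeddings have hundreds or thousands of dimensions.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; independent of vector length."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(a, b) / (norm_a * norm_b)


# Toy vectors, for illustration only.
query_vector = [1.0, 0.0]
same_direction = [2.0, 0.0]   # longer, but points the same way
orthogonal = [0.0, 3.0]       # shares no direction with the query
```

The dot product of `query_vector` and `same_direction` grows with the vector length, while the cosine similarity stays at 1. For `orthogonal`, both measures are 0.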
Vector search is powerful, but it has weaknesses. It can miss exact terms. For example, F217 and F271 look almost identical as text, but they are completely different error codes. A vector model may understand that both are “servo alarm codes”, but it may not always preserve the exact distinction strongly enough. The same applies to other exact identifiers, such as part numbers and machine variant names.
In searching for specific technical information, exactness is crucial. If the system retrieves a document about F271 when the query is about F217, it may lead to incorrect answers. Vector search is good for finding related concepts and understanding meaning. Keyword search is good for precision. That is why reliable RAG systems usually use both.
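A hybrid search system combines keyword search and vector search. One option is to use keyword search to quickly filter documents that contain specific terms, and then apply vector search to rank those documents by semantic similarity. A typical hybrid retrieval flow looks like this: the query is sent to both keyword search and vector search, each search returns its own ranked list, and the two lists are merged into one ranking.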

The results from keyword search and vector search cannot be merged by simply concatenating them because they may have different scoring scales and ranking criteria. Keyword search may produce scores based on term frequency and inverse document frequency, while vector search produces scores based on semantic similarity. These scores are not directly comparable, and simply combining them without normalization can lead to a biased ranking where one method dominates the other. Therefore, a more sophisticated approach like Reciprocal Rank Fusion (RRF) is needed to effectively merge the results from both searches while giving appropriate weight to each method's strengths.
Let's see next how merging the results in hybrid search is done using RRF.
As mentioned, when keyword search and vector search return two different ranked lists, we need to merge them.
One simple and effective method is Reciprocal Rank Fusion, or RRF. The key point is that RRF ignores how high the original scores are. It only cares about the ranks. This makes it robust to different scoring scales and allows it to effectively merge results from different retrieval methods.
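The idea is to take the ranks of the documents in both lists and combine them with a formula that gives more weight to documents that rank higher in either list:
$$
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}
$$
Here $R$ is the set of ranked lists, $\mathrm{rank}_r(d)$ is the position of document $d$ in list $r$ (starting at 1), and $k$ is a constant that dampens the influence of the top ranks. A document that does not appear in a list contributes nothing for that list.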
To illustrate this, imagine we have two ranked lists of documents from keyword search and vector search. The BM25 keyword search returns Document A, Document B, and Document C (in that order), while the vector search returns Document C, Document A, and Document D. Using RRF, we can calculate a combined score for each document based on its ranks in both lists. Setting the parameter k to 60, we can compute the RRF score for each document as follows:
Document A: 1/(60+1) + 1/(60+2) ≈ 0.0325
Document C: 1/(60+3) + 1/(60+1) ≈ 0.0323
Document B: 1/(60+2) ≈ 0.0161
Document D: 1/(60+3) ≈ 0.0159
Document A has the highest RRF score because it ranks well in both lists. Document C scores slightly lower because it ranks first in vector search but only third in keyword search. Documents that appear in only one list end up at the bottom, with Document D last because it only appears in the vector search results and ranks lowest there.
The RRF score for a document is calculated based on its rank in both the keyword search and vector search results. The document's score is higher if it ranks well in either list, and it is boosted if it ranks well in both lists. This allows the system to effectively combine the strengths of both retrieval methods, giving a more balanced and relevant set of results.
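A minimal sketch of RRF in Python, applied to the two example lists from this section. The function name and signature are my own choice, and the list order is taken as the rank order.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge several ranked lists of document IDs using only their ranks."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); missing documents contribute nothing.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["A", "B", "C"]    # keyword search, best first
vector_ranking = ["C", "A", "D"]  # vector search, best first

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking], k=60)
fused_order = [doc_id for doc_id, _ in fused]
```

With `k=60` the fused order is A, C, B, D. The original BM25 and similarity scores never enter the calculation, which is why the different scoring scales do not matter.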
After Reciprocal Rank Fusion, we have a merged list of candidate documents.
This list is already better than using keyword search or vector search alone. Keyword search helped us find exact terms such as error codes, part numbers, and machine names. Vector search helped us find documents with similar meaning, even when the wording was different. RRF then merged both result lists into one ranking.
But there is still a problem.
RRF does not deeply read the query and the document together. It only looks at where a document appeared in the previous result lists. If a document ranked high in keyword search or vector search, RRF gives it a boost. That is useful, but it is still not the same as asking:
Does this document actually answer the question?
This is where reranking comes in.
A reranker takes the top documents from the retrieval stage and scores them again with a more precise model. The retrieval stage gives us candidates. The reranker decides which candidates are actually the best evidence.
The pipeline now looks like this: query → keyword search and vector search → RRF fusion → reranker → top documents → LLM.

The important point is that the reranker is not used over the whole document collection. That would be too slow and too expensive.
Instead, the first retrieval stage reduces the search space. This stage is fast because the expensive work happens ahead of time: the keyword index is built and the document embeddings are precomputed before any query arrives. The reranker then takes the top candidates and scores them with a more expensive model that can read the query and the document together.
This gives us a practical balance.
The retriever is optimized for speed and recall. It should find everything that might be relevant.
The reranker is optimized for precision. It should decide which of those candidates are actually the most useful.
To understand reranking, it helps to compare two model architectures: Bi-Encoders and Cross-Encoders.
A Bi-Encoder processes the query and the document separately.

This is the architecture behind many vector search systems.
The document embeddings can be computed once and stored in a vector database. When a user asks a question, the system only needs to compute the query embedding and compare it against the stored document embeddings.
That makes Bi-Encoders fast and scalable.
This is why they are useful for first-stage retrieval.
But there is a tradeoff.
The query and the document do not interact inside the model. They are compressed separately into vectors, and only afterwards are they compared with a similarity function such as cosine similarity.
That means the model may find documents that are close in meaning, but not necessarily documents that answer the question best.
A Cross-Encoder works differently.

The query and the document are passed into the model together. The model can directly compare the words, numbers, entities, negations, and relationships between them.
This is slower, but usually more precise.
For example, imagine this query:
machine stops with servo alarm F217 after gripper replacement
And these two documents:
Document A:
After replacing the gripper module, recalibrate the servo end positions.
Document B:
Servo alarm F271 can occur after replacing the drive module.
A Bi-Encoder may consider both documents similar. Both mention servo alarms, replacement, and machine behavior.
A Cross-Encoder can look more carefully.
It can see that Document A matches the repair situation and gives a likely cause. It can also see that Document B mentions a different alarm code and a different module.
So the reranker may produce scores like this:
Document A → 0.94
Document B → 0.31
This is the reason Cross-Encoders are often used as rerankers.
They are too expensive to run over millions of documents, but they are very useful for rescoring the top candidates returned by hybrid search.
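The two-stage idea can be sketched without any machine learning library. In the sketch below, the two scorers are deliberately simple stand-ins and not neural models: `fast_score` treats query and document separately and only counts shared words, while `precise_score` looks at both together and penalizes a mismatched alarm code. The alarm code pattern (F plus three digits), the penalty of 5.0, and the third document are assumptions for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def alarm_codes(text: str) -> set[str]:
    """Tokens that look like alarm codes (assumed format: 'f' plus three digits)."""
    return {t for t in tokens(text) if re.fullmatch(r"f\d{3}", t)}


def fast_score(query: str, doc: str) -> float:
    """Stage 1 stand-in: counts shared words; a wrong alarm code is not penalized."""
    return float(len(tokens(query) & tokens(doc)))


def precise_score(query: str, doc: str) -> float:
    """Stage 2 stand-in: reads query and document together and penalizes
    documents that mention an alarm code the query does not contain."""
    mismatched = alarm_codes(doc) - alarm_codes(query)
    return fast_score(query, doc) - 5.0 * len(mismatched)


def retrieve_then_rerank(query: str, docs: list[str], n_candidates: int, top_k: int) -> list[str]:
    # Stage 1: cheap scorer over the whole collection, keep a few candidates.
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:n_candidates]
    # Stage 2: more careful scorer over the candidates only.
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_k]


query = "machine stops with servo alarm F217 after gripper replacement"
doc_a = "After replacing the gripper module, recalibrate the servo end positions."
doc_b = "Servo alarm F271 can occur after replacing the drive module."
doc_c = "Lubricate the conveyor chain at regular intervals."  # hypothetical filler document

best = retrieve_then_rerank(query, [doc_b, doc_c, doc_a], n_candidates=2, top_k=1)
```

Both Document A and Document B share three words with the query, so the first stage cannot separate them. The second stage notices that F271 is not the alarm code in the query and puts Document A on top.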
So is the system we have built so far good enough for any query? Not really. For simple queries, it may work well. But for complex queries that contain multiple information needs, it may struggle to find all the relevant information in one go. A complex query should therefore not be treated as a single retrieval task. Instead, it should be decomposed into multiple sub-queries that can be handled separately. We call this step query planning and decomposition.

For example, the original query:
It stops after we changed the gripper motor and shows F217, maybe calibration was missed. Also, do we still use motor 700-1842 in the AX-360 variant?
A good query planner would decompose it into sub-queries focused on specific aspects of the problem, each with a target source, like:
servo alarm F217 after gripper motor replacement (source: service tickets)
calibration steps after gripper motor replacement (source: maintenance manuals)
compatibility of motor 700-1842 with AX-360 (source: parts catalog)
What happens during query planning and decomposition is that the system identifies the different information needs in the original query and creates separate sub-queries for each need. It corrects misspellings, identifies relevant keywords, and determines which data sources are most likely to contain the relevant information for each sub-query.
This allows the retrieval system to focus on specific aspects of the problem and find more relevant documents for each sub-query, ultimately leading to a more comprehensive and accurate answer when all the retrieved information is combined.
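One way to represent the result of this step is a small data structure that pairs each sub-query with its target source. The sketch below writes the plan for the example by hand; the planner that would produce it is not shown, and the class name and source identifiers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SubQuery:
    text: str    # what to search for
    source: str  # where to search (hypothetical source identifier)


# Hand-written plan for the example query above.
plan = [
    SubQuery("servo alarm F217 after gripper motor replacement", "service_tickets"),
    SubQuery("calibration steps after gripper motor replacement", "maintenance_manuals"),
    SubQuery("compatibility of motor 700-1842 with AX-360", "parts_catalog"),
]


def group_by_source(sub_queries: list[SubQuery]) -> dict[str, list[str]]:
    """Collect the sub-query texts that should be sent to each source."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for sub_query in sub_queries:
        grouped[sub_query.source].append(sub_query.text)
    return dict(grouped)


routing = group_by_source(plan)
```

Keeping the source next to the query text makes the later steps simpler, because each sub-query already carries the information about where it should run.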
At this point, we already have a strong retrieval pipeline.
We can use keyword search for exact matches, vector search for semantic matches, RRF to merge rankings, and reranking to select the best candidates. We can also rewrite and decompose complex user queries into smaller sub-queries.
But there is still one limitation.
The system is mostly following a fixed pipeline.
For many real-world questions, this is not enough. The retrieval system needs to decide what to do next based on the query, the available knowledge sources, and the results it has already found. This is where agentic retrieval comes in.
Agentic retrieval means that we apply agent-like behavior to the retrieval process. The system does not just search once. It plans, searches, checks the results, decides whether something is missing, and may retrieve again.
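A simplified agentic retrieval flow looks like this: plan the sub-queries → select sources → search → check the evidence → run follow-up searches if something is missing → merge the evidence → generate the answer.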

The key idea is that retrieval becomes a controlled reasoning process. Instead of just one retrieval step, the system can have multiple iterations of retrieval, checking, and planning.
This is especially useful when the user asks a question that requires information from multiple places.
For example:
"It stops after we changed the gripper motor and shows F217, maybe calibration was missed. Also, do we still use motor 700-1842 in the AX-360 variant?"
A classic retrieval system may send this as one query to one search index. An agentic retrieval system can do something more useful.
It can create a small retrieval plan:
Search service tickets for: servo alarm F217 after gripper motor replacement
Search maintenance manuals for: calibration procedure after gripper motor replacement
Search parts catalog for: motor 700-1842 compatibility with AX-360
Search variant configuration data for: AX-360 gripper motor configuration
The important part is that these sub-queries are not only different strings. They are different search actions with different target sources.
The system understands that the first part of the query is a troubleshooting problem. The second part is a maintenance procedure question. The third part is a parts compatibility question.
That is the difference between simple retrieval and agentic retrieval.
In real systems, not every knowledge source is useful for every sub-query. A maintenance manual is useful for calibration steps. A service ticket system is useful for known issues and real incidents. A parts catalog is useful for part numbers and compatibility. A variant table is useful for machine configurations. An agentic retriever should therefore decide where to search.
For the example above:
Sub-query:
servo alarm F217 after gripper motor replacement
Best source:
Service tickets, troubleshooting database, alarm documentation
Sub-query:
calibration procedure after gripper motor replacement
Best source:
Maintenance manuals, service instructions
Sub-query:
motor 700-1842 compatibility with AX-360
Best source:
Parts catalog, BOM, variant configuration table
This avoids sending every query to every source. That matters because enterprise search can be expensive, slow, and noisy.
Good source selection improves both quality and latency.
After planning and source selection, the system can execute multiple searches in parallel. This is called fan-out retrieval. In the figure above, we see that the original user query is decomposed into four sub-queries, each targeting a specific aspect of the problem and directed to the most relevant source. The system can run these four sub-queries simultaneously across different databases or search indices.
Each sub-query can still use the full retrieval stack internally: keyword search, vector search, RRF fusion, and reranking. The difference is that now we have multiple sub-queries running in parallel across different sources.
So agentic retrieval does not replace hybrid search. It orchestrates it. Hybrid search is the retrieval engine. Agentic retrieval is the control layer around it.
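Fan-out retrieval can be sketched with a thread pool from the Python standard library. The `search_source` function is a stub that stands in for a per-source hybrid search, and the source identifiers are hypothetical. Only the orchestration pattern is the point here.

```python
from concurrent.futures import ThreadPoolExecutor


def search_source(source: str, query: str) -> list[str]:
    """Stub for a per-source search (keyword + vector + RRF + reranking)."""
    return [f"[{source}] result for: {query}"]


def fan_out(plan: list[tuple[str, str]], max_workers: int = 4) -> dict[str, list[str]]:
    """Run all (source, query) pairs in parallel and collect results per query."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {query: pool.submit(search_source, source, query) for source, query in plan}
        # result() blocks until the corresponding search has finished.
        return {query: future.result() for query, future in futures.items()}


# The retrieval plan from the example, as (source, query) pairs.
plan = [
    ("service_tickets", "servo alarm F217 after gripper motor replacement"),
    ("maintenance_manuals", "calibration procedure after gripper motor replacement"),
    ("parts_catalog", "motor 700-1842 compatibility with AX-360"),
    ("variant_configuration", "AX-360 gripper motor configuration"),
]
results = fan_out(plan)
```

Threads are a reasonable fit when each search is a network call to a database or search service. The total latency is then close to the slowest single search instead of the sum of all searches.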
The first retrieval pass may not be enough.
The system may find the F217 alarm description and the calibration procedure, but not the affected variants. Or it may find the part number but not whether it is still used in AX-360.
A more advanced agentic retriever can reflect on the retrieved evidence and ask whether every part of the question is covered and what is still missing.
If information is missing, the system can run follow-up searches.
For example:
Follow-up query:
AX-360 gripper motor BOM 700-1842 replacement part
Follow-up source:
Variant configuration database
This iterative behavior is one of the biggest differences between simple RAG and agentic retrieval. Simple RAG retrieves once. Agentic retrieval can retrieve, inspect, and retrieve again.
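The retrieve, inspect, and retrieve again loop can be sketched as follows. The `search` and `follow_up` callables are stubs: in a real system the first would be the retrieval stack and the second would be whatever component reformulates a query. The corpus contents are placeholders, not real search results.

```python
from typing import Callable


def agentic_retrieve(
    needs: dict[str, str],
    search: Callable[[str], list[str]],
    follow_up: Callable[[str, str], str],
    max_rounds: int = 3,
) -> tuple[dict[str, list[str]], list[str]]:
    """Search for each information need, then retry uncovered needs
    with a follow-up query, for at most max_rounds rounds."""
    evidence: dict[str, list[str]] = {}
    pending = dict(needs)  # need -> query to try next
    for _ in range(max_rounds):
        if not pending:
            break
        still_missing: dict[str, str] = {}
        for need, query in pending.items():
            hits = search(query)
            if hits:
                evidence[need] = hits
            else:
                still_missing[need] = follow_up(need, query)
        pending = still_missing
    return evidence, sorted(pending)


# Placeholder corpus: only these two queries return anything.
corpus = {
    "servo alarm F217 after gripper motor replacement": ["placeholder hit from service tickets"],
    "AX-360 gripper motor BOM 700-1842 replacement part": ["placeholder hit from variant configuration"],
}
follow_ups = {"compatibility": "AX-360 gripper motor BOM 700-1842 replacement part"}

needs = {
    "fault": "servo alarm F217 after gripper motor replacement",
    "compatibility": "motor 700-1842 compatibility with AX-360",
}
evidence, missing = agentic_retrieve(
    needs,
    search=lambda q: corpus.get(q, []),
    follow_up=lambda need, q: follow_ups.get(need, q),
)
```

In the first round only the fault evidence is found. The second round retries the compatibility need with the follow-up query and succeeds. The `max_rounds` limit keeps latency and cost bounded when nothing can be found.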
When several sub-queries run across several sources, the system must merge the retrieved evidence. This is not just putting all chunks into the prompt.
The system should remove duplicates, group related evidence, prefer authoritative sources, and preserve the link between each piece of evidence and the sub-question it answers.
A good merged result may look like this:
Fault evidence:
Procedure evidence:
Compatibility evidence:
This structured evidence is much more useful for the LLM than a random list of chunks.
The LLM can now generate an answer that is complete and grounded.
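The merging step described above can be sketched as a small function that removes duplicates, groups evidence by sub-question, and orders each group by source authority. The priority table, the source identifiers, and the placeholder texts are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    sub_question: str  # which part of the user question this answers
    source: str        # where the chunk came from
    text: str          # the retrieved chunk


# Hypothetical authority order: lower number means more authoritative.
SOURCE_PRIORITY = {"maintenance_manual": 0, "parts_catalog": 0, "service_ticket": 1}


def merge_evidence(items: list[Evidence]) -> dict[str, list[Evidence]]:
    """Deduplicate, group by sub-question, and put authoritative sources first."""
    seen: set[tuple[str, str]] = set()
    grouped: dict[str, list[Evidence]] = {}
    for item in items:
        key = (item.sub_question, item.text.strip().lower())
        if key in seen:
            continue  # same text already kept for this sub-question
        seen.add(key)
        grouped.setdefault(item.sub_question, []).append(item)
    for group in grouped.values():
        group.sort(key=lambda e: SOURCE_PRIORITY.get(e.source, 99))
    return grouped


# Placeholder chunks, not real retrieved content.
items = [
    Evidence("procedure", "service_ticket", "placeholder ticket note"),
    Evidence("procedure", "maintenance_manual", "placeholder manual step"),
    Evidence("procedure", "service_ticket", "Placeholder ticket note "),
    Evidence("compatibility", "parts_catalog", "placeholder catalog entry"),
]
merged = merge_evidence(items)
```

The duplicate ticket note is dropped, and within the procedure group the manual comes before the ticket. Because every item keeps its `sub_question`, the prompt can present the evidence grouped by the question it answers.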
The main point is simple: RAG should not be reduced to vector search.
Vector search is useful, but it is only one part of the retrieval stack. Real users ask messy questions. They mix symptoms, assumptions, part numbers, variants, and follow-up questions into one sentence. The information needed to answer them may be spread across manuals, tickets, catalogs, configuration tables, and internal documentation.
That is why I think strong RAG systems are built as a stack.
Keyword search gives precision for exact terms like error codes, part numbers, and machine variants. Vector search gives semantic recall when users and documents use different wording. Hybrid search combines both. RRF helps merge rankings. Reranking improves precision. Query planning and decomposition help when one user question contains several information needs. Agentic retrieval goes further by planning, searching the right sources, checking what is missing, and merging the evidence into a grounded answer.
There are simpler alternatives. BM25 alone can be enough for exact internal search. Vector search alone can work for semantic discovery. Hybrid search with reranking is often a very strong production baseline. Agentic retrieval should be used when the question is complex enough to justify the extra latency and cost.
My opinion is that reliable AI systems are not just LLMs connected to vector databases. They are systems that understand the query, retrieve the right evidence, and give the model the context it needs to answer responsibly.