Basic Concepts
In information technology, retrieval is the process of efficiently locating and extracting relevant information from a large collection (typically documents, web pages, images, audio, video, or other forms of information) based on a user's query or need. Its core objective is to find the information most relevant to the user's need and present it to the user.
- Query: A search term or condition entered by the user.
- Index: A data structure built by preprocessing the data so that retrieval is efficient.
- Relevance: The extent to which the retrieved results match the query.
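The three terms above can be made concrete with a small inverted index. This is only a minimal sketch: the tokenizer is a naive whitespace split, relevance is simply the number of distinct query terms a document contains, and the documents and function names are made up for illustration.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

def build_index(docs):
    """Index: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Relevance: count how many distinct query terms each document contains."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Illustrative documents (not from any real corpus).
docs = {
    "d1": "copy a file with the cp command",
    "d2": "move a file with the mv command",
    "d3": "list directory contents",
}
index = build_index(docs)
results = search(index, "copy file")  # Query: the user's search terms
print(results)
```

Here `d1` matches both query terms and `d2` only one, so `d1` ranks first, while `d3` is never touched because the index is consulted only for the query's terms.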
RAG systems that build a knowledge base for a large language model rarely rely on a single retrieval technique; a common choice is hybrid retrieval that combines sparse (lexical) and dense (embedding-based) methods. The retrieval technique has to be matched carefully to the content being retrieved, which usually takes considerable experimentation and tuning.
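One way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF). The text does not say which fusion method is used, so RRF is an assumption chosen here because it needs only the rank positions, not comparable scores. The two input rankings are hypothetical retriever outputs, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document receives 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of two retrievers for the same query.
sparse_ranking = ["d1", "d2", "d3"]  # e.g. from a lexical model such as BM25
dense_ranking = ["d3", "d1", "d4"]   # e.g. from an embedding model

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers rank highly (`d1`, `d3`) rise above documents that only one retriever found, which is the behavior hybrid retrieval is usually after.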
Mainstream Retrieval Models
Retrieval models are mainly categorized as Boolean models, vector space models, probabilistic models, neural network models, graph models (e.g., knowledge graphs), and language models (e.g., GPT-3).
For simplicity, the mainstream retrieval models can be grouped into two categories; the core difference is how they understand and match text:
1. Lexical/Keyword-based Matching
This type of model focuses on literal word matching between queries and documents, without a deeper understanding of the meaning behind the words.
- Core idea. Count occurrences of words in documents and queries and match them.
- Main models.
  - Boolean Model. Matches simply on the presence or absence of keywords (AND, OR, NOT).
  - Vector Space Model (VSM). Documents and queries are represented as vectors of word weights and matched by vector similarity (e.g., cosine similarity). A common weighting method is TF-IDF.
  - BM25. A probabilistic ranking function that accounts for factors such as term-frequency saturation and document length; it is a cornerstone of many search engines.
- Pros. Simple, efficient, and easy to implement.
- Cons. Cannot capture semantic relationships between words, so it struggles with synonyms and polysemy.
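To make the lexical family concrete, here is a minimal sketch of BM25 scoring. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))`; production engines differ in these details, and the three documents are invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more hits to score equally.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Illustrative documents.
docs = [
    "use the cp command to copy a file",
    "use the mv command to move or rename a file",
    "printer setup and driver installation",
]
scores = bm25_scores("copy file", docs)
print(scores)
```

The first document contains both query terms and scores highest; the third shares no term with the query and scores exactly zero, which is the weakness listed under "Cons".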
2. Semantic/Meaning-based Matching
Such models attempt to capture the deep semantic information of queries and documents, not just superficial word matches.
Embedding models differ not only in the maximum input length and embedding dimension they support but also in how they interpret a sentence, and that interpretation is a primary consideration when choosing a model (although in practice most systems use general-purpose models).
For example, some models place the word "apple" closest to "fruit", while others place it closest to "cell phone".
- Core idea. Map text into a semantic space and match by semantic similarity.
- Main models.
  - Topic Models. Mine documents for latent topics and retrieve by topic similarity (e.g., LDA).
  - Embedding Models. Map words, sentences, or documents into a dense vector space (low-dimensional relative to the vocabulary size) that captures semantic information.
    - Word Embeddings. Examples include Word2Vec, GloVe, and FastText.
    - Sentence Embeddings. Examples include Sentence-BERT, Universal Sentence Encoder, and OpenAI Embeddings.
  - Dense Retrieval Models. Queries and documents are encoded into dense vectors using deep learning models (usually Transformers) and retrieved by vector similarity. Examples are DPR, Contriever, and retrieval systems built on OpenAI Embeddings.
  - Neural Interaction Models. Model the interaction between queries and documents at a finer granularity, e.g., ColBERT, RocketQA.
  - Graph Neural Network Models. Documents and queries are represented as graphs and retrieved using the graph structure.
- Pros. Better at understanding the meaning of text, handling semantic relatedness, and finding relevant information more accurately.
- Cons. Usually more complex and computationally expensive.
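The "apple" example and the idea of matching by semantic similarity can be sketched with cosine similarity over toy vectors. Everything numeric here is an assumption: the 2-dimensional vectors are hand-made stand-ins for real encoder output (axis 0 roughly "food", axis 1 roughly "electronics"), and for simplicity both hypothetical models share the same document vectors, which real models would not.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(query_vec, doc_vecs):
    """Return document names ordered from most to least similar."""
    return sorted(
        doc_vecs,
        key=lambda name: cosine(query_vec, doc_vecs[name]),
        reverse=True,
    )

# Hand-made document vectors; purely illustrative.
doc_vecs = {
    "fruit nutrition guide": [0.9, 0.1],
    "smartphone review": [0.1, 0.9],
}

# Two hypothetical embedding models encode the query "apple" differently.
apple_model_a = [0.8, 0.3]  # reads "apple" mostly as a fruit
apple_model_b = [0.2, 0.9]  # reads "apple" mostly as a phone brand

ranking_a = rank(apple_model_a, doc_vecs)
ranking_b = rank(apple_model_b, doc_vecs)
print(ranking_a[0], "|", ranking_b[0])
```

The same query text retrieves a different top document under each model, which is why a model's interpretation of the text matters when selecting an embedding model.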
Key difference:
- Lexical matching models look at the literal text, focusing on keyword occurrences.
- Semantic matching models look at the meaning, focusing on the intrinsic meaning of the text and the relationships within it.
Summary table:
| Category | Core idea | Dominant models | Application focus in RAG |
| Lexical matching | Literal word matching | Boolean Model, Vector Space Model (VSM), BM25 | Early or simple scenarios |
| Semantic matching | Understanding deep semantic information | Topic models, word embedding models, sentence embedding models (including OpenAI Embeddings), dense retrieval models (including systems built on OpenAI Embeddings), neural interaction models, graph neural network models | Mainstream choice, with a particular focus on sentence embeddings and dense retrieval |
Applications in RAG
RAG (Retrieval-Augmented Generation) is an AI framework that combines retrieval and generation techniques; its main use is to improve the accuracy and contextual relevance of generated content.
- Retrieval stage: Identify documents or passages in a large knowledge base that are relevant to the user input.
- Generation stage: Use the retrieved information as context to generate answers or content.
In RAG, the retrieval model is responsible for providing high-quality sources of information, while the generative model is responsible for generating natural language answers based on this information. Since RAG can obtain up-to-date information from external knowledge sources, it performs particularly well in answering knowledge-intensive questions.
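The two stages can be sketched as a small pipeline. This is a sketch under stated assumptions, not a reference implementation: the retriever is a toy word-overlap ranker, the prompt wording is invented, and `stub_generate` is a placeholder that stands in for a real language model call by echoing the prompt.

```python
import re

def words(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, passages, k=2):
    """Retrieval stage: keep the k passages sharing the most words with the query."""
    query_terms = words(query)
    scored = [(len(query_terms & words(p)), p) for p in passages]
    scored = [item for item in scored if item[0] > 0]  # drop non-matching passages
    scored.sort(key=lambda item: item[0], reverse=True)
    return [passage for _, passage in scored[:k]]

def build_prompt(query, context):
    """Place the retrieved passages in front of the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"
    )

def answer(query, passages, generate):
    """Generation stage: hand the augmented prompt to a text generator."""
    return generate(build_prompt(query, retrieve(query, passages)))

def stub_generate(prompt):
    """Placeholder for a real model call; it simply echoes the prompt."""
    return prompt

# Illustrative knowledge base.
passages = [
    "You can use the cp command to copy a file.",
    "The mv command moves or renames a file.",
    "Printer XYZ-123 supports duplex printing.",
]
reply = answer("How do I copy a file?", passages, stub_generate)
print(reply)
```

Passing the generator in as a parameter keeps the retrieval logic independent of any particular model API, so the toy retriever can later be swapped for BM25 or an embedding search without touching the rest.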
Application focus in RAG:
In RAG, semantic matching models are often preferred because they can more accurately retrieve contextual information relevant to the user's query, which helps the generative model produce more accurate and coherent answers. Sentence embedding models and dense retrieval models in particular (for example, retrieval based on OpenAI Embeddings) are widely used in RAG systems for their semantic representation capability and retrieval efficiency.
Case Studies
1. Applications of Lexical Retrieval
- Core idea: The retrieval system relies heavily on literal keyword matching between queries and documents.
- Case 1: Finding a Specific Command in Technical Documentation
  - Scenario: You are using a piece of software, want to know how to copy a file, and need to look up the relevant command.
  - Retrieval mechanism: The RAG system uses a lexical model (e.g., BM25) to search the software's help documentation for passages that contain keywords such as "copy file" or "file copy command".
  - Example of search results: The system may find a section of the documentation titled "File Management Commands" containing a passage such as "Using the cp command to copy a file", which explains how to copy a file.
  - How it helps generation: The retrieved passage, including the specific command, is passed to the generative model, which can then produce more precise steps, e.g., "You can use the cp command to copy a file. For example, cp source.txt destination.txt will copy source.txt to destination.txt."
  - Key point: Retrieval relies on exact keyword matching. If you phrase the request differently, such as "duplicate a document", you may not retrieve the same results.
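The key point of Case 1 can be checked with a few lines of code. This sketch assumes a simple term-overlap test with a tiny hand-picked stopword list; the passage is the documentation phrase from the case, and "duplicate a document" is a hypothetical paraphrase chosen because it shares no content word with that passage.

```python
import re

STOPWORDS = frozenset({"a", "the", "to", "of", "how"})  # tiny illustrative list

def content_terms(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def keyword_overlap(query, passage):
    """Terms that a purely lexical matcher could use to connect query and passage."""
    return content_terms(query) & content_terms(passage)

passage = "Using the cp command to copy a file"

literal = keyword_overlap("copy file", passage)
paraphrase = keyword_overlap("duplicate a document", passage)
print(literal, paraphrase)
```

The literal query overlaps on "copy" and "file", while the paraphrase overlaps on nothing, so a lexical model has no signal with which to retrieve the passage.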
- Case 2: Finding a Specific Model in a Product Catalog
  - Scenario: You want to purchase a specific model of printer, for example, "Model XYZ-123".
  - Retrieval mechanism: The RAG system searches the catalog database for entries containing the exact model name "XYZ-123".
  - Example of search results: The system finds the product entry for "Printer XYZ-123", including its name, detailed specifications, price, and other information.
  - How it helps generation: The retrieved product information can be used directly to generate an introduction, a price answer, or a purchase link for that printer model.
  - Key point: Retrieval relies on exact matching of the product model. If the user enters a vague description, such as "high-performance printer", a term-based search may not work well.
2. Applications of Semantic Retrieval
- Core idea: The retrieval system captures the deep semantic information of queries and documents, so it can find relevant content even when the exact keywords do not appear.
- Case 3: Finding Information About the Symptoms of a Disease in the Medical Literature
  - Scenario: You want to know "What are the common physical discomforts caused by high blood pressure?"
  - Retrieval mechanism: The RAG system uses a semantic model (e.g., dense retrieval based on Sentence-BERT or OpenAI Embeddings) to vectorize the query and the medical literature, and then finds the passages closest to the query vector in the semantic space. Passages can be retrieved even if they do not use the same wording, e.g., if they say "elevated blood pressure" instead of "high blood pressure", or describe specific symptoms instead of using the phrase "physical discomfort".
  - Example of search results: The system may find passages that contain the following text: "People with high blood pressure often report symptoms such as headaches, dizziness, and chest tightness. Prolonged uncontrolled high blood pressure may lead to palpitations and difficulty breathing."
  - How it helps generation: The retrieved description of hypertension symptoms is provided to the generative model, which can produce a more natural and comprehensive response: "Hypertension may cause a variety of discomforts, commonly including headache, dizziness, and chest tightness. Severe or prolonged hypertension may also cause heart palpitations and difficulty breathing."
  - Key point: The system can handle synonyms ("elevated blood pressure" vs. "high blood pressure"), related expressions ("physical discomfort" vs. "headache, dizziness"), and related concepts, which provides richer context.
- Case 4: Finding Text in a Similar Style for Creative Writing Assistance
  - Scenario: You are working on a science fiction novel and want to find passages in a similar literary style to use as inspiration. You type, "Describe a thriving vision of a futuristic city filled with towering buildings and heavy traffic."
  - Retrieval mechanism: The RAG system uses a semantic model to search a large library of science fiction texts for passages that are semantically closest to your description, even if they do not use the exact keywords "futuristic city" or "thriving".
  - Example of search results: The system might find passages such as, "Steel behemoths pierce the clouds, and glass walls reflect colorful light. Flying cars shuttle between buildings, crowds bustle on the ground, and the hum of energy fills the city that never sleeps."
  - How it helps generation: Retrieved passages with similar moods and descriptions can serve as a reference for the generative model, helping it produce text that more closely matches your desired style.
  - Key point: Capturing the implicit meaning, emotional tone, and literary style of a text goes beyond simple keyword matching and depends on semantic similarity.