Have you ever typed a keyword into a search engine and gotten results that are nothing like what you wanted? Or wanted to search for something but couldn't find the right words to express it accurately? Don't worry: a technique called "query expansion" can help with both problems.
Query expansion has recently come back into fashion. It used to be a standard component of search engines, then fell out of favor for a while for various reasons. Now, with the rise of a new technology called "agentic search", query expansion is back in the spotlight.
Why do we need query expansion?
When we usually use search engines, the search terms we enter tend to be short and colloquial. This can lead to two problems:
- Search terms that are too general: For example, if you want to know the latest progress in "artificial intelligence" but only type "artificial intelligence", the search engine cannot tell which aspect you care about.
- Search terms that are too specific: For example, you want to search for information about a disease but don't know which technical term is the most accurate.
All these issues affect the quality of search results. Even more advanced search techniques, such as Agentic Search, face the same challenges.
What is Agentic Search?
Agentic search is a smarter way to search. You can think of it as an intelligent assistant that not only understands the keywords you type, but also uses context and your intent to find more accurate and comprehensive information.
For example, if you search for "how to make a cake", a traditional search engine may only return pages containing the words "make" and "cake". But agentic search understands that your intent is to learn how to bake a cake, so it may return detailed tutorials, videos, or even recipes for different kinds of cakes.
Although agentic search is smarter, it still struggles to capture our intent accurately if our search terms are too short or vague. To solve this problem, we need a technique that "expands" or "rewrites" our search terms so that they express our intent more accurately and completely. That technique is query expansion.
What is query expansion?
Query expansion is a technique for improving search results. Its core idea is simple: add related words to your original search terms to make it easier for the search engine to find what you want.
For example, if you search for "how to make braised pork", query expansion might automatically add words like "recipe", "cooking method", "home-style", and "pork belly". The results will then include not only pages with "braised pork" in the title or body, but also recipes, home-style variations, and tutorials that use "pork belly" instead, making the results more comprehensive and better matched to your needs.
Figure 1: Flowchart of query expansion using a synonym dictionary
Query expansion can be used in all types of search engines, from traditional keyword engines to more advanced agentic search engines. For agentic search, query expansion helps the system better understand the user's intent and thus return more relevant results.
In traditional search engines, query expansion is mainly used to solve the following two problems:
- Morphology: Different forms of the same word (e.g., "run" and "running") are treated as different words, leading to incomplete results.
- Synonyms and related words: If you search for "lose weight", a traditional search engine may miss pages that say "go on a diet", "fat loss", or "weight control", even though these terms are highly relevant.
Many approaches have been tried for implementing query expansion, such as:
- Manually built synonym dictionaries: like a thesaurus that tells you which words have similar meanings.
- Automatically mining related words from large text corpora: deciding whether words are related by analyzing which ones frequently appear together.
- Analyzing Search History: See what other keywords people use when searching for similar content.
- Based on user feedback: Let the user tell the search engine which words are relevant.
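The synonym-dictionary approach from the list above can be sketched in a few lines of Python. The dictionary entries and function below are illustrative inventions, not taken from any real thesaurus:

```python
# A minimal sketch of synonym-dictionary query expansion.
# The dictionary entries are invented examples, not a real thesaurus.
SYNONYMS = {
    "lose weight": ["go on a diet", "fat loss", "weight control"],
    "braised pork": ["recipe", "home-style", "pork belly"],
}

def expand_query(query: str) -> str:
    """Append dictionary synonyms for every key phrase found in the query."""
    extra = []
    for term, related in SYNONYMS.items():
        if term in query.lower():
            extra.extend(related)
    return query if not extra else query + " " + " ".join(extra)

print(expand_query("how to lose weight fast"))
# "how to lose weight fast go on a diet fat loss weight control"
```

Real systems replace the hand-written dictionary with mined co-occurrence data or search-log analysis, but the expansion step itself is this simple.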
Semantic Vector Models and Query Expansion
In recent years, with the development of artificial intelligence, a new technology called "semantic vector models" (embeddings) has emerged. Think of such a model as a "word translator" that converts each piece of text into a string of numbers, called a vector. These numbers represent meaning: the closer two pieces of text are in meaning, the closer their vectors.
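The "closer meaning, closer vectors" idea is usually measured with cosine similarity. Here is a toy sketch using hand-made 3-dimensional vectors; the numbers are invented for illustration, while a real embedding model outputs vectors with hundreds of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors; a real model (e.g. jina-embeddings-v3) learns
# high-dimensional vectors from text rather than using hand-picked numbers.
vectors = {
    "cake":   [0.9, 0.1, 0.0],
    "pastry": [0.8, 0.2, 0.1],
    "engine": [0.0, 0.1, 0.9],
}

# "cake" is closer in meaning to "pastry" than to "engine".
assert cosine_similarity(vectors["cake"], vectors["pastry"]) > \
       cosine_similarity(vectors["cake"], vectors["engine"])
```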
With semantic vector models, search engines should in theory be smarter, and query expansion might seem unnecessary. For example, if you search for "how to make braised pork", the model should already know that "recipe" and "cooking method" are close in meaning to "braised pork", so the engine should find a recipe even if you never type those words.
In reality, however, semantic vector models are not perfect. The vectors they produce can encode ambiguous information, leading to less accurate results.
For example, if you search for "apple", the engine may return results about the iPhone, results about the fruit, and results about Apple Inc.'s stock. If what you actually want are articles about apple-growing techniques, they may be drowned out by all the other results. Adding a term such as "cultivation" to the query helps the engine pin down our intent and find results that better match our needs.
Query Expansion with Large Language Models (LLMs)
Now we have a much more powerful tool for query expansion: the large language model (LLM).
What is an LLM? Think of it as a super-knowledgeable "linguist" trained on massive amounts of text data, which has absorbed a wealth of knowledge and language skills in the process.
There are several significant advantages of doing query expansion with LLM over traditional query expansion methods:
- Vast vocabulary: LLMs have seen enormous amounts of text, so they have no trouble producing suitable synonyms or related words.
- Some judgment: An LLM can make an initial assessment of which words are relevant to your search topic, something traditional methods cannot do, and help filter out irrelevant expansion terms.
- Flexible and customizable: You can tell the LLM what kind of expansion terms you want for a specific search task. It's like giving the LLM an instruction describing the results you expect.
After the LLM generates the expansion terms, the next step is the same as in traditional query expansion: append the terms to the original query, use the semantic vector model to produce a query vector, and then search with that vector.
Figure 2: Vector query expansion using an LLM
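The pipeline above can be sketched as follows. `call_llm` and `embed` are hypothetical stand-ins for a real LLM API (such as Gemini) and a real embedding model; here they are stubbed out so the sketch runs on its own:

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would send the prompt to an LLM API
    # (e.g. Gemini) and return its generated expansion terms.
    return "recipe cooking method home-style pork belly"

def embed(text: str) -> list[float]:
    # Stub: a real implementation would call an embedding model
    # (e.g. jina-embeddings-v3) and return its semantic vector.
    return [float(len(text))]  # placeholder "vector"

def expand_and_embed(query: str, size: int = 100) -> list[float]:
    """Generate expansion terms with the LLM, append them to the
    original query, and embed the combined text into a query vector."""
    prompt = (f"Please provide additional search keywords and phrases "
              f"for the following query (about {size} words): {query}")
    expansion = call_llm(prompt)
    return embed(query + " " + expansion)

query_vector = expand_and_embed("How to make braised pork")
```

With the stubs swapped for real API calls, `query_vector` is what gets compared against document vectors in the index.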
Experiments: measuring the effect of query expansion
To verify whether LLM-assisted query expansion actually works, we ran a series of experiments.
Experimental setup
- LLM: We used Google's Gemini 2.0 Flash model.
- Vector models: We used two vector models: jina-embeddings-v3 and all-MiniLM-L6-v2.
- Datasets: We used several publicly available retrieval benchmark datasets.
Experimental Methods
We designed two kinds of prompts to guide the LLM in generating expansion terms. A prompt is the instruction you give the LLM, telling it what kind of output you want.
- Generic prompts: suitable for a wide variety of search tasks.
- Task-specific prompts: prompts designed for particular search tasks (e.g., medical retrieval).
We also tested different expansion sizes: 100, 150, and 250 words.
The effect of generic prompts
We found that the following generic prompt worked well:
Please provide additional search keywords and phrases for
each of the key aspects of the following queries that make
it easier to find the relevant documents (about {size} words
per query):
{query}
Please respond in the following JSON schema:
Expansion = {"qid": str, "additional_info": str}
Return: list [Expansion]
This prompt can process multiple queries at once and generates a list of expansion terms for each one.
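Since the prompt requests the `Expansion = {"qid": str, "additional_info": str}` JSON schema, the LLM's raw response can be parsed straightforwardly. The response text below is an invented example for illustration:

```python
import json

# An invented example of what the LLM might return for two queries.
raw_response = """[
  {"qid": "q1", "additional_info": "recipe cooking method pork belly"},
  {"qid": "q2", "additional_info": "diet fat loss weight control"}
]"""

# Map each query id to its expansion terms, ready to be appended
# to the corresponding original query before embedding.
expansions = {item["qid"]: item["additional_info"]
              for item in json.loads(raw_response)}
```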
We first tested it with the jina-embeddings-v3 model with the following results:
| Test set | No expansion | 100 words | 150 words | 250 words |
| --- | --- | --- | --- | --- |
| SciFact (fact checking) | 72.74 | 73.39 | 74.16 | 74.33 |
| TRECCOVID (medical retrieval) | 77.55 | 76.74 | 77.12 | 79.28 |
| FiQA (financial opinion retrieval) | 47.34 | 47.76 | 46.03 | 47.34 |
| NFCorpus (medical information retrieval) | 36.46 | 40.62 | 39.63 | 39.20 |
| Touche2020 (argument retrieval) | 26.24 | 26.91 | 27.15 | 27.54 |
As you can see from the results, query expansion improves search in most cases.
To verify the effectiveness of query expansion across models, we repeated the same test with the all-MiniLM-L6-v2 model, with the following results:
| Test set | No expansion | 100 words | 150 words | 250 words |
| --- | --- | --- | --- | --- |
| SciFact (fact checking) | 64.51 | 68.72 | 66.27 | 68.50 |
| TRECCOVID (medical retrieval) | 47.25 | 67.90 | 70.18 | 69.60 |
| FiQA (financial opinion retrieval) | 36.87 | 33.96 | 32.60 | 31.84 |
| NFCorpus (medical information retrieval) | 31.59 | 33.76 | 33.76 | 33.35 |
| Touche2020 (argument retrieval) | 16.90 | 25.31 | 23.52 | 23.23 |
From the results, it can be seen that query expansion has a significant improvement on the search results, especially for smaller models like all-MiniLM-L6-v2.
The table below summarizes the average improvement of each model across all tasks:

| Model | 100 words | 150 words | 250 words |
| --- | --- | --- | --- |
| jina-embeddings-v3 | +1.02 | +0.75 | +1.48 |
| all-MiniLM-L6-v2 | +6.51 | +5.84 | +5.88 |
The lift for all-MiniLM-L6-v2 is much larger than for jina-embeddings-v3, probably because all-MiniLM-L6-v2's baseline performance is lower. The jina-embeddings-v3 model already understands the meaning of search terms well on its own, so query expansion has less extra help to offer.
However, this result also shows that query expansion can significantly improve the search results of some models with average performance, allowing them to perform well in some situations.
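As a sanity check, the average lifts in the 100-word column of the summary table can be recomputed from the rounded per-task scores in the two result tables above (tiny rounding differences against the reported figures are possible for the other columns):

```python
# Per-task scores copied from the tables above (no expansion vs. 100 words),
# in order: SciFact, TRECCOVID, FiQA, NFCorpus, Touche2020.
baseline = {
    "jina-embeddings-v3": [72.74, 77.55, 47.34, 36.46, 26.24],
    "all-MiniLM-L6-v2":   [64.51, 47.25, 36.87, 31.59, 16.90],
}
expanded_100 = {
    "jina-embeddings-v3": [73.39, 76.74, 47.76, 40.62, 26.91],
    "all-MiniLM-L6-v2":   [68.72, 67.90, 33.96, 33.76, 25.31],
}

# Average per-task lift for each model.
avg_lift = {
    model: round(sum(e - b for b, e in zip(baseline[model], scores)) / len(scores), 2)
    for model, scores in expanded_100.items()
}
print(avg_lift)  # matches the +1.02 and +6.51 reported above
```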
Task-specific prompts
We found that generic prompts, while effective overall, may introduce irrelevant words that actually hurt search quality. We therefore designed more specific prompts for two particular search tasks (fact checking and financial opinion retrieval). For example, the fact-checking prompt:
Please provide additional search keywords and phrases for
each of the key aspects of the following queries that make
it easier to find the relevant documents that support or
reject the scientific fact in the query field (about {size}
words per query):
{query}
Please respond in the following JSON schema:
Expansion = {"qid": str, "additional_info": str}
Return: list [Expansion]
The experimental results show that these more specific prompts improve search quality in almost all cases (the numbers in parentheses show the change relative to the generic-prompt results):
| Test set | Model | No expansion | 100 words | 150 words | 250 words |
| --- | --- | --- | --- | --- | --- |
| SciFact | jina-embeddings-v3 | 72.74 | 75.85 (+2.46) | 75.07 (+0.91) | 75.13 (+0.80) |
| SciFact | all-MiniLM-L6-v2 | 64.51 | 69.12 (+0.40) | 68.10 (+1.83) | 67.83 (-0.67) |
| FiQA | jina-embeddings-v3 | 47.34 | 47.77 (+0.01) | 48.20 (+1.99) | 47.75 (+0.41) |
| FiQA | all-MiniLM-L6-v2 | 36.87 | 34.71 (+0.75) | 34.68 (+2.08) | 34.50 (+2.66) |
As the table shows, the task-specific prompts improve on the generic prompts in every setting except all-MiniLM-L6-v2 on SciFact with 250 expansion words.
For the jina-embeddings-v3 model, adding 100 or 150 expansion words gives the best results, while 250 words makes things worse. This shows that more expansion terms are not always better: adding too many words can hurt search quality.
Benefits and challenges of query expansion
Advantages
- Better search results: Query expansion lets search engines understand your intent better and find more relevant, more comprehensive information.
- Especially effective for weaker models: Query expansion can lift models with mediocre baseline performance to respectable results.
Challenges
- Cost: Using an LLM adds latency and compute cost to each search, and paid LLM services incur extra fees.
- Prompt design: Writing good prompts is not easy and requires extensive experimentation and tuning. Different LLMs, different vector models, and different search tasks may each need different prompts.
- Other optimizations: If your vector model performs poorly, simply switching to a better model may be more cost-effective than investing in query expansion.
Future directions
While query expansion still has some challenges, we believe it will play an important role in the future of search technology. We are exploring the following directions:
- Investigate whether query expansion can also be used to improve the vector representations of documents.
- Explore the use of query expansion in other AI search techniques, such as reranking.
- Compare LLM-generated expansion terms with those produced by traditional methods (e.g., synonym dictionaries).
- Train LLM specifically for the query expansion task.
- Optimize the number of extensions and avoid adding too many words at once.
- Research how to recognize good extensions and bad extensions.
All code and experimental results are open source; you are welcome to explore and reproduce them:
llm-query-expansion: https://github.com/jina-ai/llm-query-expansion/