NoLiMa, released in February 2025, is a benchmark for assessing how well Large Language Models (LLMs) comprehend long texts. Unlike the traditional Needle-in-a-Haystack (NIAH) test, which can be solved by keyword matching, NoLiMa crafts its questions and key information so that finding the answer in a long text forces the model to engage in deep semantic understanding and reasoning.
NoLiMa: https://arxiv.org/abs/2502.05167
NoLiMa's results reveal an important issue: LLMs that claim to process hundreds of thousands or even millions of tokens perform markedly worse on tasks that genuinely require long-text comprehension. For example, at a length of 32K tokens, 10 of the tested models score less than half of what they achieve on short texts (under 1K tokens); even the best-performing model, GPT-4o, drops from a near-perfect 99.3% to 69.7%.
Inspired by NoLiMa, we ran similar experiments with the vector model jina-embeddings-v3. The reason for studying vector models is that in retrieval-augmented generation (RAG) systems, the quality of the retrieval model (also called the vector or embedding model) directly determines the effectiveness of the whole system.
Our research focuses on two central questions:
- Can vector models perform "one-hop reasoning" in long texts? In the traditional NIAH test, the question and the answer usually match directly (e.g., "What year did John go to Paris?" and "John went to Paris in 2019"). The "needles" we design, by contrast, require the model to reason semantically: the question is "Which character has been to Dresden?", the needle is "Yuki lives next to the Semper Opera House", and the model has to know that the Semper Opera House is in Dresden.
- Can query expansion make long-text retrieval work better? Query expansion adds related words to a query to enrich its semantics. We want to see whether this approach can compensate for the weaknesses of vector models when dealing with long texts.
Traditional NIAH test (allows keyword matching) vs. NoLiMa test (requires semantic reasoning)
Experiments with LLMs have shown that they rely too heavily on surface-level text matching and not enough on deeper reasoning. We want to know whether the same is true of vector models, which may reveal what current semantic search technology is still missing.
Construction of key messages and context
Construction of key information
In a traditional "needle in a haystack" test, the key information (the "needle") is usually worded much like the question used to find it. For example:
- QUESTION: "Which character has been to Dresden?"
- Key message: "Yuki lives in Dresden."
The NoLiMa paper does not do this, and neither do we. We want to test the model's semantic understanding, not simple keyword matching. So we designed "one-hop" variants ("one-hop" meaning that linking the answer to the question requires a small inference step), deliberately used wording that does not appear in the question, and also used inverted word order.
- QUESTION: "Which character has been to Dresden?"
- Key information (default): "In fact, Yuki lives next to the Semper Opera House."
- Key information (inverted): "The Semper Opera House is right next to where Yuki lives."
Following the methodology of the paper, we generated multiple categories of question / key-information groups, each containing a question, a "one-hop" key information item, and an inverted version of that "one-hop" key information.
Examples are shown below:
| Category | Question | Original key information (for reference only) | One-hop key information | Inverted one-hop key information |
|---|---|---|---|---|
| Dietary restrictions | Which character cannot eat fish? | Alice cannot eat fish. | Alice then mentions that she has been a vegetarian for many years. | A vegetarian diet has been important to Alice for many years. |
| Medical condition | Which character cannot drink milk? | Bob cannot drink milk. | Bob explains that he is lactose intolerant. | Lactose intolerance affects Bob every day. |
| Language ability | Which character speaks French? | Charlie speaks French. | Actually, Charlie studied at the Sorbonne. | Charlie completed his degree at the Sorbonne. |
| Professional background | Which character is a musician? | Diane is a musician. | Diane conducted at the Sydney Opera House in 2013. | The Sydney Opera House performance was conducted by Diane. |
💡 The names above are just examples. In the actual needles, names are drawn at random from a list of names from different cultures.
In addition, the "original key information" (i.e., the literally matched version) in the table is just for your convenience, and will not be used in our experiments.
Construction of the context
We prepared ten publicly available books, each with at least 50,000 tokens, and randomly selected short fragments from each (each fragment no longer than 250 tokens). We then spliced these fragments together to form contexts of different lengths: 128, 256, 512, 1024, 2048, 4096, and 8192 tokens. Finally, we placed one piece of key information in each context:
Build context with short segments and key messages from the book
To be more specific, suppose we take the key information "In fact, Yuki lives next to the Semper Opera House" and place it at the 50th token of a 128-token context:
Needle in a haystack example
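For illustration, here is a minimal Python sketch of how a context of a target token length could be assembled from book fragments and how a needle can be dropped in at a chosen position. The fragment list, the whitespace tokenization, and the helper name are our own simplifying assumptions, not the exact pipeline used in the experiments.

```python
import random

def build_context(fragments, needle, target_len, needle_pos):
    """Splice random book fragments into a haystack of roughly target_len
    tokens and insert the needle so it starts at token index needle_pos.
    Tokens are approximated by whitespace splitting here; the real
    experiments count model (subword) tokens."""
    needle_tokens = needle.split()
    budget = target_len - len(needle_tokens)   # leave room for the needle

    tokens = []
    pool = list(fragments)
    random.shuffle(pool)
    for fragment in pool:                      # splice short fragments together...
        tokens.extend(fragment.split())
        if len(tokens) >= budget:              # ...until the length budget is reached
            break
    tokens = tokens[:budget]

    tokens[needle_pos:needle_pos] = needle_tokens   # drop the needle in at the chosen position
    return " ".join(tokens)

# Hypothetical usage: a ~128-token context with the needle starting at token 50.
fragments = ["a short excerpt of at most 250 tokens taken from one of the ten books"] * 30
needle = "In fact, Yuki lives next to the Semper Opera House."
context = build_context(fragments, needle, target_len=128, needle_pos=50)
```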
We use the jina-embeddings-v3 model to embed the question and the context ("haystack"), and then compute the similarity score between them:
Question-Haystack similarity = 0.2391
To make sense of this similarity score, we need one more step: normalization. We first compute the similarity between the question and the default key information on its own (no context, direct comparison). Then we divide the question-context similarity above by this question-key-information similarity:
Question-Needle similarity = 0.3598
Normalized Query - Haystack similarity = 0.2391 / 0.3598 = 0.6644
Why normalize? Because raw similarity scores differ from one vector model to another, and jina-embeddings-v3 tends to underestimate the similarity between two texts.
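As a minimal sketch of this normalization step (assuming the model is loaded through sentence-transformers; any embedding backend that returns vectors works the same way), the snippet below embeds the question, the needle, and the haystack, then divides the question-haystack cosine similarity by the question-needle similarity:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: the Hugging Face checkpoint is loaded via sentence-transformers;
# the default adapter is used here, while the experiments use the text-matching LoRA.
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question = "Which character has been to Dresden?"
needle   = "In fact, Yuki lives next to the Semper Opera House."
haystack = "... the spliced 128-token context that contains the needle ..."  # placeholder text

q_emb, n_emb, h_emb = model.encode([question, needle, haystack])

q_h = cosine(q_emb, h_emb)   # question-haystack similarity (e.g. 0.2391)
q_n = cosine(q_emb, n_emb)   # question-needle similarity, the ideal case (e.g. 0.3598)
print(q_h / q_n)             # normalized query-haystack similarity (e.g. 0.6644)
```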
For each piece of key information (both the default and the inverted version) and each context length, we generated 10 contexts, placing the key information at a different position in each. For the same key information and the same context length, the 10 contexts look like this:
Place key information at regular intervals in ten contexts
In addition, as a control, we also generated a context without any key information for each test condition (each context length). In total, this gives us 3,234 generated contexts.
Finally, we encode each context with the jina-embeddings-v3 model (using the default text-matching LoRA). If a context exceeds 8,192 tokens (the model's upper limit), we truncate the excess. We encode each question in the same way.
Evaluation metrics
We designed an evaluation framework with several different metrics to measure the performance of vector models under different context lengths:
Primary metrics
1. Normalized similarity scores
This is the core metric. Rather than simply measuring the semantic similarity between the question and the whole context, it also compares the question directly with the key information. This tells us how the model performs on a context containing the key information relative to the ideal case (a direct comparison of the question and the key information).
The calculation is: first compute the cosine similarity between the question and its corresponding key information as a baseline; then divide the question-context similarity by this baseline to obtain the normalized similarity score.
2. How much better than random guessing
With a vector model, it only makes sense to compare the similarity of the same question against different texts. So, in addition to the normalized similarity score, we check whether the question is really more similar to the full context than to a random passage of the same length that contains no key information. In other words, we want to know whether the answer the model finds is genuinely better than a blind guess.
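A minimal sketch of this check, with hypothetical similarity scores: for each question, compare its similarity to the context containing the needle against its similarity to a control context of the same length without the needle, and report how often the needle-bearing context wins.

```python
import numpy as np

def comparison_ratio(sim_with_needle, sim_without_needle):
    """Fraction of test cases where the question is more similar to the
    context containing the key information than to a control context of
    the same length without it. 0.5 ~= random guessing."""
    wins = np.asarray(sim_with_needle) > np.asarray(sim_without_needle)
    return float(wins.mean())

# Hypothetical similarity scores for a batch of questions at one context length.
sim_pos = [0.24, 0.19, 0.31, 0.22]   # question vs. context with the needle
sim_neg = [0.18, 0.21, 0.25, 0.17]   # question vs. control context (no needle)
print(comparison_ratio(sim_pos, sim_neg))  # 0.75 -> better than chance
```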
Secondary metrics
1. Discrimination analysis
This metric looks at the model's ability to distinguish key information from irrelevant content, in two specific respects (a sketch of both computations follows this list):
- Average separation: how large the gap in similarity score is between contexts that contain the answer ("positives") and contexts that do not ("negatives").
- AUC (Area Under the Curve) score: the area under the ROC (receiver operating characteristic) curve, which measures how well the model separates key information from other content.
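A minimal sketch of both computations, assuming scikit-learn and using hypothetical scores: the separation is the mean similarity of positive contexts minus that of negatives, and the AUC treats the similarity score as a ranking signal over positives and negatives.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical similarity scores at one context length.
pos_scores = np.array([0.26, 0.31, 0.24, 0.29])   # contexts containing the needle
neg_scores = np.array([0.17, 0.21, 0.19, 0.22])   # control contexts without it

separation = pos_scores.mean() - neg_scores.mean()

labels = np.concatenate([np.ones_like(pos_scores), np.zeros_like(neg_scores)])
scores = np.concatenate([pos_scores, neg_scores])
auc = roc_auc_score(labels, scores)   # 1.0 = perfect ranking, 0.5 = random

print(separation, auc)
```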
2. Positional effects
We also examine whether the position of the key information within the context affects how easily the model finds it. We analyze the following (a sketch of these computations follows the list):
- whether there is a relationship (correlation coefficient) between the position of the key information and the similarity score;
- how the model's performance changes (regression slope) when the key information is placed at different positions;
- how performance differs when the placements are grouped by position.
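A minimal sketch of these analyses, with hypothetical data and assuming SciPy: a correlation and a regression slope between needle position and score, plus a simple beginning/middle/end grouping.

```python
import numpy as np
from scipy.stats import pearsonr, linregress

# Hypothetical data: relative needle positions (0 = start, 1 = end of context)
# and the corresponding normalized similarity scores.
positions = np.array([0.0, 0.1, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9, 1.0])
scores    = np.array([0.72, 0.66, 0.58, 0.55, 0.52, 0.53, 0.56, 0.61, 0.63])

corr, p_value = pearsonr(positions, scores)      # correlation between position and score
slope = linregress(positions, scores).slope      # how the score changes as the needle moves

# Group placements into beginning / middle / end bins and compare averages.
bins = np.digitize(positions, [1/3, 2/3])        # 0 = beginning, 1 = middle, 2 = end
for name, b in zip(["beginning", "middle", "end"], range(3)):
    print(name, scores[bins == b].mean())
```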
Findings
Similarity Score and Accuracy Decline as Text Becomes Longer
The experimental results are clear: the longer the context, the worse the model performs. The average similarity score drops from 0.37 at 128 tokens all the way down to 0.10 at 8K tokens, and the drop is not linear; it is especially steep between 128 and 1K tokens.
Normalization performance versus context length
We also found that reversing the wording of the key information (the inverted version) has little effect on whether the model finds it. Whether it is "In fact, Yuki lives next to the Semper Opera House" (the default wording) or "The Semper Opera House is right next to where Yuki lives" (the inverted wording), the probability that the model finds it is almost the same:
Comparison of model performance for the two phrasings (default order vs. inverted order)
However, the content type of the key information does affect how hard it is for the model to find. Information about locations and landmarks is easier to find; information about diet and health conditions is harder, and the difficulty grows faster as the text gets longer:
Relationship between finding difficulty (normalization performance) and text length for different types of information (grouping)
To see whether the model is really better than guessing, we compared its results with a "random guess": a passage as long as the context but containing no key information. We found that the longer the context, the closer the model's results get to blind guessing; it might as well pick a useless passage at random.
Comparison of model performance and random probability (probability 0.5)
We also grouped the data according to the content type of the key information, and then looked at the model's performance. The results were similar: for some types of information (e.g., dietary restrictions), the model was not much better than guessing, even if the text was not too long; for other types of information (e.g., locations and landmarks), the model performed well no matter how long the text was:
Probability of the model finding an answer versus random guessing for different types of information groupings
Reversing the wording of the key information has essentially no effect on the probability of the model finding it. The figure below shows how much higher the model's probability of finding the text that actually contains the key information is than random chance, looking at the two wordings (default and inverted) separately:
Default order vs. reverse order, how much more likely is the model to find the answer than a random guess?
As the figure shows, the performance trend is similar for both wordings, so we do not distinguish between the two cases in what follows.
Can the model still distinguish between useful and useless information?
One of our most important findings concerns the ability of vector models to distinguish useful from useless information in texts of different lengths. We performed a "separation analysis" and found that the model's ability to find the right answer drops off particularly fast between 128 and 1,000 tokens. After that it keeps declining, but at a slower rate.
Relationship between separation and context length
In short texts (128 tokens), the model clearly distinguishes useful from useless information: the average separation is 0.1 and the AUC is 0.81 (i.e., in 81 out of 100 cases, the context containing the answer is ranked above a context without it).
However, as the text gets longer, performance drops dramatically. At 1,000 tokens the separation falls to 0.04 (a 60% drop) and the AUC falls to 0.66, showing that the model is already struggling to discriminate. By 8,000 tokens the separation is almost zero (0.001) and the AUC is close to 0.5 (comparable to random guessing), meaning the model can no longer pick out the useful information from similarity scores alone.
The rate at which the model's ability to distinguish useful information decays with text length is striking. While the raw similarity score drops by about 75% from 128 to 8,000 tokens, the separation metric drops by almost 99%, and the effect size drops by 98.6%! The difficulty vector models have with long text lies not only in lower similarity scores, but in a severe loss of the ability to tell useful information from useless information, far more than we had expected.
How does the location of key information affect the difficulty of finding it?
In general, key information is easiest to find when placed at the very beginning of the text. Placing it in the middle, however, is not necessarily so easy:
The effect of placing key information in different locations in texts of different lengths on finding it
The experimental results confirm that key information is easiest to find when placed at the beginning. When the text is short, placing it near the end also makes it relatively easy to find. Regardless of text length, however, information placed in the middle is not so easy to find:
Compare the probability of finding the key information by placing it in different locations.
Can query expansion help?
We recently published a blog post about "query expansion", a common technique in search: when you ask a question, you add related words to it so the search results are more accurate.
LLM-based query expansion: more information, more accurate searches
Since the advent of vector models, the way we search has changed a lot. Is a method like "query expansion", which relies heavily on adding vocabulary, still useful in the age of AI? We think so.
In that post, we used a large language model (LLM) to generate expansion terms, added them to the query before embedding it, and found that the search results improved. Now we want to see whether this helps in a long-text retrieval task like "finding a needle in a haystack". For example, for the question:
Which character has been to Dresden?
We expand it with an LLM (Gemini 2.0), adding 100 related terms, so that it looks roughly like this:
Which character has been to Dresden? Character: fictional character Literary character Main character Villain Character Role Identity Plot character
Dresden: Dresden, Germany; World War II Bombing of Dresden Historical Fiction Kurt Vonnegut Slaughterhouse-Five Saxony City Elbe River Cultural Landmarks
Been: visited Been to Been to Appeared in Appeared in Characterized as Set in Happened in Location Background
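As a minimal sketch of how such an expanded query is used (the expansion terms are hard-coded here for illustration; in practice they come from an LLM such as Gemini 2.0), the expanded text is simply embedded and compared against the context exactly as before:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: same model loading as in the earlier sketch.
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

question = "Which character has been to Dresden?"
# Expansion terms produced offline by an LLM; a small hand-picked subset here.
expansion_terms = [
    "fictional character", "main character", "Dresden, Germany", "Saxony",
    "Bombing of Dresden", "Slaughterhouse-Five", "Elbe River",
    "visited", "set in", "location",
]
expanded_query = question + " " + " ".join(expansion_terms)

haystack = "... a long spliced context that may contain the needle ..."  # placeholder text

q_emb, h_emb = model.encode([expanded_query, haystack])
similarity = float(np.dot(q_emb, h_emb) / (np.linalg.norm(q_emb) * np.linalg.norm(h_emb)))
print(similarity)   # compared against a control context, as in the earlier metrics
```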
How useful is query expansion?
We generated three sets of expanded queries, with 100, 150, and 250 added terms respectively (for details on how terms are added, see the article above). We then reran the previous experiment three times, once with each set of expanded queries.
It turns out that no matter how many terms we add, performance still collapses once the text gets long, roughly the same as without query expansion:
Aggregate model performance for various query expansion scenarios
Compared with the unexpanded queries, every expansion setting tells the same old story: the longer the text, the worse the performance. And the decline is still uneven, with the steepest drop between 128 and 1K tokens:
The probability of the model finding the correct answer for various query expansion scenarios.
However, a closer look at the "comparison ratio" metric shows that query expansion is still useful: it makes it easier for the model to find the text that contains the key information. Without query expansion, the model performs about as well as a random guess at the 8K-token length.
How do you interpret the results of query expansion?
These results are consistent with NoLiMa's paper and our previous findings on query expansion. It can be interpreted like this:
- Adding terms in moderation works best: adding 100 terms works better than adding 150 or 250, which suggests there is a limit to how much a query should be expanded. Adding too many terms introduces semantic noise rather than signal and interferes with the model's judgment. With 250 added terms, it is likely that some terms only weakly related to the question are included, and these do not help in long texts.
- Long texts remain the central challenge: even with query expansion, model performance still drops significantly once the context gets long. The current attention-based model architecture has a fundamental bottleneck with long texts, a problem that cannot be solved simply by adding a few terms.
- Query expansion still has value: although it cannot fully overcome the long-text challenge, the comparison-ratio metric stays above 0.5, showing that query expansion remains effective. Even at 8,000 tokens, an expanded query is more likely to find the correct answer than a random guess. This suggests that query expansion is a promising direction for improving the long-text ability of vector models and is worth exploring further.
Impact of literal matching on vector models?
In the experiments above, to measure the vector model's ability to perform "one-hop reasoning" in long texts, we deliberately avoided any literal overlap between questions and key information. The results show that even with query expansion, the model's ability to find the relevant information in long texts deteriorates. This is interesting: in principle, the vector model should be able to do this kind of reasoning on its own, without extra help. After all, we only replaced "Dresden" with "Semper Opera House", essentially swapping one term for a semantically related one.
So how important is literal matching in semantic matching? Or does text length have a greater impact? To find out, we redesigned our experiments so that there are literal repetitions between key messages and questions, for example:
- QUESTION: "Which character has been to Dresden?"
- Key information (default): "Actually, Yuki lives in Dresden."
- Key information (inverted): "Dresden is where Yuki lives."
Note that here we state "Yuki lives in Dresden" directly, rather than requiring the model to infer, as before, that "the Semper Opera House is in Dresden, so someone who lives next to it has been to Dresden."
We changed all 22 groups of questions and key information to this straightforward form, and then reran the experiment with the same jina-embeddings-v3 model, testing various text lengths and key-information positions.
Normalization performance versus context length
Model performance vs. random guess (0.5)
Comparative ratios at different locations
The results were unexpected: even when the question and the answer share the same words, the model's ability to distinguish the correct context from a random one still declines rapidly as the text gets longer. It is, of course, still slightly better than when there is no literal overlap at all.
This ultimately demonstrates that the length of the context, and the position of the key information within it, have a greater impact on the vector model's performance in the "needle in a haystack" task than the specific wording of the key information (its semantic expression).
Conclusion
Overall, our experimental results with vector models are highly consistent with the NoLiMa paper's findings for LLMs: for vector models, context length is a key factor in retrieval performance. The longer the text, the harder it is for the model to find the correct answer; and even when the question and the answer share exactly the same keywords, the model may still fail to find it.
- Performance decreases sharply with length: jina-embeddings-v3 performs well on short texts (128 tokens), but its performance drops rapidly on long texts. The normalized similarity score falls from 0.37 at 128 tokens to 0.10 at 8K tokens, and, more importantly, the model's ability to distinguish relevant from irrelevant information (what we call "separation") almost completely disappears.
- "Single-jump reasoning" is difficult.: Even with short texts, the model performs significantly worse if there is no direct literal overlap between the question and the answer. This suggests that the vector model has difficulties with "one-hop reasoning" (e.g., inferring "have been to Dresden" from "live next to the Semper Opera House").
- Query expansion helps, but it is not a cure-all: query expansion improves retrieval to some extent, especially on long texts, keeping the model above random guessing. But it does not fully solve the problems caused by long texts; performance still falls as the text grows. And terms must be added carefully: irrelevant terms introduce semantic noise and hurt performance.
- Literal matching is not the key: even when the question and the answer share the same keywords, the model still fails once the text is long. This shows that context length and the position of the answer within the text matter more for whether the model can find the answer than how the answer is phrased.
Overall, our research suggests that a vector model like jina-embeddings-v3, while good at handling short texts, is still not up to long texts that require deeper semantic understanding. This motivates us to keep exploring more effective techniques for long-text retrieval, and we hope to see a breakthrough in jina-embeddings-v4.