LlamaIndex Team Introduces Next Generation Visual Document Retrieval Model vdr-2b-multi-v1

We are launching vdr-2b-multi-v1, the best multilingual embedding model for visual document retrieval. We are also releasing its English-only counterpart, vdr-2b-v1, and open-sourcing the new vdr-multilingual-train dataset. With 500,000 high-quality samples, it is the largest open-source multilingual synthetic dataset for visual document retrieval.

We proudly present vdr-2b-multi-v1 (🤗), a multilingual embedding model designed for visual document retrieval across multiple languages and domains. It encodes document page screenshots into dense single-vector representations, which makes it possible to search and query visually rich multilingual documents without any OCR, data extraction pipelines, or chunking.

vdr-2b-multi-v1 is based on MrLight/dse-qwen2-2b-mrl-v1 and is trained on a large, self-built dataset of multilingual query-image pairs. The model was built in collaboration with LlamaIndex and is the next iteration of mcdse-2b-v1. It extends and improves the lessons and methods used to train its predecessor, resulting in a more powerful model.

Trained on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German: together they form a new large open-source multilingual training dataset of 500,000 high-quality samples.
Low VRAM usage and faster inference: on the synthetic datasets of the Visual Document Retrieval (ViDoRe) benchmark, our English-only model with 768 image patches performs better than the base model with 2560 image patches. This results in 3x faster inference and much lower VRAM usage.
Cross-lingual retrieval: substantially better in real-world scenarios. For example, you can search German documents with an Italian query.
Matryoshka Representation Learning: you can reduce the vector size by a factor of 3 and still keep 98% of the embedding quality. This can dramatically speed up retrieval while reducing storage costs.

Usage

🎲 Try out vdr-2b-multi-v1 now: it is available in a Hugging Face Space!

Thanks to direct integrations with SentenceTransformers and LlamaIndex, generating embeddings with vdr-2b-multi-v1 is easier than ever. Get started with just a few lines of code:

Via LlamaIndex

pip install -U llama-index-embeddings-huggingface

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cpu",  # "mps" for mac, "cuda" for nvidia GPUs
    trust_remote_code=True,
)

image_embedding = model.get_image_embedding("image.png")
query_embedding = model.get_query_embedding("Chi ha inventato Bitcoin?")

Via SentenceTransformers

import torch  # needed for torch.bfloat16 below
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    model_name_or_path="llamaindex/vdr-2b-multi-v1",
    device="cuda",
    trust_remote_code=True,
    # These are the recommended kwargs for the model, but change them as needed if you don't have CUDA
    model_kwargs={
        "torch_dtype": torch.bfloat16, 
        "device_map": "cuda:0", 
        "attn_implementation": "flash_attention_2"
    },
)

embeddings = model.encode("image.png")
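
Once page and query embeddings are available, retrieval comes down to a nearest-neighbour search. The following is a minimal sketch of that step, assuming cosine similarity as the scoring function (the article does not name one) and using random vectors in place of real model outputs. `cosine_similarity` and `rank_pages` are hypothetical helper names, not part of either integration.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_pages(query_embedding, page_embeddings, top_k=5):
    """Return the indices of the top_k pages, most similar first."""
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(page_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]

# Stand-in data: 10 random "page" vectors; the query is a noisy copy of page 3.
random.seed(0)
dim = 64
page_embeddings = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(10)]
query_embedding = [x + 0.1 * random.gauss(0, 1) for x in page_embeddings[3]]

top_pages = rank_pages(query_embedding, page_embeddings, top_k=3)
print(top_pages)
```

With real data, the lists above would be replaced by the vectors returned from the embedding calls shown earlier; at larger scale a vector index would normally replace the exhaustive loop.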

Training dataset

Training good single-vector models for visual document retrieval requires high-quality data, but the multimodal datasets currently available off the shelf are scarce and not multilingual.

Therefore, we spent a lot of time building one from scratch. The raw dataset contains 500,000 multilingual query-image samples, collected and generated from scratch using public internet PDFs. The query associated with each image was synthetically generated using vision-language models (VLMs). For comparison, our dataset has 10 times more samples than the largest previous open-source synthetic dataset for multimodal visual document retrieval (i.e., the scraped documents generated for the ColPali training dataset).

Data collection

For each language, we generate a long list of search queries covering many different topics, which we then use to search for PDFs. We use the search engine's language filtering capabilities to scrape only documents in the specified language. This "search by topic" technique ensures that the model has seen many different topics and domains and performs well in real-life scenarios.

The scraping process produced approximately 50,000 multilingual documents. In contrast to the approach used for the previous mcdse-2b-v1 model, pages are not extracted randomly. Instead, each page of each PDF is run through a document layout analysis model to determine whether it contains more textual or more visual elements. The result is a number that classifies the page as text-only, visual-only, or mixed. This labeling step was then used to sample approximately 100,000 pages, ensuring they were evenly distributed by page type.
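
The type-balanced sampling described above can be sketched as follows. This is an illustration under stated assumptions: the layout score is taken to be a "visual ratio" in [0, 1], and the thresholds, the helper names `page_type` and `sample_balanced`, and the toy corpus are all hypothetical. The article does not document the actual score or cut-offs.

```python
import random
from collections import defaultdict

def page_type(visual_ratio, low=0.2, high=0.8):
    """Map a layout score in [0, 1] to a page type (thresholds are hypothetical)."""
    if visual_ratio <= low:
        return "text"
    if visual_ratio >= high:
        return "visual"
    return "mixed"

def sample_balanced(scored_pages, per_type, seed=0):
    """Draw up to `per_type` page ids for each page type."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for page_id, visual_ratio in scored_pages:
        buckets[page_type(visual_ratio)].append(page_id)
    return {
        kind: rng.sample(ids, min(per_type, len(ids)))
        for kind, ids in buckets.items()
    }

# Toy corpus skewed towards text-heavy pages, as random extraction would be.
scored_pages = [(f"page-{i}", 0.05) for i in range(70)]
scored_pages += [(f"page-{i}", 0.95) for i in range(70, 85)]
scored_pages += [(f"page-{i}", 0.50) for i in range(85, 100)]

balanced = sample_balanced(scored_pages, per_type=10)
print({kind: len(ids) for kind, ids in balanced.items()})
```

Sampling per bucket rather than from the whole pool is what keeps text-heavy pages from dominating the final set.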

Synthetic generation

Queries were then generated using gemini-1.5-pro and Qwen2-VL-72B. The models were tasked with producing one specific question and one general question. Only the specific question is used to train the model, but forcing the LLM to distinguish between the two usually yields stronger specific questions for information retrieval training.

Once generated, further cleanup steps ensure that the questions are good enough for training. These include:

Ensuring the language is correct
Fixing formatting issues
Removing markdown
Ensuring that only one question is asked
Removing grounding phrases (e.g., "according to Figure 1", "this document", ...)
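
A few of the cleanup steps listed above can be approximated with regular expressions. The sketch below is only an illustration: the patterns, the `GROUNDING_PATTERNS` list, and the `clean_query` helper are hypothetical, and language detection is left out because it would need an external library.

```python
import re

# Hypothetical grounding phrases; a real pipeline would need a longer,
# per-language list.
GROUNDING_PATTERNS = [
    r"\baccording to (?:figure|table|page) \d+,?\s*",
    r"\bin this document,?\s*",
]

def clean_query(raw):
    """Return a cleaned single question, or None if it should be discarded."""
    text = re.sub(r"[*_`#>]+", "", raw)  # strip common markdown markers
    for pattern in GROUNDING_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text)      # collapse whitespace
    text = re.sub(r"\s+\?", "?", text).strip()
    # Keep only outputs that contain exactly one question.
    if text.count("?") != 1 or not text.endswith("?"):
        return None
    return text[0].upper() + text[1:]

cleaned = clean_query("**According to Figure 1, what was the revenue in 2020?**")
rejected = clean_query("What is X? And what is Y?")
print(cleaned, rejected)
```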

Filtering and hard-negative mining

This cleanup step ensures that the queries are syntactically correct and follow some strict guidelines. However, it still does not ensure that the queries are good enough for information retrieval.

To filter out bad questions, we embedded and indexed every broad question using the voyage-3 embedding model. For each specific question, we searched the index. A query was labeled as "good" if its associated broad question appeared in the top 100 results. This method removes questions that have low entropy, are duplicates, or are too similar to others. On average, 40% of the queries were removed from each language dataset.
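
The filtering rule can be sketched with plain vectors. This is a toy illustration, assuming cosine similarity and an exhaustive search; the real pipeline used voyage-3 embeddings and a cut-off of 100 results, while the example below uses hand-written 2-D vectors and `top_k=1` so the effect is visible. `cosine` and `filter_queries` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filter_queries(specific_embs, broad_embs, top_k=100):
    """Keep sample i if its own broad question is among the top_k broad
    questions retrieved with its specific question."""
    kept = []
    for i, spec in enumerate(specific_embs):
        scores = [cosine(spec, broad) for broad in broad_embs]
        ranked = sorted(range(len(broad_embs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:top_k]:
            kept.append(i)
    return kept

# Samples 0 and 1 have consistent question pairs; sample 2 does not.
specific_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
broad_embs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.2]]

kept = filter_queries(specific_embs, broad_embs, top_k=1)
print(kept)
```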

Hard negatives were then mined using voyage-3 on the specific questions only, with a fixed threshold of 0.75. We also experimented with positive-aware negative mining as described for nvidia/NV-Retriever-v1, but on this dataset it seems to produce negatives that are too easy/distant.
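
The article does not say how the 0.75 threshold is applied. One plausible reading, shown here purely as an assumption, is that candidates scoring at or above the threshold are discarded as likely false negatives, and the remaining candidates are taken hardest-first. The `mine_hard_negatives` helper and the similarity values are hypothetical.

```python
def mine_hard_negatives(similarities, positive_id, max_similarity=0.75, num_negatives=2):
    """Pick the most similar candidates that stay below `max_similarity`.

    `similarities` maps candidate ids to their similarity with the query.
    Treating the threshold as an upper bound is an assumption, not the
    documented behaviour of the original pipeline.
    """
    candidates = [
        (score, cand_id)
        for cand_id, score in similarities.items()
        if cand_id != positive_id and score < max_similarity
    ]
    candidates.sort(reverse=True)  # hardest (most similar) first
    return [cand_id for _, cand_id in candidates[:num_negatives]]

similarities = {
    "page-a": 0.91,  # the positive page
    "page-b": 0.80,  # too close: possibly a false negative
    "page-c": 0.74,
    "page-d": 0.52,
    "page-e": 0.10,
}
negatives = mine_hard_negatives(similarities, positive_id="page-a")
print(negatives)
```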

Download

The training dataset (vdr-multilingual-train 🤗) is now open source and available directly on Hugging Face. It contains 496,167 PDF pages, of which only 280,679 are associated with a filtered query (using the method described above). The remaining images without a query are still used as hard negatives.

Language	# Filtered queries	# Unfiltered queries
English	53,512	94,225
Spanish	58,738	102,685
Italian	54,942	98,747
German	58,217	100,713
French	55,270	99,797
Total	280,679	496,167

The dataset consists of 5 different subsets, each corresponding to a language. You can browse it directly here:

Alternatively, you can download the languages individually by specifying the language subset in load_dataset:

from datasets import load_dataset
italian_dataset = load_dataset("llamaindex/vdr-multilingual-train", "it", split="train")
english_dataset = load_dataset("llamaindex/vdr-multilingual-train", "en", split="train")
french_dataset = load_dataset("llamaindex/vdr-multilingual-train", "fr", split="train")
german_dataset = load_dataset("llamaindex/vdr-multilingual-train", "de", split="train")
spanish_dataset = load_dataset("llamaindex/vdr-multilingual-train", "es", split="train")

Evaluation

The model has been evaluated on the ViDoRe benchmark and on custom-built evaluation sets that test its multilingual capabilities on text-only, visual-only, and mixed page screenshots. The evaluation dataset is also publicly available on Hugging Face (vdr-multilingual-test 🤗).

We made sure that no page from these datasets appears in the training set, to avoid any evaluation contamination. The datasets were collected and generated with the same methods as the training dataset, but with a smaller sample size. The filtering step is done entirely by hand: each query is evaluated, curated, and improved (if necessary) to ensure high data quality.

All evaluations compute NDCG@5 scores using 1536-dimensional vectors and an image resolution that can be represented with at most 768 tokens.
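
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@5 for a single query, assuming binary relevance (a page is either relevant to the query or it is not). The article does not show its evaluation code, so `ndcg_at_k` is a hypothetical helper.

```python
import math

def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """Binary-relevance NDCG@k for one query."""
    relevant = set(relevant_ids)
    # Discounted gain of relevant pages found in the top k results.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Best achievable DCG: all relevant pages ranked first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

first = ndcg_at_k(["p1", "p2", "p3"], ["p1"])
second = ndcg_at_k(["p2", "p1", "p3"], ["p1"])
missed = ndcg_at_k(["p2", "p3", "p4", "p5", "p6", "p1"], ["p1"])
print(first, round(second, 4), missed)
```

A benchmark score is then the mean of this value over all queries, usually reported as a percentage.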

Model	Average	French (text)	French (visual)	French (mixed)
dse-qwen2-2b-mrl-v1	93.5	94.7	90.8	95.1
vdr-2b-multi-v1	95.6	95.6	93.3	97.9
Relative improvement	+2.2%

Model	Average	German (text)	German (visual)	German (mixed)
dse-qwen2-2b-mrl-v1	93.0	93.4	90.0	95.5
vdr-2b-multi-v1	96.2	94.8	95.7	98.1
Relative improvement	+3.4%

Model	Average	Italian (text)	Italian (visual)	Italian (mixed)
dse-qwen2-2b-mrl-v1	95.1	95.1	94.0	96.2
vdr-2b-multi-v1	97.0	96.4	96.3	98.4
Relative improvement	+2%

Model	Average	Spanish (text)	Spanish (visual)	Spanish (mixed)
dse-qwen2-2b-mrl-v1	96.7	97.2	94.7	98.2
vdr-2b-multi-v1	98.1	98.3	96.9	99.1
Relative improvement	+1.4%

Model	Average	English (text)	English (visual)	English (mixed)
dse-qwen2-2b-mrl-v1	98.0	98.3	98.5	97.1
vdr-2b-multi-v1	98.1	97.9	99.1	97.3
Relative improvement	+0.1%

The multilingual model outperforms the base model in every language and every page type, by +2.3% on average. It also performs slightly better on the ViDoRe benchmark (+0.5%). Our fine-tuned vdr-2b-multi-v1 makes big leaps in performance, especially on non-English visual-only or mixed pages. For example, German visual-only retrieval improves NDCG@5 by +6.33% over the base model.
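
The gains quoted in this section appear to be relative changes over the base model's score, not differences in absolute NDCG points. A short sketch that reproduces the German figures from the table above; `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(base_score, new_score):
    """Percentage change of new_score relative to base_score."""
    return (new_score - base_score) / base_score * 100.0

# German visual-only NDCG@5: base model vs. vdr-2b-multi-v1.
german_visual = relative_improvement(90.0, 95.7)
# German average NDCG@5.
german_average = relative_improvement(93.0, 96.2)

print(round(german_visual, 2), round(german_average, 1))
```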

We also trained a version only on the English subset (vdr-2b-v1 🤗). On the full ViDoRe benchmark (evaluated with 768 image tokens), both the multilingual and English-only versions outperform the base model.

Model	Average	shiftproject	government	healthcare	energy	ai	docvqa	arxivqa	tatdqa	infovqa	tabfquad
dse-qwen2-2b-mrl-v1	83.6	79.8	95.7	96.9	92.0	98.2	56.3	85.2	53.9	87.5	90.3
vdr-2b-multi-v1	84.0	82.4	95.5	96.5	91.2	98.5	58.5	84.7	53.6	87.1	92.2
vdr-2b-v1	84.3	83.4	96.9	97.2	92.6	96.8	57.4	85.1	54.1	87.9	91.3

Faster inference

The English-only vdr-2b-v1 model also matches the performance of the base model on the synthetic datasets of the ViDoRe benchmark while using only 30% of the image tokens (768 vs. 2560). This effectively makes inference 3x faster and significantly reduces VRAM usage.

Model	Average	shiftproject	government	healthcare	energy	ai
dse-qwen2-2b-mrl-v1 (2560 image tokens)	93.0	82	96	96.4	92.9	97.5
vdr-2b-v1 (768 image tokens)	93.4	83.4	96.9	97.2	92.6	96.8

Cross-lingual retrieval

Although the model was trained on each language separately, it also improves in cross-lingual retrieval. To test this capability, the queries of the German evaluation set were translated into Italian using DeepL. The document page screenshots remain in the original German.

Model	Average	Italian -> German (text)	Italian -> German (visual)	Italian -> German (mixed)
dse-qwen2-2b-mrl-v1	93.1	92.6	93.5	93.3
vdr-2b-multi-v1	95.3	95.0	95.8	95.1
Relative improvement	+2.3%

The model is significantly better on all document types, with an average improvement of +2.3%. These retrieval capabilities are essential for real-world use cases, especially in linguistically fragmented regions such as Europe. For example, they enable language-independent search over complex multilingual sources such as European binding decisions, instruction manuals, financial asset KIDs, pharmaceutical package leaflets, and many more.

MRL and binary embeddings

This model is trained using Matryoshka Representation Learning (MRL). The loss function used during training is calibrated to track performance across all the trained embedding dimensions, which leads the model to front-load the most important identifying information. This lets you shrink the embedding dimensions according to your scale and budget. To learn more about MRL, this blog post by Hugging Face explains it very well.
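
In practice, shrinking an MRL embedding usually means keeping its leading components and re-normalizing. The sketch below assumes plain prefix truncation followed by L2 normalization; `truncate_embedding` is a hypothetical helper, not part of the model's API, and the short vector stands in for a real 1536-dimensional embedding.

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy vector standing in for a full-size embedding.
full_embedding = [3.0, 4.0, 12.0, 0.5, -0.25, 0.1]

small = truncate_embedding(full_embedding, dim=2)
print(small)
```

Re-normalizing matters when scores are computed as dot products, since a truncated vector no longer has unit length.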

To test the model's retrieval ability at different vector dimensions, it was evaluated on the Italian -> German cross-lingual benchmark.

NDCG@5 (floating point)

Model	Average	Italian -> German (text)	Italian -> German (visual)	Italian -> German (mixed)
1536 dimensions
dse-qwen2-2b-mrl-v1	93.1	92.6	93.5	93.3
vdr-2b-multi-v1	95.3	95.0	95.9	95.1
Relative improvement	+2.3%
1024 dimensions
dse-qwen2-2b-mrl-v1	92.2	90.9	92.3	93.5
vdr-2b-multi-v1	94.6	93.1	95.7	95.1
Relative improvement	+2.5%
512 dimensions
dse-qwen2-2b-mrl-v1	89.8	87.9	89.4	92.2
vdr-2b-multi-v1	93.0	91.1	93.4	94.5
Relative improvement	+3.4%

NDCG@5 (binary)

Model	Average	Italian -> German (text)	Italian -> German (visual)	Italian -> German (mixed)
1536 dimensions
dse-qwen2-2b-mrl-v1	89.8	88.2	90.3	90.8
vdr-2b-multi-v1	92.3	89.6	94.1	93.3
Relative improvement	+2.8%
1024 dimensions
dse-qwen2-2b-mrl-v1	86.7	84.9	88.2	86.9
vdr-2b-multi-v1	90.8	87.0	92.6	92.8
Relative improvement	+4.6%
512 dimensions
dse-qwen2-2b-mrl-v1	79.2	80.6	81.7	75.4
vdr-2b-multi-v1	82.6	77.7	86.7	83.3
Relative improvement	+4.0%

The 1024-dimensional floating-point vectors offer a very good balance between quality and size. They are about 30% smaller, but still retain 99% of the retrieval performance. The same is true for the 1536-dimensional binary vectors, where the number of bytes per vector is reduced by a factor of 10 while 97% of the retrieval quality is retained. Interestingly, the 1536-dimensional binary vectors almost match the performance of the base model's 1536-dimensional floating-point vectors.
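
Binary embeddings keep one bit per dimension. The article does not describe its quantization scheme, so the sketch below assumes the common approach of thresholding each component at zero and comparing vectors by Hamming distance. `binarize` and `hamming_distance` are hypothetical helpers, and the 4-dimensional vectors are toy data.

```python
def binarize(embedding):
    """Pack one bit per component into an int: 1 if the value is positive."""
    bits = 0
    for value in embedding:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits

def hamming_distance(a, b):
    """Number of differing bits between two packed vectors."""
    return bin(a ^ b).count("1")

query_bits = binarize([0.5, -0.2, 0.1, -0.9])
close_page_bits = binarize([0.4, -0.1, -0.3, -0.7])
far_page_bits = binarize([-0.6, 0.3, -0.2, 0.8])

print(hamming_distance(query_bits, close_page_bits),
      hamming_distance(query_bits, far_page_bits))
```

A smaller Hamming distance means a closer match, so ranking by ascending distance plays the role that descending cosine similarity plays for floating-point vectors.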

Conclusions and next steps

We believe vdr-2b-multi-v1 and vdr-2b-v1 will prove useful to many users.

Our multilingual model is the first of its kind, significantly improving performance in multilingual and cross-language scenarios and making retrieval more efficient and faster than ever before thanks to MRL and binary quantization. We believe this will open up new use cases and opportunities, especially in linguistically dispersed regions such as Europe.

Its English-only version is a major improvement over the base model: it can now embed documents 3x faster with much less VRAM, while maintaining the same (or better) retrieval quality.

All of this is possible thanks to the new vdr-multilingual-train dataset. With 500,000 high-quality samples, it is the largest open-source multilingual synthetic dataset for visual document retrieval.

Future work will explore how our models perform when adapted to new and specific domains. This is still in the early stages of development and more work needs to be done before results can be published, but early tests already seem to show impressive retrieval gains with very little data and computational resources.

Stay tuned for future updates!
