Jina Embeddings v3 is our latest 570-million-parameter frontier text embedding model, achieving state-of-the-art results on multilingual and long-text retrieval tasks.
v3 is not only more powerful, it also ships a number of exciting new features. If you are still using Jina Embeddings v2, released in October 2023, we strongly recommend migrating to v3 as soon as possible.
Let's start with a brief overview of the highlights of Jina Embeddings v3:
- Support for 89 languages: goes beyond v2's handful of bilingual models to deliver true multilingual text processing.
- Built-in LoRA adapters: v2 is a generic embedding model, while v3 ships built-in LoRA adapters that generate vectors optimized specifically for your retrieval, classification, clustering, and other tasks, for better performance.
- More accurate long-text retrieval: v3 combines an 8192-token context length with the late chunking technique, which generates chunk vectors enriched with contextual information and can significantly improve the accuracy of long-text retrieval.
- Flexible, controllable vector dimensions: thanks to Matryoshka Representation Learning (MRL), v3's output dimensionality can be adjusted freely to strike a balance between performance and storage space, avoiding the high storage overhead of high-dimensional vectors.
Link to open source model: https://huggingface.co/jinaai/jina-embeddings-v3
Model API Link: https://jina.ai/?sui=apikey
Link to modeling paper: https://arxiv.org/abs/2409.10173
Quick Migration Guide
- v3 is a brand-new model, so v2 vectors and indexes cannot be reused directly; you will need to re-index your data.
- In most scenarios (96%), v3 significantly outperforms v2; v2 only occasionally ties or slightly beats v3 on the English summarization task. Given v3's multilingual support and advanced features, v3 should be preferred in most scenarios.
- The v3 API adds three parameters: `task`, `dimensions`, and `late_chunking`. Their exact usage is covered in detail below.
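For reference, here is a minimal sketch of a single request that exercises all three new parameters. The endpoint URL, headers, and response parsing are assumptions based on the standard Jina Embeddings API, so adjust them for your own setup:

# Minimal sketch: one request using all three new v3 parameters.
import requests

url = "https://api.jina.ai/v1/embeddings"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <YOUR_API_KEY>"  # replace with your API key
}
data = {
    "model": "jina-embeddings-v3",
    "task": "retrieval.passage",   # selects a task-specific LoRA adapter
    "dimensions": 768,             # output dimensionality (default is 1024)
    "late_chunking": True,         # context-aware chunk vectors
    "input": ["Long document text goes here..."]
}
response = requests.post(url, headers=headers, json=data)
embeddings = [item["embedding"] for item in response.json()["data"]]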
Dimensional adjustment
- v3 outputs 1024-dimensional vectors by default, whereas v2 outputs 768. Thanks to Matryoshka Representation Learning, v3 can in principle output vectors of any dimension; developers can use the `dimensions` parameter to flexibly control the output dimensionality and find the best balance between storage cost and performance.
- If your previous project was built on the v2 API, you cannot simply change the model name to `jina-embeddings-v3`, because the default dimensionality has changed. To keep your data structures the same size as with v2, set `dimensions=768`.
- Even at the same dimensionality, the vectors of v2 and v3 have completely different distributions in semantic space, so they cannot be used interchangeably; see the sketch below.
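A hedged sketch of why re-indexing is required, reusing the `url` and `headers` from the sketch above: encode the same sentence with v2 and with v3 at `dimensions=768`. The shapes match, but the cross-model cosine similarity carries no meaning, so a v2 index cannot serve v3 queries:

# Sketch: same dimensionality, different semantic spaces.
import numpy as np
import requests

def embed(model, text, **extra):
    # hypothetical helper; merges extra API parameters into the request
    payload = {"model": model, "input": [text], **extra}
    r = requests.post(url, headers=headers, json=payload)
    return np.array(r.json()["data"][0]["embedding"])

v2_vec = embed("jina-embeddings-v2-base-en", "May the Force be with you.")
v3_vec = embed("jina-embeddings-v3", "May the Force be with you.", dimensions=768)

# Both are 768-dimensional, but this cross-model similarity is meaningless:
cos = v2_vec @ v3_vec / (np.linalg.norm(v2_vec) * np.linalg.norm(v3_vec))
print(v2_vec.shape, v3_vec.shape, round(float(cos), 4))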
Model Replacement
- v3's strong multilingual support fully replaces the v2 bilingual models (v2-base-de, v2-base-es, v2-base-zh).
- For pure code tasks, jina-embeddings-v2-base-code is still the best choice. Tests show it scores 0.7753, compared with 0.7537 for v3's generic vectors (no task set) and 0.7564 with the LoRA adapter, giving the v2 code model a performance lead of about 2.8% over v3.
Task parameters
- The v3 API generates good-quality generic vectors when the `task` parameter is unspecified, but we highly recommend setting `task` according to the specific task type to get a better vector representation.
- To make v3 approximate the behavior of v2, use `task="text-matching"`. That said, we recommend trying the different task options to find the best fit rather than treating `text-matching` as a universal setting.
- If your project uses v2 for information retrieval, switch to v3's retrieval task types (`retrieval.passage` and `retrieval.query`) for better retrieval results.
Other considerations
- For brand new task types (which are rare), try setting the task parameter to None as a starting point.
- If you used the label-rewriting trick in v2 for zero-shot classification tasks, in v3 you can simply set `task="classification"` and obtain similar results, because v3's vector representation is already optimized for classification; see the sketch after this list.
- Both v2 and v3 support context lengths of up to 8192 tokens, but v3 is more efficient thanks to FlashAttention2, which also lays the groundwork for v3's late chunking feature.
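To illustrate the zero-shot classification point above, here is a minimal sketch under the same API assumptions as before (the label phrasing is hypothetical): encode the candidate labels and the texts with `task="classification"` and assign each text the nearest label by cosine similarity:

# Sketch: zero-shot classification via v3 classification vectors.
import numpy as np
import requests

labels = ["positive review", "negative review"]  # hypothetical label set
texts = ["Star Wars is an unrivaled cultural phenomenon."]

data = {
    "model": "jina-embeddings-v3",
    "task": "classification",
    "input": labels + texts
}
resp = requests.post(url, headers=headers, json=data).json()
vecs = np.array([item["embedding"] for item in resp["data"]])
label_vecs, text_vecs = vecs[:len(labels)], vecs[len(labels):]

for text, tv in zip(texts, text_vecs):
    sims = label_vecs @ tv / (np.linalg.norm(label_vecs, axis=1) * np.linalg.norm(tv))
    print(text, "->", labels[int(np.argmax(sims))])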
Late Chunking
- v3 introduces late chunking: using the full 8192-token context, the model first encodes the whole text and only then splits it into chunks, so every chunk vector carries contextual information and retrieval naturally becomes more accurate.
- `late_chunking` is currently only available through the API; if you run the model locally, you won't be able to use this feature for now.
- With `late_chunking` enabled, the text in each request cannot exceed 8192 tokens, since that is the most v3 can process at once; see the token-count sketch after this list.
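One way to respect the 8192-token limit is to count tokens client-side before enabling `late_chunking`. The sketch below assumes the tokenizer of the open-source checkpoint matches the one the API uses, which we have not verified here:

# Sketch: check the total token count before sending a late_chunking request.
# Loading may require trust_remote_code depending on your transformers version.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")

def fits_v3_context(texts, limit=8192):
    total = sum(len(tokenizer.encode(t)) for t in texts)
    return total <= limit

If this returns False, split the input across multiple requests or disable `late_chunking` for that batch.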
Performance and speed
- In terms of speed, even though v3 has three times as many parameters as v2, inference is faster than v2, or at least on par, mainly thanks to FlashAttention2.
- Not all GPUs support FlashAttention2. v3 will still run if you're using a GPU that doesn't, but it may be slightly slower than v2.
- When using the API, factors such as network latency, rate limitations, and availability zones also affect latency, so the API latency does not fully reflect the true performance of the v3 model.
Unlike v2, Jina Embeddings v3 is licensed under CC BY-NC 4.0. v3 can be used commercially via our API, AWS, or Azure; research and non-commercial use is fine. For commercial on-premises deployment, please contact our sales team for licensing:
https://jina.ai/contact-sales
Multi-language support
v3 is currently the industry's leading multilingual vector model, **ranked #2 on the MTEB leaderboard among models with fewer than 1 billion parameters**. It supports 89 languages, covering most of the world's major languages.
These include Chinese, English, Japanese, Korean, German, Spanish, French, Arabic, Bengali, Danish, Dutch, Finnish, Georgian, Greek, Hindi, Indonesian, Italian, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Swedish, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.
If you were using v2's English, English/German, English/Spanish, or English/Chinese models, you now only need to change the `model` parameter and select an appropriate `task` type to switch to v3.
# v2 English-German
data = {
    "model": "jina-embeddings-v2-base-de",
    "input": [
        "The Force will be with you. Always.",
        "Die Macht wird mit dir sein. Immer.",
        "The ability to destroy a planet is insignificant next to the power of the Force.",
        "Die Fähigkeit, einen Planeten zu zerstören, ist nichts im Vergleich zur Macht der Macht."
    ]
}
# v3 Multilingual
data = {
    "model": "jina-embeddings-v3",
    "task": "retrieval.passage",
    "input": [
        "The Force will be with you. Always.",
        "Die Macht wird mit dir sein. Immer.",
        "The force is with you. Forever.",
        "La Forza sarà con te. Sempre.",
        "フォースと共にあらんことを。いつも。"
    ]
}
response = requests.post(url, headers=headers, json=data)
Task-specific vector representation
v2 uses a generic vector representation, i.e., all tasks share the same model. v3 provides vector representations optimized specifically for different tasks (e.g., retrieval, classification, clustering, etc.) to improve performance in specific scenarios.
Selecting a different `task` type effectively tells the model which task-relevant features to extract, producing a vector representation better suited to the task at hand.
The following is an example of the Lightsaber Repair Knowledge Base, demonstrating how to migrate v2 code to v3 and experience the performance gains from task-specific vector representations:
# In real projects we would use a larger dataset; this is just an example
knowledge_base = [
    "Why is my lightsaber blade flickering? A flashing blade may indicate a low battery or an unstable crystal. Please charge the battery and check the stability of the crystal. If the flashing persists, the crystal may need to be recalibrated or replaced.",
    "Why is my blade dimmer than before? A dimmer blade could mean a low battery or a problem with power distribution. First, please charge the battery. If the problem persists, the LED may need to be replaced.",
    "Can I change my lightsaber blade color? Many lightsabers allow blade color to be customized by changing crystals or changing the color settings using the control panel on the hilt. Please refer to your model manual for detailed instructions.",
    "What do I do if my lightsaber overheats? Overheating can be caused by prolonged use. Turn off your lightsaber and allow it to cool for at least 10 minutes. If it overheats frequently, it may indicate an internal problem that needs to be checked by a technician.",
    "How do I charge my lightsaber? Connect your lightsaber to the provided charging cable through the port near the hilt, making sure to use an official charger to avoid damaging the battery and electronics.",
    "Why is my lightsaber making strange noises? Strange noises may indicate a problem with the sound board or speakers. Try turning off your lightsaber and turning it back on. If the problem persists, contact our support team for a replacement sound board."
]
query = "Lightsabers are too dark."
For v2, there is only one task (text matching), so we only need one example code block:
# v2 code: encode the knowledge base and the query with the text-matching model
data = {
    "model": "jina-embeddings-v2-base-en",
    "normalized": True,  # note: this parameter is no longer required in v3
    "input": knowledge_base
}
docs_response = requests.post(url, headers=headers, json=data)

data = {
    "model": "jina-embeddings-v2-base-en",
    "input": [query]
}
query_response = requests.post(url, headers=headers, json=data)
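To turn these two responses into an actual ranking, a small helper (assuming the `{"data": [{"embedding": ...}]}` response format used throughout) scores each knowledge-base entry against the query by cosine similarity:

# Sketch: rank knowledge-base entries against the query by cosine similarity.
import numpy as np

doc_vecs = np.array([d["embedding"] for d in docs_response.json()["data"]])
query_vec = np.array(query_response.json()["data"][0]["embedding"])

sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
for idx in np.argsort(-sims):  # highest similarity first
    print(f"{sims[idx]:.4f}  {knowledge_base[idx][:60]}...")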
v3 provides vector representations optimized for specific tasks, including retrieval, separation, classification, and text matching.
Vector representation for retrieval tasks
We demonstrate the difference between v2 and v3 on text retrieval tasks, using the simple lightsaber-repair knowledge base above as an example.
For semantic retrieval tasks, v3 introduces asymmetric encoding: documents are encoded with `retrieval.passage` and queries with `retrieval.query`, which improves retrieval performance and accuracy.
Document encoding: `retrieval.passage`
data = {
    "model": "jina-embeddings-v3",
    "task": "retrieval.passage",  # the "task" parameter is new in v3
    "late_chunking": True,
    "input": knowledge_base
}
docs_response = requests.post(url, headers=headers, json=data)
Query encoding: `retrieval.query`
data = {
    "model": "jina-embeddings-v3",
    "task": "retrieval.query",
    "late_chunking": True,
    "input": [query]
}
query_response = requests.post(url, headers=headers, json=data)
Note: the code above enables `late_chunking`, which improves the encoding of long text; we cover it in detail later.
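With the documents in `docs_response` and the query in `query_response`, the same cosine-similarity ranking from the v2 example applies verbatim to the v3 vectors:

# Sketch: rank v3 passage vectors against the v3 query vector.
import numpy as np

doc_vecs = np.array([d["embedding"] for d in docs_response.json()["data"]])
query_vec = np.array(query_response.json()["data"][0]["embedding"])
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(knowledge_base[int(np.argmax(sims))])  # best-matching document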
Let's compare the performance of v2 and v3 on the query "Lightsabers are too dark." Ranking by cosine similarity, v2 returns a set of matches that are not very relevant.
In contrast, v3 better understands the intent of the query and returns more accurate results related to the appearance of the lightsaber blade.
v3 does more than just retrieval; it also provides several other task-specific vector representations:
Vector representation for separation tasks
v3's `separation` task is optimized for tasks such as clustering and reranking, for example separating different types of entities, which is useful for organizing and visualizing large corpora.
Example: Distinguishing Star Wars and Disney Characters
data = {
    "model": "jina-embeddings-v3",
    "task": "separation",  # use the separation task
    "late_chunking": True,
    "input": [
        "Darth Vader",
        "Luke Skywalker",
        "Mickey Mouse",
        "Donald Duck"
    ]
}
response = requests.post(url, headers=headers, json=data)
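As a hedged follow-up, the returned vectors can be fed to any off-the-shelf clustering algorithm; scikit-learn's KMeans (our choice here, not part of the Jina API) splits the four names into two groups:

# Sketch: cluster the separation vectors with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

vecs = np.array([d["embedding"] for d in response.json()["data"]])
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vecs)
for name, cluster in zip(data["input"], clusters):
    print(cluster, name)  # ideally: Star Wars vs. Disney characters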
Vector representation for classification tasks
v3's `classification` task is optimized for text classification tasks such as sentiment analysis and document categorization, for example sorting reviews into positive and negative.
Example: analyzing the sentiment of Star Wars movie reviews
data = {
    "model": "jina-embeddings-v3",
    "task": "classification",
    "late_chunking": True,
    "input": [
        "Star Wars is a groundbreaking masterpiece that revolutionized the movie industry and redefined science fiction movies forever!",
        "With its stunning visuals, unforgettable characters and legendary narrative, Star Wars is an unrivaled cultural phenomenon.",
        "Star Wars is an overhyped disaster full of shallow characters with no meaningful plot!"
    ]
}
response = requests.post(url, headers=headers, json=data)
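These vectors can also train a lightweight downstream classifier. A sketch using scikit-learn's logistic regression, with hypothetical labels for the three reviews above (1 = positive, 0 = negative):

# Sketch: fit a small classifier on top of the classification vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

vecs = np.array([d["embedding"] for d in response.json()["data"]])
labels = [1, 1, 0]  # hypothetical labels for the three reviews
clf = LogisticRegression().fit(vecs, labels)
print(clf.predict(vecs))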
Vector representation for text matching
v3's `text-matching` task focuses on semantic similarity tasks such as sentence similarity and deduplication, for example filtering out repeated sentences or paragraphs.
Example: Recognizing repetition in Star Wars lines
data = {
    "model": "jina-embeddings-v3",
    "task": "text-matching",
    "late_chunking": True,
    "input": [
        "Luke, I am your father.",
        "No, I am your father.",
        "Fear leads to anger, anger leads to hate, hate leads to the dark side.",
        "Fear leads to anger, anger leads to hate, hate leads to suffering."
    ]
}
response = requests.post(url, headers=headers, json=data)
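A minimal deduplication sketch on top of these vectors: normalize them, compute pairwise cosine similarities, and flag pairs above a threshold (the 0.9 cutoff is a hypothetical value to tune on your own data):

# Sketch: flag near-duplicate lines via pairwise cosine similarity.
import numpy as np

vecs = np.array([d["embedding"] for d in response.json()["data"]])
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sims = vecs @ vecs.T

THRESHOLD = 0.9  # hypothetical cutoff
for i in range(len(sims)):
    for j in range(i + 1, len(sims)):
        if sims[i, j] > THRESHOLD:
            print(f"near-duplicates ({sims[i, j]:.2f}): #{i} <-> #{j}")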
Late Chunking: Improving Long Text Encoding Results
v3 introduces the `late_chunking` parameter. With `late_chunking=True`, the model processes the whole document first and then splits it into chunks, generating chunk vectors that contain complete contextual information. With `late_chunking=False`, the model processes each chunk independently, and the resulting chunk vectors carry no cross-chunk context.
Note:
- With `late_chunking=True`, the total number of tokens per API request cannot exceed 8192, the maximum context length v3 supports. With `late_chunking=False`, the total token count is not limited, subject only to the rate limits of the Embeddings API.
For long-text processing, enabling `late_chunking` can significantly improve encoding quality, because it preserves contextual information across chunks and makes the resulting vector representations more complete and accurate.
We use a chat transcript to evaluate the impact of `late_chunking` on long-text retrieval; a comparison sketch follows the transcript.
history = [
    "Sita, have you decided where you're going for your birthday dinner on Saturday?",
    "I'm not sure, not too familiar with the restaurants here.",
    "We can go online and look at recommendations.",
    "That sounds good, let's do it!",
    "What type of dish do you want for your birthday?",
    "I especially like Mexican or Italian food.",
    "How's the place, Bella Italia? It looks good.",
    "Oh, I've heard about that place! Everyone says it's great there!",
    "Shall we book a table then?",
    "Okay, I think it's going to be perfect! Let's call and make a reservation."
]
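A hedged sketch of this comparison, reusing the `url`, `headers`, and response-format assumptions from earlier: encode the transcript twice, with `late_chunking` off and then on, and rank the turns against the query:

# Sketch: compare retrieval over the chat history with and without late chunking.
import numpy as np
import requests

def encode(texts, task, late_chunking):
    payload = {
        "model": "jina-embeddings-v3",
        "task": task,
        "late_chunking": late_chunking,
        "input": texts,
    }
    r = requests.post(url, headers=headers, json=payload).json()
    return np.array([d["embedding"] for d in r["data"]])

q = encode(["What are some good restaurant recommendations?"],
           "retrieval.query", False)[0]

for flag in (False, True):
    docs = encode(history, "retrieval.passage", flag)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    print(f"late_chunking={flag}: top hit -> {history[int(np.argmax(sims))]}")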
With v2, the query "What are some good restaurant recommendations?" returns results that are not particularly relevant.
With v3 and no late chunking enabled, the results are equally unsatisfactory.
However, with v3 and late chunking enabled, the most relevant result (a good restaurant recommendation) is ranked first.
The search results make it clear that with `late_chunking` enabled, v3 identifies the chat content relevant to the query more accurately and ranks the most relevant result first. This also shows that `late_chunking` can effectively improve the accuracy of long-text retrieval, especially in scenarios that require a deep understanding of contextual semantics.
Using Matryoshka embeddings to balance efficiency and performance
v3's `dimensions` parameter supports flexible control of vector dimensionality: you can adjust the output dimension to your actual needs and strike a balance between performance and storage space.
Smaller vector dimensions can reduce the storage overhead of vector databases and improve retrieval speed, but some information may be lost, resulting in performance degradation.
data = {
    "model": "jina-embeddings-v3",
    "task": "text-matching",
    "dimensions": 768,  # set the vector dimension to 768; the default is 1024
    "input": [
        "The Force will be with you. Always.",
        "The force is with you. Forever.",
        "La Forza sarà con te. Sempre.",
        "フォースと共にあらんことを。いつも。"
    ]
}
response = requests.post(url, headers=headers, json=data)
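As a rough client-side illustration of the Matryoshka property (an assumption about how the `dimensions` parameter behaves server-side, not a documented equivalence): an MRL-trained vector can be truncated to a prefix and renormalized, trading some accuracy for storage:

# Sketch: truncate-and-renormalize, the usual Matryoshka pattern.
import numpy as np

full = np.array(response.json()["data"][0]["embedding"])  # 768-dim per the request above
short = full[:256]                     # keep the leading 256 dimensions
short = short / np.linalg.norm(short)  # renormalize for cosine similarity
print(full.shape, short.shape)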
Frequently asked questions
Q1: What are the advantages of using Late Chunking if I have already chunked the document before vectorization?
A1: The advantage of late chunking over pre-chunking is that the model processes the entire document before splitting it, retaining more complete contextual information. This matters for complex or lengthy documents: because the model has an overall understanding of the document before chunking, retrieval responses are more relevant. With pre-chunking, by contrast, each chunk is processed independently, without the full context.
Q2: Why does v2 have a higher benchmark score than v3 on the pairwise classification task? Do I need to worry?
A2: v2's seemingly higher scores on the pairwise classification task are mainly due to the fact that the average scores are computed differently. v3's test set contains more languages, so its average scores are likely to be lower than v2's. In fact, v3 performs as well as, if not better than, state-of-the-art models such as multilingual-e5 on the pairwise classification task in all languages.
Q3: Does v3 perform better on specific languages supported by the v2 bilingual model?
A3: How v3 compares with the v2 bilingual models on their specific languages depends on the language and the task type. The v2 bilingual models are heavily optimized for specific languages and may therefore perform better on some specific tasks. However, v3 is designed to support a much wider range of multilingual scenarios, with stronger cross-language generalization, and it is optimized for a variety of downstream tasks through task-specific LoRA adapters. As a result, v3 typically achieves better overall performance across multiple languages or in more complex task-specific scenarios such as semantic retrieval and text classification.
If you only need to deal with one specific language supported by the v2 bilingual model (Chinese-English, English-German, Spanish-English) and your task is relatively simple, v2 is still a good choice and may even perform better in some cases.
But if you need to work with multiple languages, or if your task is more complex (e.g., you need to perform semantic retrieval or text categorization), then v3, with its strong cross-language generalization capabilities and optimization strategies based on downstream tasks, is a better choice.
Q4: Why does v2 outperform v3 on summarization tasks, and do I need to worry?
A4: v2 performs better on the summarization task mainly because its architecture is specifically optimized for semantic similarity, which is closely related to how summarization is evaluated. v3 was designed to support a much broader range of tasks, especially retrieval and classification, and is therefore less specialized for summarization than v2.
However, one should not worry too much, as the evaluation of the summarization task currently relies on SummEval, a test that measures semantic similarity and does not fully represent the model's overall ability on the summarization task. Given that v3 performs well on other critical tasks such as retrieval, slight performance differences on the summarization task usually do not have a significant impact on real-world applications.
Summary
Jina Embeddings v3 is our major model upgrade, achieving SOTA on multilingual and long-text retrieval tasks. It ships with a variety of built-in LoRA adapters that can be selected per scenario (retrieval, clustering, classification, matching) for more accurate vectorization results. We strongly recommend migrating to v3 as soon as possible.
That concludes our introduction to Jina Embeddings v3. We hope you found it helpful. If you have any questions, feel free to leave a comment and discuss!