In the RAG domain, the multi-vector model ColBERT represents a document by assigning a separate vector to each of the document's tokens. These independent token vectors improve retrieval accuracy, but they also dramatically increase storage requirements, and the original model supports only English, which limits its application scope.
To address these issues, we have improved ColBERT's architecture and training process, with a particular breakthrough in multilingual processing. The latest Jina-ColBERT-v2 supports 89 languages and introduces customizable output dimensions, which significantly reduces storage requirements and improves the efficiency and accuracy of multilingual retrieval.
Core highlights of the new release
- Performance enhancement: a 6.5% improvement in English retrieval performance over the original ColBERT-v2, and 5.4% over its predecessor, jina-colbert-v1-en.
- Multilingual support: the new version supports 89 languages, including Arabic, Chinese, English, Japanese, Russian, and others, as well as programming languages.
- Customizable output dimensions: the new version uses Matryoshka Representation Learning (MRL) to offer 128-, 96-, and 64-dimensional output vectors, letting users choose the dimensionality that fits their needs.
The full technical report can be found on arXiv: https://arxiv.org/abs/2408.16672
jina-colbert-v2 performance at a glance
In terms of retrieval performance, jina-colbert-v2 shows a significant advantage over its predecessors, both on English retrieval tasks and in multilingual support. It also inherits Jina AI's 8192-token input length, so the advantages of the multi-vector model extend to long documents. Below is a comparison with previous versions, where the core improvements are clearly visible:
| | jina-colbert-v2 | jina-colbert-v1-en | Original ColBERTv2 |
|---|---|---|---|
| Average of English BEIR tasks | 0.521 | 0.494 | 0.489 |
| Multilingual support | 89 languages | English only | English only |
| Output dimensions | 128, 96, or 64 | Fixed 128 | Fixed 128 |
| Maximum query length | 32 tokens | 32 tokens | 32 tokens |
| Maximum document length | 8192 tokens | 8192 tokens | 512 tokens |
| Number of parameters | 560 million | 137 million | 110 million |
| Model size | 1.1 GB | 550 MB | 438 MB |
1. Performance enhancements
In English retrieval tasks, jina-colbert-v2 outperforms the previous generation jina-colbert-v1-en and the original ColBERT v2, approaching the level of AnswerAI-ColBERT-small, a model designed specifically for English.
| Model name | Average score on the English BEIR benchmark | Multilingual support |
|---|---|---|
| jina-colbert-v2 | 0.521 | Multilingual |
| jina-colbert-v1-en | 0.494 | English only |
| ColBERT v2.0 | 0.489 | English only |
| AnswerAI-ColBERT-small | 0.549 | English only |
2. Multilingual support
Jina-ColBERT-v2 is the only ColBERT model that supports multiple languages. It produces very compact embedding vectors and significantly outperforms traditional BM25 retrieval on every language in the MIRACL benchmark.
jina-colbert-v2's training corpus covers 89 languages and includes 450 million weakly supervised, semantically related sentence pairs, question-answer pairs, and query-document pairs. Half of the data is English; the remainder spans 29 different non-English languages, plus 3.0% programming-language data and 4.3% cross-language data.
We also performed additional training on mainstream languages such as Arabic, Chinese, French, German, Japanese, Russian, and Spanish, as well as on programming languages, introducing aligned bilingual corpora so that the model handles cross-language tasks well.
The following figure shows the performance comparison results between Jina-ColBERT-v2 and BM25 on 16 languages in the MIRACL benchmark.
3. Matryoshka Representation Learning
Matryoshka Representation Learning (MRL) is a flexible and efficient training method that supports different output vector dimensions while minimizing accuracy loss. It is implemented by introducing multiple linear heads on top of the model's hidden states, each handling a different output dimensionality. For more details on this technique, see: https://arxiv.org/abs/2205.13147
By default, Jina-ColBERT-v2 generates 128-dimensional embeddings, but you can choose shorter vectors of 96 or 64 dimensions. Even shortening the vectors by 25% or 50% has an almost negligible impact on performance (the drop is less than 1.5%). In other words, whatever vector length you need, the model maintains efficient and accurate performance.
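The sketch below illustrates the core MRL property: a prefix of the full vector is itself a usable embedding. This is illustrative only; with the Jina API you would simply request the smaller size via the `dimensions` parameter.

```python
import numpy as np

def truncate_mrl(vec: np.ndarray, dim: int) -> np.ndarray:
    # MRL-trained embeddings remain usable after prefix truncation;
    # re-normalize so cosine similarity stays well defined.
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

full = np.random.randn(128)     # stand-in for one 128-dim token embedding
small = truncate_mrl(full, 64)  # 64-dim variant: storage halved
print(small.shape)              # (64,)
```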
The table below shows the performance of Jina-ColBERT-v2 on six BEIR benchmarks, measured as nDCG@10 over the top ten results. The performance difference between 128 and 96 dimensions is less than 1%, while the difference between 128 and 64 dimensions is less than 1.5%.
| Output dimension | Average score (nDCG@10 across 6 benchmarks) |
|---|---|
| 128 | 0.565 |
| 96 | 0.558 |
| 64 | 0.556 |
The following figure visualizes the performance of Jina-ColBERT-v2 under different output dimensions.
Choosing smaller output vectors not only saves storage space, but also improves computational speed, especially in scenarios where vectors need to be compared or distances need to be computed, such as vector retrieval systems.
There is also a very noticeable drop in cost. According to Qdrant's Cloud Cost Estimator, storing 100 million documents on AWS with 128-dimensional vectors costs $1,319.24 per month, whereas with 64-dimensional vectors, the cost can be cut right in half, down to $659.62 per month. That's not only a speed boost, but also a significant cost reduction.
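For intuition, here is a back-of-envelope estimate of raw vector storage. The per-document token count is an assumed illustrative figure, not a measured one:

```python
# Rough storage estimate for multi-vector (per-token) embeddings.
n_docs = 100_000_000
tokens_per_doc = 128      # assumption for illustration only
bytes_per_float = 4       # float32

for dim in (128, 64):
    tb = n_docs * tokens_per_doc * dim * bytes_per_float / 1e12
    print(f"{dim} dims: ~{tb:.1f} TB of raw vectors")
```

Halving the dimensionality halves the raw vector footprint, which is why the monthly bill above halves as well.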
How to get started with Jina ColBERT v2
Jina ColBERT v2 is now live on the Jina Search Foundation API, AWS Marketplace, and Azure. It is also open-sourced on Hugging Face under the CC BY-NC-4.0 license for non-commercial use.
- AWS Marketplace: https://aws.amazon.com/marketplace/seller-profile?id=seller-stch2ludm6vgy
- Azure: https://azuremarketplace.microsoft.com/en-gb/marketplace/apps?search=Jina
- Hugging Face: https://huggingface.co/jinaai/jina-colbert-v2
via the Jina Search Foundation API
For Embeddings
Using the Jina Embedding API is the easiest and most direct way to get embeddings from jina-colbert-v2.
There are two key parameters:

- `dimensions`: the output dimension of the embeddings. The default is 128; you can also choose 64.
- `input_type`: determines the input type. `query` inputs are limited to 32 tokens and are automatically truncated; `document` inputs can be up to 8192 tokens. Document encoding can be done offline, so only the query needs to be encoded at query time, which significantly speeds up processing.
The Jina API Key is available at jina.ai/embeddings.
```bash
curl https://api.jina.ai/v1/multi-vector \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -d '{
    "model": "jina-colbert-v2",
    "dimensions": 128,
    "input_type": "document",
    "embedding_type": "float",
    "input": [
      "Enter the text of your document here.",
      "You can send more than one text.",
      "Each text can be up to 8192 tokens long."
    ]
  }'
```
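If you prefer Python, here is an equivalent sketch using the requests library, with the same endpoint and fields as the curl call above (the API key is a placeholder):

```python
import requests

resp = requests.post(
    "https://api.jina.ai/v1/multi-vector",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <YOUR_API_KEY>",  # placeholder
    },
    json={
        "model": "jina-colbert-v2",
        "dimensions": 128,          # or 64
        "input_type": "document",   # use "query" for queries (32-token limit)
        "embedding_type": "float",
        "input": ["Enter the text of your document here."],
    },
)
# Each input text yields one list of token-level vectors.
embeddings = resp.json()["data"][0]["embeddings"]
print(len(embeddings), len(embeddings[0]))  # (number of tokens, 128)
```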
For Reranker
To use jina-colbert-v2 via the Jina Reranker API, pass in a query and multiple documents and get back a relevance score for each document. Construct the request as follows:
```bash
curl https://api.jina.ai/v1/rerank \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -d '{
    "model": "jina-colbert-v2",
    "query": "What is the population of Berlin?",
    "top_n": 3,
    "documents": [
      "The population of Berlin in 2023 increased by 0.7% compared to last year. Thus, by the end of last year, Berlin had about 27,300 more inhabitants than in 2022. The age group with the highest number of inhabitants is the 30 to 40 year olds. Berlin has about 881,000 foreign residents from about 170 countries, with an average age of 42.5 years.",
      "Mount Berlin is a glacier-covered volcano in Marie Byrd Land, Antarctica, about 100 kilometers (62 miles) from the Amundsen Sea. It is a peak about 20 kilometers (12 mi) wide with parasitic craters, and consists of two merged volcanoes: the Berlin volcano, whose crater is 2 kilometers (1.2 mi) wide, and Merrem Peak, about 3.5 kilometers (2.2 mi) away, with a crater 2.5 x 1 kilometers (1.55 mi x 0.62 mi) wide.",
      "Population data by nationality and state for each federal state as of December 31, 2023",
      "With a population of over 4.5 million, the Berlin metropolitan area is the most populous metropolitan area in Germany. With a population of approximately 6.2 million, the Berlin-Brandenburg metropolitan area is the second largest metropolitan area in Germany after the Rhine-Ruhr region and the sixth largest metropolitan area in the European Union (by GDP).",
      "Irving Berlin (born Israel Beilin) was an American composer and songwriter. His music is part of the Great American Songbook. Berlin received many honors, including an Academy Award, a Grammy Award, and a Tony Award.",
      "Berlin is a town in the Capitol Planning Region of the U.S. state of Connecticut. The population was 20,175 at the 2020 census.",
      "Berlin is the capital and largest city of Germany, both in size and population. Its more than 3.85 million inhabitants make it the most populous city in the European Union (by population within the city limits).",
      "\"Berlin, Berlin\" is a TV series produced for ARD that aired on the evening program of the first German TV station from 2002 to 2005. Directors include Franziska Meyer-Price, Christoph Schnee, Sven Unterwaldt, and Titus Selge."
    ]
  }'
```
Note also the `top_n` parameter, which specifies the number of documents to return. If you only need the single best match, set `top_n` to 1.
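To work with the scores programmatically, here is a Python sketch of the same call. The response shape assumed here is the usual Jina Reranker format, a "results" list whose items carry an index and a relevance score; verify against the API reference:

```python
import requests

resp = requests.post(
    "https://api.jina.ai/v1/rerank",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <YOUR_API_KEY>",  # placeholder
    },
    json={
        "model": "jina-colbert-v2",
        "query": "What is the population of Berlin?",
        "top_n": 3,
        "documents": ["doc one", "doc two", "doc three"],  # your candidates
    },
)
# Assumed response shape: "results" sorted by descending relevance_score.
for r in resp.json()["results"]:
    print(r["index"], r["relevance_score"])
```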
For more sample code in Python and other languages, visit https://jina.ai/embeddings, or select jina-colbert-v2 from the drop-down menu at https://jina.ai/reranker/.
Via Stanford ColBERT
If you already use the Stanford ColBERT library, you can seamlessly swap in Jina ColBERT v2: simply specify `jinaai/jina-colbert-v2` as the model source.
```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load Jina ColBERT v2 as a Stanford ColBERT checkpoint
ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())

queries = ["List of your queries"]
query_vectors = ckpt.queryFromText(queries)  # token-level query embeddings
```
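Document encoding works analogously. A minimal sketch continuing from the snippet above, assuming the Stanford ColBERT `Checkpoint` API's `docFromText` method:

```python
# Encode documents offline; docFromText returns token-level document embeddings.
docs = ["List of your texts"]
doc_vectors = ckpt.docFromText(docs)
```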
Via RAGatouille
Jina ColBERT v2 is also integrated into RAGatouille; it can be downloaded and used via the `RAGPretrainedModel.from_pretrained()` method.
```python
from ragatouille import RAGPretrainedModel

# Load Jina ColBERT v2 through RAGatouille
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = ["your text list"]
RAG.index(docs, index_name="your_index_name")  # build a searchable index

query = "your query"
results = RAG.search(query)
```
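In current RAGatouille versions, `search` typically returns a list of dicts with `content`, `score`, and `rank` keys; treat this shape as an assumption and check your installed version:

```python
# Assumed result shape; verify against your RAGatouille version.
for hit in results:
    print(hit["rank"], round(hit["score"], 2), hit["content"][:80])
```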
via Qdrant
Starting with version 1.10, Qdrant supports multi-vectors and late-interaction models, so you can use jina-colbert-v2 with it directly. Whether you deploy Qdrant locally or use its cloud hosting, you can insert documents into a multi-vector collection simply by configuring the multivector_config parameter correctly in the client.
Creating a new collection using the MAX_SIM operation
```python
from qdrant_client import QdrantClient, models

qdrant_client = QdrantClient(
    url="",       # your Qdrant endpoint
    api_key="",   # your Qdrant API key
)

qdrant_client.create_collection(
    collection_name="{collection_name}",
    vectors_config={
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
        )
    },
)
```
⚠️ Correctly setting the `multivector_config` parameter is the key to using ColBERT models in Qdrant.
Inserting documents into a multi-vector collection
```python
import requests
from qdrant_client import QdrantClient, models

url = 'https://api.jina.ai/v1/multi-vector'
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer '  # your Jina API key
}
data = {
    'model': 'jina-colbert-v2',
    'input_type': 'document',
    'embedding_type': 'float',
    'input': [
        'Input your text here',
        'Up to 8192 tokens per text'
    ]
}
response = requests.post(url, headers=headers, json=data)
rows = response.json()["data"]

qdrant_client = QdrantClient(
    url="",
    api_key="",
)

for i, row in enumerate(rows):
    qdrant_client.upsert(
        collection_name="{collection_name}",
        points=[
            models.PointStruct(
                id=i,
                # the named vector must match the "colbert" key used at creation
                vector={"colbert": row["embeddings"]},
                payload={"text": data["input"][i]}
            )
        ]
    )
```
Querying the collection
```python
import requests
from qdrant_client import QdrantClient

url = 'https://api.jina.ai/v1/multi-vector'
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer '  # your Jina API key
}
data = {
    'model': 'jina-colbert-v2',
    'input_type': 'query',
    'embedding_type': 'float',
    'input': [
        "How many labeled inputs does Jina AI's Embedding model support?"
    ]
}
response = requests.post(url, headers=headers, json=data)
vector = response.json()["data"][0]["embeddings"]

qdrant_client = QdrantClient(
    url="",
    api_key="",
)

results = qdrant_client.query_points(
    collection_name="{collection_name}",
    query=vector,
    using="colbert",  # query against the named multi-vector
)
print(results)
```
Summary
Building on its predecessor, jina-colbert-v2 covers 89 languages and offers multiple embedding output dimensions, letting users flexibly balance accuracy against efficiency and save computation time and cost. Visit https://jina.ai to try it now and get 1 million free tokens.