Choosing the right Embedding model is a crucial step when building a RAG system. Here are the key factors I would consider, along with some suggestions, for your reference:
Define application scenarios
First, clarify the specific application scenario and requirements of the RAG system. For example, will it handle text, images, or multimodal data? Different data types may call for different Embedding models. For text data, you can consult the MTEB (Massive Text Embedding Benchmark, a benchmark suite for evaluating text embedding models) leaderboard on Hugging Face to pick a suitable model, or check the leaderboards in the ModelScope community in China.
Generic vs. domain-specific requirements
Second, choose a model based on how general or specialized the task is. If the task is fairly general and does not involve much domain expertise, a generic Embedding model is enough. If the task involves a specific domain (e.g., law, healthcare, education, or finance), choose a model better suited to that domain.
Multilingual support
If the knowledge base in your system needs to support multiple languages, choose a multilingual Embedding model such as BAAI/bge-M3 or bce_embedding (Chinese-English), which perform better in multilingual settings. If your knowledge base contains mainly Chinese data, a model such as iic/nlp_gte_sentence-embedding_chinese-base will usually give better results.
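As a rough illustration of the language check described above, the sketch below estimates how much of a corpus sample is Chinese and returns a matching candidate list. The function names, the grouping keys, and the 0.8 threshold are assumptions made for this example, not part of any library or benchmark; only the model names come from the text.

```python
# Candidate lists taken from the text above; the grouping keys are illustrative.
CANDIDATES = {
    "chinese": ["iic/nlp_gte_sentence-embedding_chinese-base"],
    "multilingual": ["BAAI/bge-M3", "bce_embedding"],
}


def cjk_ratio(samples):
    """Share of non-whitespace characters that are CJK Unified Ideographs."""
    chars = [c for text in samples for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def suggest_candidates(samples, chinese_threshold=0.8):
    """Return Chinese-focused models for mostly-Chinese text, else multilingual ones.

    chinese_threshold is an assumed cutoff, not a value from any benchmark.
    """
    key = "chinese" if cjk_ratio(samples) >= chinese_threshold else "multilingual"
    return CANDIDATES[key]
```

In practice you would pass a random sample of documents from the knowledge base and treat the result only as a starting shortlist to be tested further.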
Performance Evaluation
Use benchmarks such as the MTEB leaderboards to compare the performance of different models. These leaderboards cover multiple languages and task types, and can help you find the models that perform best on a particular kind of task.
Next, consider model size and your resource constraints. Larger models may deliver better quality, but they also increase computational cost and memory requirements. Likewise, larger embedding dimensions can usually carry richer semantic information, but they increase storage and the cost of similarity search. You therefore need to weigh the choice against your actual hardware resources and performance requirements.
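To make the dimension trade-off concrete, here is a back-of-the-envelope sketch of how much memory the raw vectors alone would take. It assumes uncompressed float32 storage (4 bytes per value) and ignores index overhead; the helper names and the example sizes (one million chunks, 768 vs. 1024 dimensions) are illustrative assumptions.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for num_vectors embeddings of length dim (float32 by default)."""
    return num_vectors * dim * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


# Illustrative comparison for one million chunks.
for dim in (768, 1024):
    size = index_size_bytes(1_000_000, dim)
    print(f"dim={dim}: {size} bytes, about {to_gib(size):.2f} GiB")
```

With these assumptions, going from 768 to 1024 dimensions raises raw storage from about 2.86 GiB to about 3.81 GiB for the same number of chunks, before any index overhead is counted.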
Practical testing and validation
Finally, if possible, shortlist 2-3 models and compare them directly. Test and validate them in your real business scenario, measure metrics such as accuracy and recall on your own dataset, and adjust your choice based on the results.
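The comparison above can be sketched as a small evaluation loop. This is a minimal example, assuming you have already run each candidate model over a set of test queries and recorded the ranked document IDs it retrieved; the model names, query IDs, and document IDs below are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def mean_recall_at_k(run, ground_truth, k):
    """Average recall@k over all queries in ground_truth."""
    scores = [recall_at_k(run.get(q, []), rel, k) for q, rel in ground_truth.items()]
    return sum(scores) / len(scores)


# Hypothetical labeled queries: query id -> set of relevant document ids.
ground_truth = {"q1": {"d1", "d2"}, "q2": {"d3"}}

# Hypothetical ranked results from two candidate models.
runs = {
    "model_a": {"q1": ["d1", "d5", "d2"], "q2": ["d4", "d3", "d6"]},
    "model_b": {"q1": ["d5", "d6", "d1"], "q2": ["d7", "d8", "d3"]},
}

for name, run in runs.items():
    print(name, mean_recall_at_k(run, ground_truth, k=2))
```

On this toy data model_a scores 0.75 and model_b scores 0.0 at k=2. With real data you would use queries and relevance labels drawn from your own business scenario and compare the candidates at the k your retriever actually uses.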
Embedding model recommendation
The following five mainstream Embedding models are worth considering when building a RAG system:
BGE Embedding: Developed by the Beijing Academy of Artificial Intelligence (BAAI), it supports multiple languages and comes in several versions, including an efficient reranker. The models are open source under a permissive license and are suitable for tasks such as retrieval, classification, and clustering.
GTE Embedding: Released by Alibaba DAMO Academy and based on the BERT framework, it is suited to scenarios such as information retrieval and semantic similarity, where it performs well.
Jina Embedding: Built by Jina AI's Finetuner team and trained on the Linnaeus-Clean dataset, it is suited to information retrieval and semantic similarity, where it performs well.
Conan-Embedding: An Embedding model optimized for Chinese that reaches SOTA (state-of-the-art) level on C-MTEB. It is especially suitable for RAG systems that need high-precision Chinese semantic representations.
text-embedding-ada-002: Developed by OpenAI and accessed through its API, it provides high-quality text vector representations for a wide range of NLP tasks. The Xenova team publishes a version of its tokenizer that is compatible with the Hugging Face libraries.
Of course, there are also Sentence-BERT, E5-embedding, Instructor, and others. These models perform somewhat differently across scenarios, so choose the one that fits your specific needs and the considerations listed above when building your own RAG system.