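Second, do you have specific requirements? Examples include modality (text only or images; for choosing a multimodal embedding model, see "How to Choose the Right Embedding Model" (《如何选择合适的 Embedding 模型》)) and domain (such as law or medicine).
How do you choose a general-purpose model? The Massive Text Embedding Benchmark (MTEB) leaderboard on HuggingFace lists current proprietary and open-source text embedding models. For each model, MTEB reports several metrics, including parameter count, memory usage, embedding dimension, and maximum token count, as well as scores on tasks such as retrieval and summarization.
Task: At the top of the MTEB leaderboard there are tabs for the various tasks. For a RAG application, the "retrieval" task deserves the most attention, so we can select Retrieval.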
3.1 Dataset
Language | Description |
C/C++ | A general-purpose programming language known for its performance and efficiency. It provides low-level memory manipulation capabilities and is widely used in system/software development, game development, and applications requiring high performance. |
Java | A versatile, object-oriented programming language designed to have as few implementation dependencies as possible. It is widely used for building enterprise-scale applications, mobile applications (especially Android), and web applications due to its portability and robustness. |
Python | A high-level, interpreted programming language known for its readability and simplicity. It supports multiple programming paradigms and is widely used in web development, data analysis, artificial intelligence, scientific computing, and automation. |
JavaScript | A high-level, dynamic programming language primarily used for creating interactive and dynamic content on the web. It is an essential technology for front-end web development and is increasingly used on the server-side with environments like Node.js. |
C# | A modern, object-oriented programming language developed by Microsoft. It is used for developing a wide range of applications, including web, desktop, mobile, and games, particularly within the Microsoft ecosystem. |
SQL | A domain-specific language used in programming and managing relational databases. It is essential for querying, updating, and managing data in databases, and is widely used in data analysis and business intelligence. |
PHP | A server-side scripting language designed primarily for web development. It is embedded into HTML and is widely used for building dynamic web pages and applications, with a strong presence in content management systems like WordPress. |
Golang | A statically typed, compiled programming language designed by Google. Known for its simplicity and efficiency, it is used for building scalable and high-performance applications, particularly in cloud services and distributed systems. |
Rust | A systems programming language focused on safety and concurrency. It provides memory safety without using a garbage collector and is used for building reliable and efficient software, particularly in systems programming and WebAssembly. |
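The code in the following sections reads this table through a variable named df, via df['description'] and row.language. Here is a minimal sketch of how such a table could be loaded, assuming pandas and the column names language and description (both inferred from those accesses, not stated in the text). Only the first two rows are written out.

```python
import pandas as pd

# Column names are an assumption, inferred from the later accesses
# df['description'] and row.language.
rows = [
    (
        "C/C++",
        "A general-purpose programming language known for its performance and "
        "efficiency. It provides low-level memory manipulation capabilities and "
        "is widely used in system/software development, game development, and "
        "applications requiring high performance.",
    ),
    (
        "Java",
        "A versatile, object-oriented programming language designed to have as "
        "few implementation dependencies as possible. It is widely used for "
        "building enterprise-scale applications, mobile applications "
        "(especially Android), and web applications due to its portability "
        "and robustness.",
    ),
    # The remaining seven rows of the table follow the same pattern.
]

df = pd.DataFrame(rows, columns=["language", "description"])
```

With the default integer index, the i in df.iterrows() runs 0, 1, 2, ..., so it can double as the row's position in the embedding list.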
3.2 Creating Embeddings
Generate the corresponding vector embeddings for the dataset above. The code below uses the embedding functions from pymilvus[model].
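import os

from pymilvus import model


def gen_embedding(model_name):
    # Wrap the OpenAI embedding API; the key is read from the environment.
    openai_ef = model.dense.OpenAIEmbeddingFunction(
        model_name=model_name,
        api_key=os.environ["OPENAI_API_KEY"]
    )
    # df holds the dataset from section 3.1 (one description per language).
    docs_embeddings = openai_ef.encode_documents(df['description'].tolist())
    return docs_embeddings, openai_ef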
Then store the generated embeddings in a Milvus collection.
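def save_embedding(docs_embeddings, collection_name, dim):
    # One record per dataset row: id, embedding vector, and language name.
    data = [
        {"id": i, "vector": docs_embeddings[i].data, "text": row.language}
        for i, row in df.iterrows()
    ]
    # Assumed fix: drop an existing collection so it is recreated with the right dimension.
    if milvus_client.has_collection(collection_name=collection_name):
        milvus_client.drop_collection(collection_name=collection_name)
    milvus_client.create_collection(collection_name=collection_name, dimension=dim)
    res = milvus_client.insert(collection_name=collection_name, data=data)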
3.3 Query
def query_results(query, collection_name, openai_ef):
    query_embeddings = openai_ef.encode_queries(query)
    # Retrieve the 4 nearest neighbours and return their 'text' field.
    res = milvus_client.search(
        collection_name=collection_name,
        data=query_embeddings,
        limit=4,
        output_fields=["text"],
    )
    # Map each returned language name to its distance, in rank order.
    result = {}
    for items in res:
        for item in items:
            result[item.get("entity").get("text")] = item.get('distance')
    return result
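To make the return shape concrete, the sketch below runs the same flattening loop over a hand-written response instead of a live Milvus search. The nested list-of-dicts layout mirrors the fields the function reads (entity, text, distance). The helper name flatten_hits is hypothetical, and the distance values are made-up placeholders, not real search output.

```python
# Hypothetical response for a single query: one inner list of hits per query vector.
# The distance values are placeholders, not measured results.
mock_res = [
    [
        {"id": 8, "distance": 0.41, "entity": {"text": "Rust"}},
        {"id": 0, "distance": 0.39, "entity": {"text": "C/C++"}},
        {"id": 7, "distance": 0.37, "entity": {"text": "Golang"}},
        {"id": 1, "distance": 0.36, "entity": {"text": "Java"}},
    ]
]


def flatten_hits(res):
    """Collapse nested search hits into a {text: distance} dict."""
    result = {}
    for items in res:
        for item in items:
            result[item.get("entity").get("text")] = item.get("distance")
    return result


result = flatten_hits(mock_res)
ranked = list(result)  # dicts keep insertion order, so this is the rank order
```

Because Python dicts preserve insertion order, the keys come back in rank order and can be compared directly against a list of relevant items. Note that hits from several queries would be merged into one dict, so this shape suits one query per call.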
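3.4 Evaluating Embedding Model Performance
We use two OpenAI embedding models, text-embedding-3-small and text-embedding-3-large.
Precision measures the share of retrieved results that are truly relevant, that is, how many of the returned results are relevant to the search query.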
Precision = TP / (TP + FP)
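Here, True Positives (TP) are the retrieved items that are truly relevant to the query, and False Positives (FP) are the retrieved items that are not relevant.
Recall measures how much of the relevant content in the entire dataset was successfully retrieved.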
Recall = TP / (TP + FN)
Here, False Negatives (FN) are the relevant items that are not included in the final result set.
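The two formulas can be sketched as a small helper that takes the retrieved items and the set of relevant items for one query. The function name precision_recall is an illustrative choice, not part of the original code. The sample inputs are the text-embedding-3-small results and the relevant items for the query "auto garbage collection" reported in this section.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    tp = len(retrieved_set & relevant_set)  # relevant items that were retrieved
    fp = len(retrieved_set - relevant_set)  # retrieved items that are not relevant
    fn = len(relevant_set - retrieved_set)  # relevant items that were missed
    precision = tp / (tp + fp) if retrieved_set else 0.0
    recall = tp / (tp + fn) if relevant_set else 0.0
    return precision, recall


retrieved = ["Rust", "C/C++", "Golang", "Java"]
relevant = ["Java", "Python", "JavaScript", "Golang"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.5
```

Here TP = 2 (Golang, Java), FP = 2 (Rust, C/C++), and FN = 2 (Python, JavaScript), which gives 0.5 for both metrics.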
Query 1: auto garbage collection
Relevant items: Java, Python, JavaScript, Golang
Rank | text-embedding-3-small | text-embedding-3-large |
1 | ❎ Rust | ❎ Rust |
2 | ❎ C/C++ | ❎ C/C++ |
3 | ✅ Golang | ✅ Java |
4 | ✅ Java | ✅ Golang |
Precision | 0.50 | 0.50 |
Recall | 0.50 | 0.50 |
Query 2: suite for web backend server development
Relevant items: Java, JavaScript, PHP, Python (this answer involves subjective judgment)
Rank | text-embedding-3-small | text-embedding-3-large |
1 | ✅ PHP | ✅ JavaScript |
2 | ✅ Java | ✅ Java |
3 | ✅ JavaScript | ✅ PHP |
4 | ❎ C# | ✅ Python |
Precision | 0.75 | 1.0 |
Recall | 0.75 | 1.0 |
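As a worked check of the numbers above, take text-embedding-3-small on Query 2: three of the four returned items are relevant (PHP, Java, JavaScript), one is not (C#), and one relevant item is missed (Python), so TP = 3, FP = 1, FN = 1.

```latex
\text{Precision} = \frac{TP}{TP + FP} = \frac{3}{3 + 1} = 0.75,
\qquad
\text{Recall} = \frac{TP}{TP + FN} = \frac{3}{3 + 1} = 0.75
```

Precision and recall coincide in both queries because each search returns four results (limit=4) and each query has exactly four relevant items, so TP + FP = TP + FN = 4. With a different limit or a different number of relevant items, the two metrics would generally diverge.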
For these two queries, we compared the two embedding models, text-embedding-3-small and text-embedding-3-large, using precision and recall.