
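Ragas: Evaluating RAG Retrieval QA Accuracy and Answer Relevance

Overview

Ragas is a tool built specifically for evaluating and optimizing retrieval-augmented generation (RAG) systems. It analyzes the relationship between the query, the retrieved context, and the generated answer, and reports five metrics: faithfulness, answer relevance, context relevance, context recall, and context precision. These metrics help developers check that generated answers are accurate, relevant, and grounded in the retrieved content, whether they are building a new system or tuning an existing one.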

 

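Features

  • Faithfulness: measures how faithful the generated answer is to the retrieved context.
  • Answer relevance: measures how relevant the generated answer is to the original query.
  • Context relevance: measures how closely the retrieved information matches the question.
  • Context recall: checks whether the retrieved context contains all the information needed to answer the question.
  • Context precision: measures how well the relevant context is ranked within the retrieved results.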

 

Usage

Installation

  1. Install from PyPI
   pip install ragas
  2. Install from source
   pip install git+https://github.com/explodinggradients/ragas

Quick start

  1. Import the required modules
   from langchain_openai import ChatOpenAI
   from ragas import SingleTurnSample
   from ragas.llms import LangchainLLMWrapper
   from ragas.metrics import AspectCritic
  2. Prepare the test data
   test_data = {
       "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
       "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter."
   }
  3. Evaluate the RAG system
   # Wrap an OpenAI chat model as the evaluator LLM (reads OPENAI_API_KEY from the environment).
   evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
   metric = AspectCritic(name="summary_accuracy", llm=evaluator_llm, definition="Verify if the summary is accurate.")
   # Top-level await works in a notebook; in a script, wrap the call with asyncio.run().
   await metric.single_turn_ascore(SingleTurnSample(**test_data))

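Detailed workflow

  1. Faithfulness
    • Use the Ragas faithfulness metric to check whether the generated answer is faithful to the retrieved context.
    • Call the corresponding evaluation function to obtain the faithfulness result and analyze it.
  2. Answer relevance
    • Use the Ragas answer relevance metric to measure how relevant the generated answer is to the original query.
    • Configure the relevance evaluation parameters to tailor how answer relevance is assessed.
  3. Context relevance
    • Use the Ragas context relevance metric to measure how closely the retrieved information matches the question.
    • Call the relevance evaluation function to obtain the context relevance result.
  4. Context recall
    • Use the Ragas context recall metric to check whether the retrieved context contains all the information needed to answer the question.
    • Set the recall evaluation parameters to verify that the retrieved context fully covers what the question requires.
  5. Context precision
    • Use the Ragas context precision metric to measure how well the relevant context is ranked within the retrieved results.
    • Configure the precision evaluation parameters to improve the ranking of retrieved results.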

 

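Notes on evaluating QA with Ragas

A retrieval-based QA pipeline generally has two components:
1. Retriever: retrieves the information most relevant to answering the query.
2. Generator: uses the retrieved information to generate the answer.

Evaluating QA accuracy has long been a hard problem. ROUGE (A Package for Automatic Evaluation of Summaries) and BLEU (a method for automatic evaluation of machine translation) perform poorly for this purpose.

Current approach: use a strong LLM for reference-free evaluation, which is why Ragas deserves a closer look.

Ragas aims to work around the limitations of using LLMs to evaluate your QA pipeline, while providing actionable metrics that need as little annotated data as possible and are cheaper and faster to compute.

Ragas is a framework that helps you evaluate these different aspects of a QA pipeline, with a dedicated metric for each aspect.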


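Metrics for evaluating retrieval: Ragas provides context_relevancy and context_recall, which measure the performance of your retrieval system.

Metrics for evaluating generation: Ragas provides faithfulness, which measures hallucination, and answer_relevancy, which measures how well the answer addresses the question.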

faithfulness

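Measures the factual accuracy of the generated answer against the provided context.

This is done in two steps.

Step 1: given a question and the generated answer, Ragas uses an LLM to extract the statements the answer makes. This yields a list of statements whose validity has to be checked. Step 2: given that list and the retrieved context, Ragas uses an LLM to check whether each statement is supported by the context. The number of supported statements divided by the total number of statements in the generated answer is the score for the example.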
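The two steps reduce to a single ratio. As a compact restatement, with S standing for the set of statements extracted from the answer and C for the retrieved context (both symbols are notation introduced here, not names used by Ragas):

```latex
\[
\text{faithfulness} = \frac{\left|\{\, s \in S : s \text{ is supported by } C \,\}\right|}{|S|}
\]
```

For instance, an answer with 6 extracted statements of which 5 are supported scores 5/6, roughly 0.83.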

Step 1:

The first prompt:

Given a question and answer, create one or more statements from answer.
question: Who was Albert Einstein and what is he best known for?
answer: He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements:
Albert Einstein was born in Germany.
Albert Einstein was best known for his theory of relativity.
question: Cadmium Chloride is slightly soluble in this chemical, it is also called what?
answer: alochol
statements:
Cadmium Chloride is slightly soluble in alcohol.
question: Were Shahul and Jithin of the same nationality?
answer: They were from different countries.
statements:
Shahul and Jithin were from different countries.
question:膳食分析是什么?
answer: 膳食分析是一种评估人们摄入的食物的方法,通过详细分析一周内所吃的食物,来确定人们摄入的营养素种类和数量。它可以揭示人们是否缺乏某些营养素,以及食物中是否存在一些对健康有害的物质,如糖、盐、咖啡因、酒精、食品添加剂和防腐剂。膳食分析可以提供关于脂肪、碳水化合物、蛋白质和热量的摄入量的评估。然而,膳食分析并不能全面考虑到食物中营养素含量的变异性、个人的营养素需求,以及营养素的吸收和利用情况。因此,它可能并不是评估个人的确切营养状况最准确的方法。
statements:

Result 1:

['膳食分析是一种评估人们摄入的食物的方法。', 
'膳食分析可以确定人们摄入的营养素种类和数量。', 
'膳食分析可以揭示人们是否缺乏某些营养素。', 
'膳食分析可以揭示食物中是否存在一些对健康有害的物质。', 
'膳食分析可以提供关于脂肪、碳水化合物、蛋白质和热量的摄入量的评估。', 
'膳食分析可能并不是评估个人的确切营养状况最准确的方法。']

Step 2:

The second prompt:

Prompt: Natural language inference
Consider the given context and following statements, then determine whether they are supported by the information present in the context.Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.
Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
statements:
1. John is majoring in Biology.
2. John is taking a course on Artificial Intelligence.
3. John is a dedicated student.
4. John has a part-time job.
5. John is interested in computer programming.
Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology.  Verdict: No.
2. John is taking a course on Artificial Intelligence.
Explanation: The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI. Verdict: No.
3. John is a dedicated student.
Explanation: The prompt states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication. Verdict: Yes.
4. John has a part-time job.
Explanation: There is no information given in the context about John having a part-time job. Therefore, it cannot be deduced that John has a part-time job.  Verdict: No.
5. John is interested in computer programming.
Explanation: The context states that John is pursuing a degree in Computer Science, which implies an interest in computer programming. Verdict: Yes.
Final verdict for each statement in order: No. No. Yes. No. Yes.
context:
你想要达到什么样的健康程度?如果你想了解自己精神和身体的全部潜在能力,确定你的最佳营养需求是非常必要的。但是如果你的营养需求很特别,那么你该如何发现这一点呢?从1980年开始,我制定并完善了一套分析人体营养需求的体系,此体系建立在评估影响个人营养需求的主要因素的基础上。此体系已经在10万人当中进行了测试并得到了验证,现在已被世界各国的临床营养学家所应用。无数人已经从中获益,因此,我知道从这个体系中可以得到什么样的结果。这些结果包括:更敏锐的思维、更好的记忆力、更强的体力、更好的体重控制效果、降低胆固醇含量、使得病情好转。尽管多数被诊断为有病的人已经受益于个人健康计划,但这个健康计划是以预防为主要目标,而不是以治病为主的。如果你正在进行医药治疗,请确认这个营养补充方案与你接受的治疗措施是否能够很好地相容。影响你营养需求的因素,至少有8项因素会影响你的营养需求量。年龄、性别以及运动量都是很常见的影响因素,但污染、压力、基因遗传、你过去的健康状况,当然还有膳食所能提供的营养素以及反营养物质,却不是能够轻易了解的。但是所有的这些细节以及更多的因素,都必须纳入考虑当中。通过4种分析方法——膳食分析、生化分析、症状分析、生活方式分析,可以了解这些细节。膳食分析这种方式似乎是从容易着手的地方开始:搞清楚人们摄入的食物是什么,就能发现缺乏的营养素是什么。
但不幸的是,如果对一周内所吃的食物进行详细分析,并不能综合考虑到食物营养素含量的变异性、个人的营养素需求,或者营养素是否能被吸收、吸收后的营养素又是否能被有效地利用等情况。我见过许多人的饮食质量很高,却依然有维生素缺乏的症状。对其中绝大部分人来说,问题在于营养素吸收不良。这些可变因素,使得某些通过计算机进行的膳食分析不如设想的准确。膳食分析对于评估那些能影响到人们营养素需求的食物很有用处,比如糖、盐、咖啡、茶、酒精、食品添加剂和防腐剂。另一些因素,如脂肪、碳水化合物、蛋白质和热量的摄入量,也可由膳食分析进行评估。诸如头发矿物质分析与维生素血液分析这样的实验,可以提供关于体内生物化学状态的确切信息,也能让营养顾问了解你身体的确切营养状况。但并不是所有这些实验都能对规划你的营养方案提供有用的信息。为了让这些信息尽可能精确,任何关于维生素或矿物质的检测都必须反映出营养素在体内发挥作用的能力。例如,铁是构成血红细胞的必需元素,有助于体内氧的运输。通过测定细胞中铁的营养状况,可能会对你的铁需求有一个很好的估测。从另一方面来讲,维生素氏在血液中没有像铁那样的直接作用。它主要作用于大脑当中的化学物质5-羟色胺,或在甲基化反应中起作用,降低同型半胱氨酸含量。所以,仅仅测定血液中的维生素B6,并不能真实地反映你是否获得了充足的营养素供给以及体内的维生素B6是否运转正常。相比之下,检测同型半胱氨酸浓度等项目意义更大。如果你已经获得了充足的维生素、维生素B和叶酸,那么你的同型半胱氨酸浓度就很低,但假如同型半胱氨酸浓度很高,则表明你需要更多的其中一种或多种营养素。我们称这种测试方式为功能测试,因为这个测试能够精确地测定你体内生化反应的某项特殊功能是否运转正常。
statements:
1.膳食分析是一种评估人们摄入的食物的方法。
2.膳食分析可以确定人们摄入的营养素种类和数量。
3.膳食分析可以揭示人们是否缺乏某些营养素。
4.膳食分析可以揭示食物中是否存在一些对健康有害的物质。
5.膳食分析可以提供关于脂肪、碳水化合物、蛋白质和热量的摄入量的评估。
6.膳食分析可能并不是评估个人的确切营养状况最准确的方法。
Answer:

Result 2:

1. 膳食分析是一种评估人们摄入的食物的方法.
explanation: the context mentions that dietary analysis is a method to understand what food people consume. verdict: yes.
2. 膳食分析可以确定人们摄入的营养素种类和数量.
explanation: the context mentions that dietary analysis can evaluate the intake of nutrients such as sugar, salt, coffee, tea, alcohol, food additives, and preservatives. it does not explicitly state that it can determine the types and quantities of nutrients consumed. verdict: no.
3. 膳食分析可以揭示人们是否缺乏某些营养素.
explanation: the context mentions that dietary analysis can be useful in assessing the intake of nutrients that can affect people's nutritional needs. this implies that it can reveal if someone is lacking certain nutrients. verdict: yes.
4. 膳食分析可以揭示食物中是否存在一些对健康有害的物质.
explanation: the context mentions that dietary analysis can be useful in evaluating the intake of substances that can affect people's nutritional needs, such as food additives and preservatives. this implies that it can reveal the presence of harmful substances in food. verdict: yes.
5. 膳食分析可以提供关于脂肪、碳水化合物、蛋白质和热量的摄入量的评估.
explanation: the context mentions that the intake of factors such as fat, carbohydrates, protein, and calories can be evaluated through dietary analysis. verdict: yes.
6. 膳食分析可能并不是评估个人的确切营养状况最准确的方法.
explanation: the context mentions that while dietary analysis can provide useful information, it may not be the most accurate method for assessing an individual's exact nutritional status. verdict: yes.
final verdict for each statement in order: yes. no. yes. yes. yes. yes.

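Locate the line that starts with "final verdict for each statement in order:", which gives yes. no. yes. yes. yes. yes.

Compute the score:

output = "yes. no. yes. yes. yes. yes."
verdicts = [v.strip() for v in output.split(".") if v.strip()]
# Count the statements judged as not supported by the context.
unsupported = sum(0 if "yes" in v else 1 for v in verdicts)
# faithfulness = 1 - unsupported / number of statements
score = 1 - unsupported / len(verdicts)
# score = 1 - 1/6 = 0.8333333333333334

This is the final faithfulness score.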
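The locate-and-score step can also be sketched as one function that takes the full LLM answer. This is a minimal illustration, not the Ragas implementation: the function name `faithfulness_from_llm_answer` and the constant `FINAL_VERDICT_MARKER` are hypothetical, and the parsing assumes the answer ends with the "final verdict" line in the format shown above.

```python
import re

# Marker text as it appears in the LLM output; matching is case-insensitive.
FINAL_VERDICT_MARKER = "final verdict for each statement in order:"


def faithfulness_from_llm_answer(llm_answer: str) -> float:
    """Return the share of statements whose final verdict is 'yes'."""
    lowered = llm_answer.lower()
    start = lowered.rfind(FINAL_VERDICT_MARKER)
    if start == -1:
        raise ValueError("no final verdict line found")
    # Only inspect the rest of the line that follows the marker.
    tail_lines = lowered[start + len(FINAL_VERDICT_MARKER):].splitlines()
    verdict_line = tail_lines[0] if tail_lines else ""
    verdicts = re.findall(r"\b(?:yes|no)\b", verdict_line)
    if not verdicts:
        raise ValueError("final verdict line contains no yes/no verdicts")
    return verdicts.count("yes") / len(verdicts)


# Shortened stand-in for the LLM answer; only the last line matters here.
llm_answer = (
    "1. statement one\n"
    "explanation: supported by the context. verdict: yes.\n"
    "final verdict for each statement in order: yes. no. yes. yes. yes. yes."
)
score = faithfulness_from_llm_answer(llm_answer)
print(score)
```

Raising an error when the marker is missing keeps a malformed LLM reply from silently turning into a score of 0 or 1.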

answer_relevancy

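Measures how relevant and to the point the answer is with respect to the question.

For a given generated answer, Ragas uses an LLM to generate the questions that the answer could plausibly be answering, then computes their similarity to the actual question.

The prompt, built from the answer: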

Generate question for the given answer.
Answer:
The PSLV-C56 mission is scheduled to be launched on Sunday, 30 July 2023 at 06:30 IST / 01:00 UTC. It will be launched from the Satish Dhawan Space Centre, Sriharikota, Andhra Pradesh, India 
Question: When is the scheduled launch date and time for the PSLV-C56 mission, and where will it be launched from?
Answer:膳食分析是一种评估个人饮食中摄入的食物种类和数量的方法。通过对一周内所吃的食物进行详细分析,可以了解人们摄入的营养素种类和含量,从而发现可能存在的营养素缺乏或过量的情况。膳食分析对于评估食物中的营养素含量、摄入量以及一些影响营养需求的因素(如糖、盐、咖啡因、酒精等)是有用的。然而,膳食分析并不能综合考虑到个体的营养素需求、营养素吸收和利用的情况,因此并不是评估个人营养状况的唯一方法。
Question:

Result:

[
"What is dietary analysis and how is it used to evaluate an individual's nutritional intake?", 
"What is dietary analysis and how is it used to assess an individual's nutritional intake?", 
"What is dietary analysis and how is it used to assess an individual's nutritional intake?"
]

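Then compute the similarity between the original question and each of the three generated questions.

Similarity computation: embed the texts with text-embedding-ada-002, then take the cosine similarity.

import typing as t

import numpy as np

# Method excerpt: self.embedding is the embedding client (text-embedding-ada-002 here).
def calculate_similarity(
    self: t.Self, question: str, generated_questions: list[str]
):
    # Shape (1, d): embedding of the original question.
    question_vec = np.asarray(self.embedding.embed_query(question)).reshape(1, -1)
    # Shape (n, d): embeddings of the generated questions.
    gen_question_vec = np.asarray(
        self.embedding.embed_documents(generated_questions)
    )
    # Product of the vector norms, one entry per generated question.
    norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
        question_vec, axis=1
    )
    # Cosine similarity = dot product / product of norms.
    return np.dot(gen_question_vec, question_vec.T).reshape(-1) / norm

Result:

[0.83663467 0.83484782 0.83484782]

Then average:

answer_relevancy_score: 0.8354434364200779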
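To make the arithmetic concrete without an embedding service, here is a dependency-free sketch of the same computation. The helper names `cosine_similarity` and `answer_relevancy` and the toy 3-dimensional vectors are assumptions made for illustration; real embeddings would come from a model such as text-embedding-ada-002.

```python
import math
from statistics import fmean


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def answer_relevancy(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the question and each generated question."""
    return fmean([cosine_similarity(question_vec, g) for g in generated_vecs])


# Toy vectors standing in for real embeddings (hypothetical values).
question_vec = [1.0, 0.0, 0.0]
generated_vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
score = answer_relevancy(question_vec, generated_vecs)

# Averaging the three similarities reported above gives the final score.
reported = fmean([0.83663467, 0.83484782, 0.83484782])
print(round(reported, 6))
```

The plain mean of the three printed similarities agrees with the reported answer_relevancy_score to about nine decimal places; the small remaining difference is consistent with the printed values being rounded.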

context_relevancy

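Measures the signal-to-noise ratio of the retrieved context. Given a question, Ragas calls an LLM to pick out the sentences in the retrieved context that are needed to answer it. The score is the ratio of those required sentences to the total number of sentences in the context.

Input: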

Task: Candidate sentence extraction.
Given the question and context, extract minimum number of sentences from context required to answer the question. If the context do not contain information required to answer the question return "No candidate sentences found".
question: Which equation is known as worlds most famous equation?
context:
Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[5] widely ranked among the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century.
His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called "the world's most famous equation".
sentences:His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called "the world's most famous equation".
question: Were Scott Derrickson and Ed Wood of the same nationality?
context :
Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer He lives in Los Angeles, California He is best known for directing horror films such as "Sinister", "The Exorcism of Emily Rose", and "Deliver Us From Evil", as well as the 2016 Marvel Cinematic Universe installment, "Doctor Strange"Tyler Bates is an American musician, music producer, and composer for films, television, and video games. Adam Collis is an American filmmaker and actor.Conrad Brooks is an American actor.Edward Davis Wood Jr. (October 10, 1924 – December 10, 1978) was an American filmmaker, actor, writer, producer, and director.
sentences:Scott Derrickson (born July 16, 1966) is an American director, screenwriter and producer. Edward Davis Wood Jr. (October 10, 1924 – December 10, 1978) was an American filmmaker, actor, writer, producer, and director.
question: How many were killed in the Tiananmen Square incident?
context:
Tiananmen Square incident, also called June Fourth incident or 6/4, series of protests and demonstrations in China in the spring of 1989 that culminated on the night of June 3–4 with a government crackdown on the demonstrators in Tiananmen Square in Beijing.
sentences: No candidate sentences found.
question:膳食分析是什么?
context:
你想要达到什么样的健康程度?如果你想了解自己精神和身体的全部潜在能力,确定你的最佳营养需求是非常必要的。但是如果你的营养需求很特别,那么你该如何发现这一点呢?从1980年开始,我制定并完善了一套分析人体营养需求的体系,此体系建立在评估影响个人营养需求的主要因素的基础上。此体系已经在10万人当中进行了测试并得到了验证,现在已被世界各国的临床营养学家所应用。无数人已经从中获益,因此,我知道从这个体系中可以得到什么样的结果。这些结果包括:更敏锐的思维、更好的记忆力、更强的体力、更好的体重控制效果、降低胆固醇含量、使得病情好转。尽管多数被诊断为有病的人已经受益于个人健康计划,但这个健康计划是以预防为主要目标,而不是以治病为主的。如果你正在进行医药治疗,请确认这个营养补充方案与你接受的治疗措施是否能够很好地相容。影响你营养需求的因素,至少有8项因素会影响你的营养需求量。年龄、性别以及运动量都是很常见的影响因素,但污染、压力、基因遗传、你过去的健康状况,当然还有膳食所能提供的营养素以及反营养物质,却不是能够轻易了解的。但是所有的这些细节以及更多的因素,都必须纳入考虑当中。通过4种分析方法——膳食分析、生化分析、症状分析、生活方式分析,可以了解这些细节。膳食分析这种方式似乎是从容易着手的地方开始:搞清楚人们摄入的食物是什么,就能发现缺乏的营养素是什么。
但不幸的是,如果对一周内所吃的食物进行详细分析,并不能综合考虑到食物营养素含量的变异性、个人的营养素需求,或者营养素是否能被吸收、吸收后的营养素又是否能被有效地利用等情况。我见过许多人的饮食质量很高,却依然有维生素缺乏的症状。对其中绝大部分人来说,问题在于营养素吸收不良。这些可变因素,使得某些通过计算机进行的膳食分析不如设想的准确。膳食分析对于评估那些能影响到人们营养素需求的食物很有用处,比如糖、盐、咖啡、茶、酒精、食品添加剂和防腐剂。另一些因素,如脂肪、碳水化合物、蛋白质和热量的摄入量,也可由膳食分析进行评估。诸如头发矿物质分析与维生素血液分析这样的实验,可以提供关于体内生物化学状态的确切信息,也能让营养顾问了解你身体的确切营养状况。但并不是所有这些实验都能对规划你的营养方案提供有用的信息。为了让这些信息尽可能精确,任何关于维生素或矿物质的检测都必须反映出营养素在体内发挥作用的能力。例如,铁是构成血红细胞的必需元素,有助于体内氧的运输。通过测定细胞中铁的营养状况,可能会对你的铁需求有一个很好的估测。从另一方面来讲,维生素氏在血液中没有像铁那样的直接作用。它主要作用于大脑当中的化学物质5-羟色胺,或在甲基化反应中起作用,降低同型半胱氨酸含量。所以,仅仅测定血液中的维生素B6,并不能真实地反映你是否获得了充足的营养素供给以及体内的维生素B6是否运转正常。相比之下,检测同型半胱氨酸浓度等项目意义更大。如果你已经获得了充足的维生素、维生素B和叶酸,那么你的同型半胱氨酸浓度就很低,但假如同型半胱氨酸浓度很高,则表明你需要更多的其中一种或多种营养素。我们称这种测试方式为功能测试,因为这个测试能够精确地测定你体内生化反应的某项特殊功能是否运转正常。
sentences:

Result:

[
'膳食分析这种方式似乎是从容易着手的地方开始:搞清楚人们摄入的食物是什么,就能发现缺乏的营养素是什么。', 
'膳食分析是从容易着手的地方开始:搞清楚人们摄入的食物是什么,就能发现缺乏的营养素是什么。'
]

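Split each result into sentences and look them up in the context; the fraction of a result's sentences that appear in the original text is its entry in overlap_scores.
Suppose this gives [1.0, 1.0].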

Then compute the similarity between the two results with BERT or Jaccard to get agr_score.
context_relevancy follows as:

context_relevancy = agr_score * mean(overlap_scores)
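The formula can be sketched end to end with the Jaccard option. This is an illustrative sketch under stated assumptions, not the library's code: all function names are hypothetical, sentences are matched verbatim against the context, and Jaccard is computed on lowercase word sets. For Chinese text, `\w+` matches whole runs of characters, so a word segmenter or character-level sets would be needed instead.

```python
import re
from itertools import combinations
from statistics import fmean


def split_sentences(text: str) -> list[str]:
    """Split on Western and Chinese sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"[.!?。!?]+", text) if s.strip()]


def overlap_score(candidate: str, context: str) -> float:
    """Fraction of the candidate's sentences that occur verbatim in the context."""
    sentences = split_sentences(candidate)
    if not sentences:
        return 0.0
    return sum(s in context for s in sentences) / len(sentences)


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on lowercase word sets (assumed tokenization)."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def context_relevancy(candidates: list[str], context: str) -> float:
    """agr_score * mean(overlap_scores), following the formula above."""
    if not candidates:
        return 0.0
    overlap_scores = [overlap_score(c, context) for c in candidates]
    pairs = list(combinations(candidates, 2))
    # With a single candidate there is nothing to compare, so agreement is 1.0.
    agr_score = fmean([jaccard(a, b) for a, b in pairs]) if pairs else 1.0
    return agr_score * fmean(overlap_scores)


# Toy English data (hypothetical), standing in for two LLM extraction results.
context = (
    "Dietary analysis records what people eat. It can reveal missing nutrients. "
    "Hair mineral analysis is a lab test."
)
candidates = [
    "Dietary analysis records what people eat. It can reveal missing nutrients.",
    "Dietary analysis records what people eat.",
]
score = context_relevancy(candidates, context)
print(score)
```

In this toy case both candidates are found in the context, so the overlap scores are 1.0 each, and the final score is driven by how much the two extractions agree with each other.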

context_recall

Before measuring this, you need to provide a ground_truth.
The prompt:

Given a context, and an answer, analyze each sentence in the answer and classify if the sentence can be attributed to the given context or not.
Think in steps and reason bofore coming to conclusion. 
context: Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist,widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called "the world's most famous equation". He received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect", a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.
answer: Albert Einstein born in 14 March 1879 was  German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. He received the 1921 Nobel Prize in Physics "for his services to theoretical physics. He published 4 papers in 1905.  Einstein moved to Switzerland in 1895 
classification
1. Albert Einstein born in 14 March 1879 was  German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. The date of birth of Einstein is mentioned clearly in the context. So [Attributed]
2. He received the 1921 Nobel Prize in Physics "for his services to theoretical physics. The exact sentence is present in the given context. So [Attributed]
3. He published 4 papers in 1905. There is no mention about papers he wrote in given the context. So [Not Attributed]
4. Einstein moved to Switzerland in 1895. There is not supporting evidence for this in the given the context. So [Not Attributed]
context:{context}
answer:{ground_truth}
classification:

Result:

[
'#############[Attributed]',
'#############[Attributed]',
'#############[Not Attributed]',
……
]

The fraction of results marked [Attributed] is the context_recall.
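Counting that fraction takes only a few lines. A minimal sketch, assuming each classification line ends with either the `[Attributed]` or `[Not Attributed]` tag as in the prompt; the function name `context_recall` and the sample lines are illustrative, not taken from the library.

```python
def context_recall(classifications: list[str]) -> float:
    """Share of ground-truth sentences the LLM tagged as [Attributed]."""
    if not classifications:
        raise ValueError("no classification lines to score")
    # "[Attributed]" does not occur inside "[Not Attributed]", so a plain
    # substring check distinguishes the two tags.
    attributed = sum("[Attributed]" in line for line in classifications)
    return attributed / len(classifications)


# Hypothetical classification lines in the format of the prompt above.
classifications = [
    "The date of birth is stated in the context. So [Attributed]",
    "The exact sentence is present in the context. So [Attributed]",
    "There is no mention of papers in the context. So [Not Attributed]",
]
recall = context_recall(classifications)
print(recall)
```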

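Personal take:

On the library: it does not support OpenAI accessed through Azure; you have to modify the source code for that.

On the four metrics:

context_relevancy: questionable (it is unclear how well stsb-TinyBERT-L-4, the model used to compute similarity for agr_score, works on Chinese)
context_recall: not sure which scenarios call for it
faithfulness: reliable
answer_relevancy: most likely reliable (depends on how accurately text-embedding-ada-002 embeddings capture semantics)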

Do not reproduce without permission: 首席AI分享圈 » Ragas: Evaluating RAG Retrieval QA Accuracy and Answer Relevance
