Confident AI: An Automated LLM Evaluation Framework for Comparing the Output Quality of Different Prompts

General Introduction

DeepEval is an easy-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest but specializes in unit testing LLM outputs. DeepEval incorporates the latest research into metrics such as G-Eval, hallucination detection, answer relevancy, and RAGAS to evaluate LLM outputs. Whether your application is built with RAG or fine-tuning, DeepEval can help you determine the best hyperparameters to improve model performance. In addition, it can generate synthetic datasets, integrates seamlessly into any CI/CD environment, and provides red-teaming capabilities covering more than 40 security vulnerabilities. The framework is also fully integrated with Confident AI, supporting the entire evaluation lifecycle on that platform.


Feature List

  • Multiple LLM evaluation metrics such as G-Eval, hallucination detection, answer relevancy, RAGAS, etc.
  • Support for custom evaluation metrics that integrate automatically into the DeepEval ecosystem (see the sketch after this list)
  • Generation of synthetic datasets for evaluation
  • Seamless integration into any CI/CD environment
  • Red team testing feature to detect more than 40 security vulnerabilities
  • Benchmarking with support for multiple benchmarks such as MMLU, HellaSwag, DROP, etc.
  • Fully integrated with Confident AI to support the entire assessment lifecycle from dataset creation to debugging of assessment results
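
The custom-metrics item above deserves a concrete illustration. The sketch below subclasses DeepEval's BaseMetric to build a trivial length check. The class name LengthMetric, the max_length parameter, and the scoring rule are hypothetical choices for illustration only; check the official documentation for the exact interface your DeepEval version expects.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    """Hypothetical custom metric: passes if the actual output stays under max_length characters."""

    def __init__(self, max_length: int = 200):
        self.threshold = max_length

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1 when the output is short enough, 0 otherwise
        self.success = len(test_case.actual_output) <= self.threshold
        self.score = 1.0 if self.success else 0.0
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Async counterpart; here it simply delegates to the synchronous version
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"

A metric defined this way can then be passed to assert_test or used on its own, just like the built-in metrics shown in the sections below.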

 

Usage Guide

Installation

You can install DeepEval via pip:

pip install -U deepeval

It is recommended to create an account to generate shareable cloud-based test reports:

deepeval login

Writing Test Cases

Create a test file:

touch test_chatbot.py

Write your first test case in test_chatbot.py:

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    # LLM-as-a-judge metric that grades the actual output against the expected output
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output of your LLM application
        actual_output="We offer a 30-day full refund at no extra costs.",
        # Illustrative reference answer, required because the metric uses EXPECTED_OUTPUT
        expected_output="You are eligible for a 30 day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])

Set your OPENAI_API_KEY as an environment variable:

export OPENAI_API_KEY="..."

Run the test file in the CLI:

deepeval test run test_chatbot.py

Using Standalone Metrics

DeepEval is extremely modular, making it easy for anyone to use its metrics:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)
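
If you prefer not to involve pytest, DeepEval also ships an evaluate() helper that runs test cases against metrics directly in a script. The following is a minimal sketch reusing the same example data; the exact console output and optional arguments may vary between versions.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

# Run the metric(s) against the test case(s) and print a summary of scores and reasons
evaluate([test_case], [metric])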

Evaluating Datasets in Bulk

In DeepEval, datasets are just collections of test cases. Here's how to evaluate these datasets in bulk:

import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize("test_case", dataset)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])

Run the test file in the CLI (the -n flag runs test cases in parallel across multiple processes):

deepeval test run test_<filename>.py -n 4

LLM Evaluation with Confident AI

Log in to the DeepEval platform:

deepeval login

Run the test file:

deepeval test run test_chatbot.py

Once the test run is complete, the CLI prints a link; paste it into your browser to view the results.
