Confident AI: An Automated LLM Evaluation Framework for Comparing the Output Quality of Different Prompts

General Introduction

DeepEval is an easy-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest but specializes in unit testing LLM outputs. DeepEval incorporates the latest research into metrics such as G-Eval, hallucination detection, answer relevancy, and RAGAS to evaluate LLM outputs. Whether your application is built with RAG or fine-tuning, DeepEval can help you determine the best hyperparameters to improve model performance. In addition, it can generate synthetic datasets, integrates seamlessly into any CI/CD environment, and provides red-teaming capabilities covering more than 40 security vulnerabilities. The framework is also fully integrated with Confident AI, supporting the entire evaluation lifecycle on that platform.


Feature List

  • Multiple LLM evaluation metrics such as G-Eval, hallucination detection, answer relevancy, RAGAS, etc.
  • Support for custom evaluation metrics that integrate automatically into the DeepEval ecosystem (see the sketch after this list)
  • Generation of synthetic datasets for evaluation
  • Seamless integration into any CI/CD environment
  • Red team testing feature to detect more than 40 security vulnerabilities
  • Benchmarking with support for multiple benchmarks such as MMLU, HellaSwag, DROP, etc.
  • Fully integrated with Confident AI to support the entire assessment lifecycle from dataset creation to debugging of assessment results
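
The custom-metrics item above deserves a concrete illustration. The sketch below subclasses DeepEval's BaseMetric to build a trivial length check. The class name LengthMetric, the max_length parameter, and the scoring rule are hypothetical choices for illustration only; check the official documentation for the exact interface your DeepEval version expects.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    """Hypothetical custom metric: passes if the actual output stays under max_length characters."""

    def __init__(self, max_length: int = 200):
        self.threshold = max_length

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1 when the output is short enough, 0 otherwise
        self.success = len(test_case.actual_output) <= self.threshold
        self.score = 1.0 if self.success else 0.0
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Async counterpart; here it simply delegates to the synchronous version
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"

A metric defined this way can then be passed to assert_test or used on its own, just like the built-in metrics shown in the sections below.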

 

Usage Guide

Installation

You can install DeepEval via pip:

pip install -U deepeval

It is recommended to create an account to generate shareable cloud-based test reports:

deepeval login

Writing Test Cases

Create a test file:

touch test_chatbot.py

Write your first test case in test_chatbot.py:

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    # LLM-as-a-judge metric that grades the actual output against the expected output
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output of your LLM application
        actual_output="We offer a 30-day full refund at no extra costs.",
        # Illustrative reference answer, required because the metric uses EXPECTED_OUTPUT
        expected_output="You are eligible for a 30 day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])

Set your OPENAI_API_KEY as an environment variable:

export OPENAI_API_KEY="..."

Run the test file in the CLI:

deepeval test run test_chatbot.py

Using Standalone Metrics

DeepEval is extremely modular, making it easy for anyone to use its metrics:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)
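
If you prefer not to involve pytest, DeepEval also ships an evaluate() helper that runs test cases against metrics directly in a script. The following is a minimal sketch reusing the same example data; the exact console output and optional arguments may vary between versions.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

# Run the metric(s) against the test case(s) and print a summary of scores and reasons
evaluate([test_case], [metric])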

Evaluating Datasets in Bulk

In DeepEval, datasets are just collections of test cases. Here's how to evaluate these datasets in bulk:

import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])

dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize("test_case", dataset)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])

Run the test file in the CLI (the -n flag runs test cases in parallel across multiple processes):

deepeval test run test_<filename>.py -n 4

LLM Evaluation with Confident AI

Log in to the DeepEval platform:

deepeval login

Run the test file:

deepeval test run test_chatbot.py

Once the test run is complete, the CLI prints a link; paste it into your browser to view the results.
