
Confident AI: A Framework for Automated Large Language Model Evaluation and for Comparing the Output Quality of Different Prompts

General Introduction

DeepEval is an easy-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest, but specializes in unit testing LLM outputs. DeepEval incorporates the latest research into metrics such as G-Eval, hallucination detection, answer relevancy, and RAGAS to evaluate LLM outputs. Whether your application is built with RAG or with fine-tuning, DeepEval can help you determine the best hyperparameters to improve model performance. In addition, it can generate synthetic datasets, integrates seamlessly into any CI/CD environment, and provides red-teaming capabilities covering more than 40 security vulnerabilities. The framework is also fully integrated with Confident AI, supporting the entire evaluation lifecycle on that platform.


Feature List

  • Multiple LLM evaluation metrics, such as G-Eval, hallucination detection, answer relevancy, RAGAS, etc.
  • Support for custom evaluation metrics that integrate automatically into the DeepEval ecosystem (see the sketch after this list)
  • Generation of synthetic datasets for evaluation
  • Seamless integration into any CI/CD environment
  • Red team testing feature to detect more than 40 security vulnerabilities
  • Benchmarking with support for multiple benchmarks such as MMLU, HellaSwag, DROP, etc.
  • Fully integrated with Confident AI to support the entire evaluation lifecycle, from dataset creation to debugging of evaluation results
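
To make the custom-metrics bullet above concrete, here is a minimal sketch of a self-defined metric. It assumes DeepEval's documented pattern of subclassing BaseMetric; the LengthMetric class and its max_chars parameter are hypothetical names invented for this illustration.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    # Hypothetical custom metric: passes when the answer stays reasonably concise
    def __init__(self, max_chars: int = 300, threshold: float = 0.5):
        self.max_chars = max_chars
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Score 1.0 for concise answers, 0.0 otherwise
        self.score = 1.0 if len(test_case.actual_output) <= self.max_chars else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase, *args, **kwargs) -> float:
        # The async version simply reuses the synchronous logic here
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"

Once defined, such a metric can be passed to assert_test alongside the built-in ones, for example assert_test(test_case, [correctness_metric, LengthMetric(max_chars=200)]).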

 

Usage Guide

Installation

You can install DeepEval via pip:

pip install -U deepeval

It is recommended to create an account to generate shareable cloud-based test reports:

deepeval login

Writing Test Cases

Create a test file:

touch test_chatbot.py

Write your first test case in test_chatbot.py:

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output of your LLM application
        actual_output="We offer a 30-day full refund at no extra costs.",
        # The expected output is required because the metric's evaluation_params reference it
        expected_output="You are eligible for a 30 day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30-day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])

Set your OPENAI_API_KEY as an environment variable:

export OPENAI_API_KEY="..."

Run the test file in the CLI:

deepeval test run test_chatbot.py

Using Standalone Metrics

DeepEval is extremely modular, making it easy for anyone to use its metrics:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)
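
If you prefer not to go through Pytest at all, DeepEval also exposes an evaluate() function that takes a list of test cases and a list of metrics directly. A minimal, self-contained sketch of the same check (argument order follows the official examples, but double-check against your installed version):

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

# Runs the metric outside of Pytest and prints a summary of the results
evaluate([test_case], [AnswerRelevancyMetric(threshold=0.7)])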

Evaluating Datasets in Bulk

In DeepEval, datasets are just collections of test cases. Here's how to evaluate these datasets in bulk:

import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize("test_case", dataset)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])

Run the test file in the CLI:

deepeval test run test_<filename>.py -n 4
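
The same dataset can also be evaluated programmatically, without the Pytest-style CLI run. A minimal sketch, assuming evaluate() accepts the dataset object defined above (depending on your version this may also be available as dataset.evaluate(...)):

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric

# Evaluates every test case in the dataset against both metrics in one call
evaluate(dataset, [HallucinationMetric(threshold=0.3), AnswerRelevancyMetric(threshold=0.5)])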

LLM Assessment with Confident AI

Log in to the DeepEval platform:

deepeval login

Run the test file:

deepeval test run test_chatbot.py

Once the test run is complete, the CLI prints a link; paste it into your browser to view the results.
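
Because the point of this workflow is to compare the output quality of different prompts, it helps to attach the prompt template and model name to each test run so Confident AI can group results by them. Recent DeepEval versions provide a log_hyperparameters decorator for this; the sketch below assumes that API and uses a hypothetical prompt_template string, so verify the exact signature against your installed version:

import deepeval

# Hypothetical prompt template used by your chatbot; replace with your own
prompt_template = "You are a helpful customer support assistant. Answer the question: {question}"

@deepeval.log_hyperparameters(model="gpt-4", prompt_template=prompt_template)
def hyperparameters():
    # Any extra parameters you want to compare across runs on Confident AI
    return {"temperature": 1, "chunk_size": 500}

Run deepeval test run test_chatbot.py as usual; the logged hyperparameters then appear alongside each test run, making it straightforward to compare how different prompt templates score on the same dataset.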
