General Introduction
DeepEval is an easy-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest, but specializes in unit testing LLM outputs. DeepEval incorporates the latest research into metrics such as G-Eval, hallucination detection, answer relevancy, and RAGAS to evaluate LLM outputs. Whether your application is built with RAG or fine-tuning, DeepEval can help you determine the best hyperparameters to improve model performance. In addition, it can generate synthetic datasets, integrates seamlessly into any CI/CD environment, and provides red-teaming capabilities covering more than 40 security vulnerabilities. The framework is also fully integrated with Confident AI to support the entire evaluation lifecycle on the platform.
Feature List
- Multiple LLM evaluation metrics, such as G-Eval, hallucination detection, answer relevancy, RAGAS, etc.
- Support for custom evaluation metrics that automatically integrate into the DeepEval ecosystem (see the sketch after this list)
- Generation of synthetic datasets for evaluation
- Seamless integration into any CI/CD environment
- Red team testing feature to detect more than 40 security vulnerabilities
- Benchmarking with support for multiple benchmarks such as MMLU, HellaSwag, DROP, etc.
- Full integration with Confident AI, supporting the entire evaluation lifecycle from dataset creation to debugging of evaluation results
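As an illustration of the custom metric support mentioned above, here is a minimal sketch based on DeepEval's BaseMetric interface (measure, a_measure, is_successful). The LengthMetric name and its word-count rule are hypothetical, invented only for illustration:
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    # Hypothetical metric: passes if the actual output stays under a word limit
    def __init__(self, max_words: int = 100):
        self.threshold = max_words

    def measure(self, test_case: LLMTestCase) -> float:
        word_count = len(test_case.actual_output.split())
        self.score = 1.0 if word_count <= self.threshold else 0.0
        self.success = self.score >= 1.0
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Reuse the synchronous logic for the async code path
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"
A custom metric like this can then be passed to assert_test alongside the built-in metrics shown in the examples below.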
Usage
Installation
You can install DeepEval via pip:
pip install -U deepeval
It is recommended to create an account to generate shareable cloud-based test reports:
deepeval login
Writing Test Cases
Create a test file:
touch test_chatbot.py
Write your first test case in test_chatbot.py:
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="We offer a 30-day full refund at no extra costs.",
        expected_output="You are eligible for a 30-day full refund at no extra cost.",
        retrieval_context=["All customers are eligible for a 30-day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])
Set your OPENAI_API_KEY as an environment variable:
export OPENAI_API_KEY="..."
Run the test file in the CLI:
deepeval test run test_chatbot.py
Using Metrics Standalone
DeepEval is extremely modular, making it easy for anyone to use its metrics:
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)
Evaluating Datasets in Bulk
In DeepEval, datasets are just collections of test cases. Here's how to evaluate these datasets in bulk:
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
first_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
second_test_case = LLMTestCase(input="...", actual_output="...", context=["..."])
dataset = EvaluationDataset(test_cases=[first_test_case, second_test_case])

@pytest.mark.parametrize("test_case", dataset)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])
Run the test file in the CLI:
deepeval test run test_<filename>.py -n 4
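If you prefer not to go through Pytest, the same test cases can also be scored directly in Python. The snippet below is a sketch, assuming deepeval's top-level evaluate function and the test_cases attribute of EvaluationDataset:
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric

# Evaluate every test case in the dataset without the Pytest integration
evaluate(
    test_cases=dataset.test_cases,
    metrics=[HallucinationMetric(threshold=0.3), AnswerRelevancyMetric(threshold=0.5)]
)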
LLM Evaluation with Confident AI
Log in to the DeepEval platform:
deepeval login
Run the test file:
deepeval test run test_chatbot.py
Once the test run is complete, you will see a link in the CLI; paste it into your browser to view the results.