DeepEval is often called the "Pytest for LLMs" because it allows you to write evaluation scripts that feel exactly like standard software unit tests.
For your CLI JSON project, DeepEval is particularly useful because every metric produces a reason alongside its score. If a command fails the test, it tells you exactly why (e.g., "The model suggested the --force flag, but the JSON context only mentions --recursive").
1. Prerequisites
pip install deepeval
DeepEval's built-in metrics are LLM-evaluated and default to OpenAI models, so you will also need an OPENAI_API_KEY in your environment (or a custom evaluation model configured) before the tests can run.
2. The DeepEval Test File (test_cli_rag.py)
This script uses three core RAG metrics (Faithfulness, Answer Relevancy, and Contextual Precision) to test your CLI commands.
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric, ContextualPrecisionMetric
# 1. Set up the metrics with passing thresholds
# A threshold of 0.7 means the score must reach at least 0.7 for the test to pass
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
precision_metric = ContextualPrecisionMetric(threshold=0.7)
def test_docker_ps_command():
    # --- SIMULATED RAG OUTPUT ---
    # In a real test, you would call your query_engine.query() here
    input_query = "How do I see all my containers, even stopped ones?"
    actual_output = "Use the command 'docker ps -a' to list all containers including stopped ones."
    expected_output = "Run 'docker ps -a' to list all containers, including stopped ones."
    retrieval_context = [
        "Command: docker ps. Description: List running containers. Examples: docker ps -a"
    ]

    # 2. Create the Test Case
    # expected_output is required by ContextualPrecisionMetric, which compares
    # each retrieved chunk against the ideal answer
    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        expected_output=expected_output,
        retrieval_context=retrieval_context
    )

    # 3. Assert the test with multiple metrics
    assert_test(test_case, [faithfulness_metric, relevancy_metric, precision_metric])
def test_non_existent_command():
    input_query = "How do I hack into NASA?"
    actual_output = "I'm sorry, I don't have information on that."
    retrieval_context = []  # Nothing found in your CLI JSON

    test_case = LLMTestCase(
        input=input_query,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )

    assert_test(test_case, [relevancy_metric])
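To turn the simulated output into a live test, swap the hard-coded strings for the output of your actual pipeline. Here is a minimal sketch that reuses the metrics defined above and assumes a LlamaIndex-style query_engine (a placeholder name for whatever engine you built over your CLI JSON) whose response exposes the answer text and the retrieved source nodes:

# Sketch only: `query_engine` is assumed to be the engine you built over your CLI JSON
def build_test_case(query: str) -> LLMTestCase:
    response = query_engine.query(query)
    return LLMTestCase(
        input=query,
        actual_output=str(response),
        # One context string per retrieved chunk
        retrieval_context=[node.get_content() for node in response.source_nodes],
    )

def test_docker_ps_live():
    test_case = build_test_case("How do I see all my containers, even stopped ones?")
    # ContextualPrecisionMetric is omitted here because it also needs an expected_output
    assert_test(test_case, [faithfulness_metric, relevancy_metric])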
3. Running the Test
You run this from your terminal just like a normal Python test suite:
deepeval test run test_cli_rag.py
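Because assert_test raises a standard assertion failure under the hood, the file should also work with a plain pytest run, though you lose DeepEval's own reporting and dashboard upload:

pytest test_cli_rag.py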
4. Why DeepEval is better than Ragas for CLI:
* The Dashboard: If you run deepeval login, all your results are uploaded to a web dashboard where you can see how your CLI tool's accuracy changes over time as you add more commands to your JSON.
* Strict Flags: You can create a custom GEval metric in DeepEval specifically to check for "Flag Accuracy", ensuring the LLM never hallucinates a CLI flag that isn't in your documentation (see the sketch after this list).
* CI/CD Integration: You can block a GitHub Pull Request from merging if the "CLI Accuracy" score drops below 80%.
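A minimal sketch of such a "Flag Accuracy" metric using DeepEval's GEval; the name, criteria wording, and threshold here are illustrative assumptions, not part of the setup above:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Sketch of a custom flag-accuracy metric; tune the criteria and threshold to your project
flag_accuracy_metric = GEval(
    name="Flag Accuracy",
    criteria=(
        "Check that every CLI flag mentioned in the actual output (e.g. -a, --force) "
        "also appears in the retrieval context. Penalise any flag that does not."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.8,
)

# Use it alongside the RAG metrics in any test:
# assert_test(test_case, [faithfulness_metric, relevancy_metric, flag_accuracy_metric])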
Comparison: Ragas vs. DeepEval
| Feature | Ragas | DeepEval |
|---|---|---|
| Primary Use | Research / Bulk Data Eval | Engineering / Unit Testing |
| Output | Raw Scores (0.0 - 1.0) | Pass/Fail + Detailed Reasoning |
| Integration | Pandas / Notebooks | Pytest / GitHub Actions |
| UI | None (requires 3rd party) | Built-in Cloud Dashboard |
Would you like me to show you how to create a "Custom Flag Metric" to ensure the LLM never invents fake CLI arguments?