Confident AI / DeepEval

Pytest-style LLM unit testing in DeepEval plus the Confident AI hosted layer for dataset management and dashboards.

Positioning

DeepEval is an open-source LLM testing framework styled after pytest: tests live alongside code, run on every PR, and read like unit tests. The library covers G-Eval (rubric-driven LLM judge), hallucination, RAG faithfulness and contextual relevance, agentic tool-call evaluation, and multi-turn conversational metrics, plus an OWASP Top 10 for Agentic Apps red-team suite. Confident AI is the hosted layer that adds dashboards, dataset versioning, and team workflows on top of the OSS library ¹.

The G-Eval rubric pattern comes from the original G-Eval paper ²; DeepEval's implementation is one of the most-used in production. The OSS project lists 500+ companies as users, with Panasonic, Toshiba, Samsung, and Epic Games among the named customers.
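The scoring step described in the G-Eval paper can be sketched in a few lines. Rather than taking the single rating the judge emits, the paper weights each candidate rating by the probability the model assigns to it and sums the result. The sketch below assumes those per-rating probabilities are already available; the function name and the example numbers are illustrative, and it is not a description of DeepEval's internals.

```python
def weighted_geval_score(rating_probs: dict[int, float]) -> float:
    """Probability-weighted rating: sum(p(s) * s), normalised over the ratings given."""
    total = sum(rating_probs.values())
    if total <= 0:
        raise ValueError("rating probabilities must sum to a positive value")
    return sum(rating * p for rating, p in rating_probs.items()) / total


# Hypothetical judge output on a 1-5 rubric: most mass on 4, some on 3 and 5.
example_probs = {3: 0.2, 4: 0.5, 5: 0.3}
score = weighted_geval_score(example_probs)
print(f"weighted score: {score:.2f}")
```

The weighted form yields a continuous score, so two outputs that would both receive a flat "4" can still be ranked against each other.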

Strengths

Pytest ergonomics. Tests run with deepeval test run, integrate with GitHub Actions, and fail the build on regressions. Developers who already write pytest do not learn a new harness.
Wide metric library. G-Eval, hallucination, summarization, faithfulness, contextual relevance, answer relevance, bias, toxicity, tool correctness, task completion, and multi-turn conversational quality, all callable from the same API.
Active red-team suite. The DeepTeam module covers the OWASP Top 10 for Agentic Apps, which is a useful starting point for adversarial testing without rolling your own attack library.

Limitations

Hosted upsell. The OSS library is genuinely useful, but the most polished workflows (dataset versioning, team dashboards) live in Confident AI. Teams need to decide which workflows stay in the OSS library and which move to the hosted product.
LLM-judge bias still applies. G-Eval inherits the standard judge biases (position, verbosity, self-preference) flagged in the LLM-judge literature. Calibration against humans remains the user's responsibility.
Vendor-published benchmarks. Performance and metric-quality claims come from Confident AI itself; treat them as claims to validate on your own data before relying on them.
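One way to act on the calibration point above is to have humans label a sample of outputs and measure how well the judge agrees with them. The sketch below computes Cohen's kappa, which corrects raw agreement for chance, in plain Python. The label lists are made-up illustration data and the function is not part of DeepEval.

```python
from collections import Counter


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Fraction of items where judge and human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical sample: LLM-judge verdicts vs. human verdicts on six outputs.
judge = ["pass", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 means the judge tracks human verdicts closely; a value near 0 means agreement is no better than chance, which suggests reworking the rubric or threshold before trusting the metric in CI.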

Best fit

Offline regression suites in CI, RAG eval with the built-in faithfulness and contextual-relevance metrics, agentic tool-call evaluation, and red-team programs that want a structured starting point with the OWASP categories.

Getting started

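from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase


def run_rag(question: str) -> str:
    # Hypothetical placeholder: replace with a call into your own RAG pipeline.
    return "Yes, the lentil bowl is vegan."


def test_rag_answer():
    question = "Is the lentil bowl vegan?"
    case = LLMTestCase(
        input=question,
        actual_output=run_rag(question),
        # Passages the retriever returned; FaithfulnessMetric checks the answer against them.
        retrieval_context=["Lentil bowl ingredients: lentils, oil, salt."],
        expected_output="Yes.",
    )
    # Both metrics call an LLM judge, so provider credentials must be configured.
    # The test fails if either score falls below its threshold.
    assert_test(case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ])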

Run with deepeval test run test_app.py. The CLI prints per-test scores and a summary; results push to Confident AI if the project is configured.

Pricing notes

DeepEval is free under Apache 2.0. Confident AI has hosted plans with team and enterprise tiers; pricing is sales-led for higher tiers. Confirm current rates with the vendor. The OSS library is genuinely standalone, so teams that want a CI regression suite without the hosted dashboard pay nothing for the framework itself; LLM-judge calls are billed per token to the underlying provider. Teams that need dataset versioning, team workflows, or compliance-grade reporting are the audience for the hosted SKU.
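Because judge calls are billed per token, the cost of a CI run can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: every number is a placeholder rather than a real price or a measured token count, and it uses a single blended per-token rate where providers typically price input and output tokens separately.

```python
def estimate_judge_cost(
    test_cases: int,
    metrics_per_case: int,
    judge_calls_per_metric: int,
    tokens_per_call: int,
    usd_per_million_tokens: float,
) -> float:
    """Rough cost of one evaluation run, in USD, at a blended per-token price."""
    total_tokens = (
        test_cases * metrics_per_case * judge_calls_per_metric * tokens_per_call
    )
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs: 200 test cases, 2 metrics each, 3 judge calls per metric,
# 1,500 tokens per call, and an assumed blended price of $5 per million tokens.
cost_per_run = estimate_judge_cost(200, 2, 3, 1500, 5.0)
print(f"estimated cost per run: ${cost_per_run:.2f}")
```

Multiplying the per-run figure by the number of PR builds per month gives a first-order budget to compare against a hosted plan.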

Alternatives

Consider Promptfoo for a CLI-first OSS regression suite, Ragas for RAG-specific metrics, and Inspect AI for sandboxed agentic eval. For hosted dashboards without DeepEval, consider Braintrust, Langfuse, or LangSmith.

Citations and last verified

Verified 2026-05-29 against the DeepEval GitHub repository.

¹ DeepEval, https://github.com/confident-ai/deepeval.
² Liu et al., "G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment," arXiv:2303.16634.