Comet Opik

Apache 2.0-licensed LLM evaluation and observability with 30+ judge metrics, a prompt optimizer, and an in-IDE coding assistant.

Positioning

Opik is the open-source LLM eval and observability project from Comet, the experiment-tracking company. The codebase is licensed under Apache 2.0 with no separately licensed enterprise modules, which makes it one of the most permissive options among the major OSS eval hubs. The project also ships frequently: v2.0.51 was tagged on May 28, 2026, with 471 releases on record ¹.

The product covers tracing, prompt management, datasets, 30+ LLM-judge metrics (hallucination, contextual relevance, moderation, conversational quality, and more), a prompt optimizer with six search algorithms, and "Ollie," a coding assistant that proposes fixes to your codebase based on observed eval failures.

Strengths

Apache 2.0 license. No commercial-use restriction, no enterprise-only modules in the OSS core. You can resell, embed, and self-host it subject only to the license's attribution and notice terms, which usually keeps legal review short. Permissive licensing is one of the differentiators flagged in surveys of popular open-source AI tools ².
Wide metric library. Out-of-the-box LLM-judge metrics cover the most common cases (hallucination, faithfulness, relevance, conversational quality, moderation, custom rubric). The prompt optimizer adds an experimentation loop on top.
Public customer roster. AssemblyAI, NatWest, Stellantis, Uber, Netflix, Autodesk, and Etsy are listed as customers. The roster suggests real production traffic and an active user base.
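
To make the "custom rubric" idea above concrete, here is a minimal sketch in plain Python of what a deterministic rubric check boils down to. It does not use Opik: the RubricScore dataclass and the keyword_rubric function are hypothetical stand-ins invented for illustration, and a real Opik custom metric would follow the SDK's own base class instead.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """Hypothetical result container for illustration; not an Opik class."""
    name: str
    value: float  # 0.0 (fails every check) to 1.0 (passes every check)
    reason: str


def keyword_rubric(output: str, required: list[str], forbidden: list[str]) -> RubricScore:
    """Score an answer by the fraction of simple rubric checks it passes."""
    text = output.lower()
    # One check per required term (must appear) and per forbidden term (must not).
    checks = [(f"mentions '{term}'", term.lower() in text) for term in required]
    checks += [(f"avoids '{term}'", term.lower() not in text) for term in forbidden]
    if not checks:
        raise ValueError("rubric needs at least one required or forbidden term")
    failed = [label for label, ok in checks if not ok]
    value = (len(checks) - len(failed)) / len(checks)
    reason = "all checks passed" if not failed else "failed: " + ", ".join(failed)
    return RubricScore(name="keyword_rubric", value=value, reason=reason)


result = keyword_rubric(
    "Yes, the lentil bowl contains no dairy.",
    required=["lentil", "dairy"],
    forbidden=["probably"],
)
print(result)
```

A cheap check like this can run alongside LLM-judge metrics to catch obvious failures without spending judge tokens; it is not a substitute for them.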

Limitations

Younger than Phoenix and Langfuse. The release cadence is high, but the community is smaller than those of the more established OSS hubs, and some integrations that exist for Langfuse and Phoenix are not yet first-class.
Comet brand confusion. Opik is a separate product from Comet's experiment-tracking core. Teams onboarding the OSS hub sometimes confuse the two SKUs.
Vendor-published benchmarks. Performance comparisons from Comet are vendor-published; treat them as positioning rather than independent measurement.

Best fit

OSS observability hubs where Apache 2.0 is preferred over ELv2 or MIT-with-EE, RAG evaluation with the Opik metric library, and hybrid setups where the OSS hub runs alongside Promptfoo or DeepEval in CI.
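
For the hybrid CI setup, the gating step can be sketched independently of any tool. The snippet below assumes you have already exported per-metric scores from an eval run into a dictionary; the gate function, the metric names, and the numbers are all hypothetical and purely illustrative, not taken from Opik, Promptfoo, or DeepEval.

```python
from statistics import mean


def gate(scores: dict[str, list[float]], thresholds: dict[str, float]) -> list[str]:
    """Return one failure message per metric whose mean falls below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        values = scores.get(metric, [])
        if not values:
            # A metric with no scores is treated as a failure, not a silent pass.
            failures.append(f"{metric}: no scores recorded")
            continue
        average = mean(values)
        if average < minimum:
            failures.append(f"{metric}: mean {average:.2f} < {minimum:.2f}")
    return failures


# Illustrative numbers only, not real results.
example_scores = {"answer_relevance": [1.0, 1.0, 0.5], "faithfulness": [0.25, 0.75]}
example_thresholds = {"answer_relevance": 0.75, "faithfulness": 0.7}

failures = gate(example_scores, example_thresholds)
for message in failures:
    print(message)
```

In a CI job, a non-empty failure list would typically end the step with a non-zero exit code so the pipeline blocks the merge.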

Getting started

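import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import AnswerRelevance, Hallucination

# run_rag(question, context) -> str is your own RAG pipeline; it is not part of Opik.

@opik.track  # records a trace for each call; nested tracked calls show up as child spans
def app(question: str, context: list[str]) -> str:
    return run_rag(question, context)

def task(item: dict) -> dict:
    # evaluate() expects the task to return a dict; the metrics read the "output" key
    return {"output": app(item["question"], item["context"])}

dataset = opik.Opik().get_or_create_dataset("rag-dev")
dataset.insert([
    {"question": "Is the dish vegan?", "context": ["Lentil bowl: no dairy."], "expected": "yes"},
])

evaluate(
    experiment_name="rag-v1",
    dataset=dataset,
    task=task,
    scoring_metrics=[Hallucination(), AnswerRelevance()],
    # map the metrics' "input" argument to this dataset's "question" field
    scoring_key_mapping={"input": "question"},
)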

The trace, the metric scores, and the per-item judge reasoning appear in the Opik UI. Later runs against the same dataset are stored as separate experiments that can be compared side by side.
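
If you want the same run-over-run comparison outside the UI, for example in a script, the logic is small. The sketch below assumes per-item scores have been exported as plain dictionaries keyed by an item id; compare_runs, the ids, the scores, and the 0.05 tolerance are hypothetical choices for illustration, not Opik behavior.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, list[str]]:
    """Bucket item ids by how the candidate score moved relative to the baseline."""
    report = {"improved": [], "regressed": [], "unchanged": [], "missing": []}
    for item_id, old in baseline.items():
        if item_id not in candidate:
            # The candidate run never scored this item.
            report["missing"].append(item_id)
            continue
        delta = candidate[item_id] - old
        if delta > tolerance:
            report["improved"].append(item_id)
        elif delta < -tolerance:
            report["regressed"].append(item_id)
        else:
            report["unchanged"].append(item_id)
    return report


# Illustrative per-item scores, not real results.
rag_v1 = {"q1": 0.90, "q2": 0.40, "q3": 0.70, "q4": 0.80}
rag_v2 = {"q1": 0.50, "q2": 0.80, "q3": 0.72}

report = compare_runs(rag_v1, rag_v2)
print(report)
```

The tolerance keeps small judge-score jitter from being reported as a regression; LLM-judge scores are rarely stable to two decimal places between runs.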

Pricing notes

Opik OSS is free under Apache 2.0. Comet Cloud has a hosted tier with paid plans; pricing details vary by seat and trace volume. Confirm current rates with the vendor. Teams that already license Comet for experiment tracking often find Opik bundled at favorable rates; teams that buy Opik standalone should price-check against Langfuse Pro and the Phoenix hosted tier before committing. The Apache 2.0 license is the strong tiebreaker when legal review is on the critical path.

Alternatives

Consider Langfuse if MIT plus EE modules are preferred to Apache 2.0. Consider Arize Phoenix for OpenTelemetry-native instrumentation. Consider LangSmith or Braintrust for hosted-only alternatives. For pure CI regression testing, consider DeepEval or Promptfoo.

Citations and last verified

Verified 2026-05-29 against the Opik GitHub repository.

¹ Comet Opik, https://github.com/comet-ml/opik.
² Chip Huyen, "900 Most Popular Open Source AI Tools."