TruLens

OpenTelemetry-emitting feedback functions for RAG and agent evals, stewarded by Snowflake.

Positioning

TruLens is an open-source evaluation framework whose central abstraction is the feedback function: a callable that takes a trace (or part of one) and returns a score. The library ships feedback functions for the standard RAG metrics (groundedness, context relevance, answer relevance), plus general-purpose LLM-judge feedback functions for tone, conciseness, and custom rubrics. The project is now stewarded by Snowflake but remains community-driven; v2.8.1 shipped May 14, 2026 ¹.
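To make the abstraction concrete, here is a minimal sketch of a custom feedback function in plain Python. The function name `conciseness` and its word-budget heuristic are hypothetical and not part of TruLens; the only assumption about the library is that a feedback function is a callable returning a float score, conventionally between 0 and 1.

```python
def conciseness(response: str, max_words: int = 50) -> float:
    """Hypothetical feedback function.

    Returns 1.0 when the response is at or under the word budget and
    decays linearly to 0.0 at twice the budget.
    """
    n_words = len(response.split())
    if n_words <= max_words:
        return 1.0
    overshoot = (n_words - max_words) / max_words
    return max(0.0, 1.0 - overshoot)


short_answer = "Yes, the lentil bowl is vegan."
long_answer = " ".join(["word"] * 75)  # 75 words, 50% over the budget

print(conciseness(short_answer))
print(conciseness(long_answer))
```

A callable of this shape can be wrapped as `Feedback(conciseness).on_output()` so that it scores the application's output, in the same way the built-in provider methods are wrapped.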

TruLens emits OpenTelemetry spans, which makes it portable: the same instrumented application can fan out to Phoenix, Langfuse, or any other OTel backend, with TruLens scoring the spans.

Strengths

Composable feedback functions. The library treats evaluators as first-class objects you compose, so the same groundedness function can be reused across notebooks, CI, and production. The RAG metrics overlap with Ragas and are derived from the same line of work ².
OTel emission. Spans are OpenTelemetry-shaped. You can pipe TruLens traces into the same OTel backend as the rest of your platform without a parallel format.
Snowflake stewardship. Under Snowflake, the project has a clear long-term owner. The license remains permissive.
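One way the reuse across CI might look is a threshold gate over feedback scores. The sketch below is plain Python with no TruLens dependency; the `gate` helper, the metric names, and the thresholds are all hypothetical, and the score lists stand in for whatever the feedback functions returned on an evaluation set.

```python
from statistics import mean


def gate(scores: dict[str, list[float]], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics whose mean score is below its threshold.

    A metric with no recorded scores counts as a failure.
    """
    failures = []
    for name, minimum in thresholds.items():
        values = scores.get(name, [])
        if not values or mean(values) < minimum:
            failures.append(name)
    return failures


# Hypothetical per-record scores collected from an evaluation run.
scores = {
    "groundedness": [0.9, 0.8, 1.0],
    "context_relevance": [0.4, 0.6, 0.5],
}
thresholds = {"groundedness": 0.8, "context_relevance": 0.7}

failed = gate(scores, thresholds)
print(failed)
```

A CI job could exit non-zero when `failed` is non-empty, which keeps the pass/fail rule in code next to the feedback definitions.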

Limitations

Less polished UI than hosted hubs. The TruLens dashboard exists but is more notebook-companion than production-grade. Most teams pair TruLens with a separate trace hub for visualization.
Documentation lags features. Newer feedback functions and integrations sometimes ship before the docs catch up. Reading the source is occasionally necessary.
Vendor-published benchmarks. Performance claims from TruLens or Snowflake should be treated as vendor positioning.

Best fit

RAG evaluation where you want composable feedback functions in code rather than a hosted metric library, OpenTelemetry-native eval pipelines, and Snowflake-aligned ML stacks that want native RAG eval primitives.

Getting started

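import numpy as np
from trulens.apps.langchain import TruChain
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI

# Assumes `rag_chain` is an existing LangChain RAG chain and OPENAI_API_KEY is set.
session = TruSession()
provider = OpenAI()

# Groundedness: is the answer supported by the retrieved context as a whole?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on_context(collect_list=True)
    .on_output()
)

# Context relevance: score each retrieved chunk against the question, then average.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on_context(collect_list=False)
    .aggregate(np.mean)
)

# TruChain wraps LangChain apps; LlamaIndex and custom apps have their own wrappers.
tru_chain = TruChain(rag_chain, app_name="rag-prod", feedbacks=[f_groundedness, f_context_relevance])

with tru_chain as recording:
    rag_chain.invoke("Is the lentil bowl vegan?")

# get_records_and_feedback returns a records DataFrame and the list of feedback column names.
records_df, feedback_cols = session.get_records_and_feedback(app_ids=[tru_chain.app_id])
print(records_df[feedback_cols].head())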

The TruLens dashboard renders per-record traces and feedback scores; OTel spans are also available for export.

Pricing notes

TruLens is free under the MIT license. Feedback functions that use an LLM judge incur LLM-call costs, which are paid to the underlying provider. Snowflake-aligned teams that run TruLens against Cortex or Arctic models can keep judge inference on their existing data warehouse compute, which often makes the marginal cost of an additional feedback function negligible compared to running the same loop through an external LLM API. Cache feedback results keyed by trace ID so that repeated runs over the same traces do not re-invoke the judge.
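The caching advice can be sketched as a small in-memory wrapper. Everything here is hypothetical and independent of TruLens: `FeedbackCache`, the `(trace_id, metric)` key, and `fake_judge`, which stands in for a paid LLM-judge call. A production version would likely persist the cache and include the judge model or prompt version in the key.

```python
from typing import Callable


class FeedbackCache:
    """Caches scores by (trace_id, metric) so the judge runs once per pair."""

    def __init__(self, score_fn: Callable[[str], float]) -> None:
        self._score_fn = score_fn
        self._scores: dict[tuple[str, str], float] = {}
        self.judge_calls = 0  # counts actual (uncached) judge invocations

    def score(self, trace_id: str, metric: str, text: str) -> float:
        key = (trace_id, metric)
        if key not in self._scores:
            self.judge_calls += 1
            self._scores[key] = self._score_fn(text)
        return self._scores[key]


def fake_judge(text: str) -> float:
    """Stand-in for an LLM judge; not a real scoring rule."""
    return 1.0 if "vegan" in text else 0.0


cache = FeedbackCache(fake_judge)
first = cache.score("trace-001", "groundedness", "The lentil bowl is vegan.")
second = cache.score("trace-001", "groundedness", "The lentil bowl is vegan.")
print(first, second, cache.judge_calls)
```

The second call returns the stored score without invoking the judge, so re-running an evaluation over unchanged traces adds no provider cost.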

Alternatives

Consider Ragas for a DataFrame-native RAG metric library, DeepEval for pytest-style ergonomics, and Arize Phoenix for a stronger UI with OTel traces. For hosted dashboards, consider Langfuse or Braintrust.

Citations and last verified

Verified 2026-05-29 against the TruLens documentation.

¹ TruLens Docs, https://www.trulens.org/.
² Es et al., "RAGAS: Automated Evaluation of Retrieval Augmented Generation," arXiv:2309.15217.