Galileo

Eval-to-guardrail platform: the Luna distilled-judge models score pre-production evals and are then redeployed as production guardrails.

Positioning

Galileo sells an eval-to-guardrail story: the same evaluator that scores your offline test set in development can be redeployed inline as a production guardrail. The differentiator is the Luna family of distilled judges, small purpose-trained models that approximate a frontier LLM judge at a fraction of the latency and cost. Distilled judges sit in a research lineage that goes back to Prometheus, Auto-J, JudgeLM, and follow-ups ¹.

The product covers traces, datasets, evaluators (including Luna), an Insights engine for clustering failures, and inline guardrail deployment. Hosted SaaS, VPC, and on-prem are all options. The product is not open source.

Strengths

Luna distilled judges. Purpose-trained small models that score quality dimensions (groundedness, relevance, safety) at production latency. Useful when an LLM-judge call per request from a frontier model is too slow or too expensive.
Eval-to-guardrail loop. The same scoring function that grades offline runs can be wired inline as a guardrail in production, which narrows the gap between pre-prod and prod metrics. The classic risk with this design, well documented in independent reviews of LLM-evaluator systems, is that the same biases that fool the judge offline also fool it inline ².
Enterprise deployment options. VPC and on-prem are available for teams with strict data-residency rules.
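
The eval-to-guardrail loop described above can be sketched in a few lines. This is a minimal illustration, not Galileo's implementation: score_groundedness is a hypothetical stand-in (plain token overlap) for a distilled judge such as Luna, and offline_eval, guardrail, the fallback text, and the 0.7 threshold are all assumptions chosen for the example. The only point is that one scoring function feeds both code paths.

```python
import re
from typing import Callable

# A scorer takes (answer, context) and returns a score in [0, 1].
Scorer = Callable[[str, str], float]


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_groundedness(answer: str, context: str) -> float:
    """Hypothetical stand-in for a distilled judge:
    the fraction of answer tokens that also appear in the context."""
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(context)) / len(answer_tokens)


def offline_eval(rows: list[dict], scorer: Scorer) -> float:
    """Pre-production path: mean score over an offline test set."""
    if not rows:
        raise ValueError("offline test set is empty")
    scores = [scorer(row["answer"], row["context"]) for row in rows]
    return sum(scores) / len(scores)


def guardrail(
    answer: str,
    context: str,
    scorer: Scorer,
    threshold: float = 0.7,  # assumed cutoff, not a recommended value
    fallback: str = "Sorry, I can't answer that reliably.",
) -> str:
    """Production path: the same scorer gates a single response inline."""
    return answer if scorer(answer, context) >= threshold else fallback
```

Because both paths take the scorer as an argument, swapping the stand-in for a real judge changes the offline number and the inline behavior together. It also means any bias in the scorer shows up in both places.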

Limitations

No transparent pricing. Pricing is enterprise-only with no public rate card. Procurement is sales-led, so expect long cycles.
Closed source. Luna is a proprietary model family; you cannot inspect or fine-tune the underlying judges. The eval product is similarly closed.
Vendor-published benchmarks. Galileo publishes Luna comparison numbers against frontier judges; because these come from the vendor, treat them as positioning rather than independent evidence.
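
One way to act on the vendor-benchmark caveat is to measure a judge's agreement with your own human labels before relying on it. The sketch below computes raw agreement and Cohen's kappa for categorical verdicts (for example 1 = pass, 0 = fail). The function names are illustrative assumptions, and nothing here is Galileo output or a Galileo API.

```python
def raw_agreement(judge: list[int], human: list[int]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Chance-corrected agreement between two raters."""
    p_observed = raw_agreement(judge, human)
    n = len(judge)
    labels = set(judge) | set(human)
    # Agreement expected if both raters labeled at random
    # with their own observed label frequencies.
    p_expected = sum(
        (judge.count(label) / n) * (human.count(label) / n) for label in labels
    )
    if p_expected == 1.0:
        # Degenerate case: both raters used one identical label throughout,
        # so kappa is undefined. This sketch returns 1.0 by convention.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)
```

Kappa is worth reporting alongside raw agreement because a judge that passes nearly everything can agree with humans often on an imbalanced set while adding little signal.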

Best fit

Enterprise teams that already need a production guardrail layer and want to avoid maintaining two parallel scoring pipelines (one for eval, one for guardrails). Workloads where a distilled judge with sub-second latency is preferable to a frontier-model LLM-judge call per request. Regulated industries that need VPC or on-prem deployment.

Getting started

from galileo_observe import GalileoObserveCallback
from galileo_protect import ProtectClient

# Illustrative usage; confirm class names and arguments against the
# SDK version you install.

# Pre-production: score offline eval runs with Luna.
observer = GalileoObserveCallback(
    project_name="rag-prod",
    metrics=["luna_groundedness", "luna_relevance"],
)

# Production: redeploy the same metric as an inline guardrail.
protect = ProtectClient(project_name="rag-prod")


def guarded_answer(question, answer, context):
    """Return the answer, or a fallback if the guardrail blocks it."""
    result = protect.invoke(
        payload={"input": question, "output": answer, "context": context},
        metrics=["luna_groundedness"],
        threshold=0.7,
    )
    if result.action == "block":
        return safe_fallback()  # safe_fallback is application-defined
    return answer

The Galileo console renders the offline experiment and the inline guardrail trips against the same metric definitions.
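
The 0.7 threshold in the snippet above is an example value; in practice a cutoff is usually picked from labeled offline scores. A minimal sketch of that sweep follows, assuming you have judge scores paired with human "this response should have been blocked" labels and that a response is blocked when its score falls below the threshold. ThresholdStats, sweep_thresholds, and the blocking rule are hypothetical names and assumptions, not part of any Galileo SDK.

```python
from dataclasses import dataclass


@dataclass
class ThresholdStats:
    threshold: float
    precision: float  # of blocked responses, the fraction that were truly bad
    recall: float     # of truly bad responses, the fraction that were blocked


def sweep_thresholds(
    scores: list[float],
    is_bad: list[bool],
    thresholds: list[float],
) -> list[ThresholdStats]:
    """Compute block precision and recall for each candidate threshold."""
    if len(scores) != len(is_bad):
        raise ValueError("scores and is_bad must be the same length")
    results = []
    for t in thresholds:
        # Assumed rule: block when the judge score is below the threshold.
        blocked = [s < t for s in scores]
        tp = sum(b and bad for b, bad in zip(blocked, is_bad))
        fp = sum(b and not bad for b, bad in zip(blocked, is_bad))
        fn = sum((not b) and bad for b, bad in zip(blocked, is_bad))
        # With nothing blocked (or nothing bad), report 1.0 rather than divide by zero.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 1.0
        results.append(ThresholdStats(t, precision, recall))
    return results
```

Raising the threshold generally trades precision for recall: more bad responses are caught, and more good ones are blocked with them. Which side to favor depends on how costly a false block is for the workload.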

Pricing notes

Galileo does not publish a rate card; contact the vendor for a quote. VPC and on-prem deployment are part of the enterprise SKU.

Alternatives

Consider Promptfoo or DeepEval for OSS CI evals if you do not need an inline guardrail. Consider Langfuse or Arize Phoenix for OSS observability where the guardrail is a separate component. Consider Braintrust or LangSmith for hosted eval platforms without a first-party distilled-judge model family.

Citations and last verified

Verified 2026-05-29 against the Galileo product pages.

¹ Kim et al., "Prometheus: Inducing Fine-grained Evaluation Capability," arXiv:2310.08491.
² Eugene Yan, "LLM Evaluators."