Positioning
Promptfoo is a CLI-first eval and red-team tool that ships YAML test suites runnable against any provider (OpenAI, Anthropic, Google, Bedrock, Azure, local). The product covers offline evals (model-graded, deterministic, classifier-based, embedding similarity, structured-output assertions) and a separate red-team mode that generates adversarial inputs across OWASP-style categories. It was acquired by OpenAI in 2025 with the MIT license preserved; the company reports adoption at 156 of the Fortune 500 [1].
The workflow that works best is to keep the YAML suites in your repo, run Promptfoo in CI, and gate PRs on score deltas.
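A minimal sketch of the CI half of that workflow, assuming GitHub Actions, a suite at promptfooconfig.yaml in the repo root, and API keys stored as repository secrets. The workflow name, path filters, and secret names are illustrative assumptions. As written, it fails the PR check whenever an assertion fails; gating on score deltas against a baseline run would need an extra comparison step that is not shown here.

```yaml
# .github/workflows/prompt-evals.yml (hypothetical file name)
name: prompt-evals
on:
  pull_request:
    paths:
      - "prompts/**"              # assumed location of prompt files
      - "promptfooconfig.yaml"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run Promptfoo suite
        env:
          # Secret names are assumptions; use whatever your repo defines.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        # A non-zero exit code from the CLI fails this step and the PR check.
        run: npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
```

Writing results.json as an artifact keeps a per-PR record that a later step could compare against the main branch.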
Strengths
- CLI-first ergonomics. `promptfoo eval` runs a YAML suite, prints a table, and returns a non-zero exit code when tests fail. The CI integration can be as simple as a single GitHub Actions step.
- Provider-agnostic. OpenAI, Anthropic, Google, Bedrock, Azure, local Llama via Ollama, custom HTTP endpoints, and more all plug in via a small provider config.
- First-class red-team mode. `promptfoo redteam` generates adversarial prompts and grades model responses against safety categories. The red-team suite covers prompt injection, jailbreaks, harmful content, and PII leakage [2].
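To make the provider-agnostic point concrete, here is a sketch of a providers block that mixes a hosted model, a local model served through Ollama, and a custom HTTP endpoint. The endpoint URL, the label, the model names, and the shape of the request and response bodies are assumptions for illustration; check the field names against the current provider docs before relying on them.

```yaml
providers:
  # Hosted model, addressed by provider:model id
  - openai:gpt-4o-mini
  # Local model served by Ollama (model name is an assumption)
  - ollama:chat:llama3
  # Custom HTTP endpoint; URL and payload shape are hypothetical
  - id: https
    label: internal-model
    config:
      url: https://models.example.com/v1/generate
      method: POST
      headers:
        Content-Type: application/json
      body:
        prompt: "{{prompt}}"
      # Pull the completion text out of the JSON response
      transformResponse: json.output
```

Every test in the suite then runs against all three entries, which is what produces the side-by-side table.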
Limitations
- YAML-heavy. Complex evals with shared fixtures and multi-step pipelines can grow YAML files past the ergonomic point. A pytest-style framework (DeepEval) handles that case more cleanly.
- Limited observability. Promptfoo focuses on offline eval and red-team; it is not a tracing or production-observability platform. Pair with Langfuse, Phoenix, or LangSmith for that.
- OpenAI ownership. The MIT license is preserved, but the project's strategic direction is now under OpenAI. Treat that as a useful signal rather than a problem; reasonable people disagree.
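The YAML-growth limitation can be softened somewhat before reaching for another framework. The sketch below assumes shared assertions are placed under defaultTest and test cases are split into separate files; the file paths are hypothetical. It does not address multi-step pipelines, which remain awkward in YAML.

```yaml
# promptfooconfig.yaml: shared assertions plus test cases split across files
prompts:
  - file://prompts/support.txt     # hypothetical path
providers:
  - openai:gpt-4o-mini
defaultTest:
  # Applied to every test case, so individual files stay short
  assert:
    - type: llm-rubric
      value: "Answer is concise and accurate."
tests:
  - file://tests/refunds.yaml      # hypothetical path
  - file://tests/menu.yaml         # hypothetical path
```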
Best fit
CI regression suites for prompt and model changes, structured red-team programs that run nightly, side-by-side prompt comparison across providers, and teams that prefer keeping eval logic in repo-versioned YAML rather than a vendor UI.
Getting started
    # promptfooconfig.yaml
    prompts:
      - "You are a helpful assistant. {{question}}"
    providers:
      - openai:gpt-4o-mini
      - anthropic:claude-3-5-haiku-latest
    tests:
      - vars:
          question: "Is the lentil bowl vegan?"
        assert:
          # icontains is case-insensitive, so "Yes" also passes
          - type: icontains
            value: "yes"
          # Model-graded check against a free-text rubric
          - type: llm-rubric
            value: "Answer is concise and accurate."
      - vars:
          question: "Refund policy?"
        assert:
          # Embedding similarity against a reference answer
          - type: similar
            value: "30-day refund window"
            threshold: 0.7
Run the suite with `promptfoo eval`. The CLI prints a pass/fail table per provider; pass `-o results.json` to also write a JSON report. Use `promptfoo redteam` for the adversarial mode.
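For the adversarial mode, a red-team section in the same config file might look like the sketch below. The target label, purpose text, and test count are illustrative assumptions, and the plugin and strategy identifiers should be checked against the current red-team docs, since the available set changes between releases.

```yaml
# promptfooconfig.yaml (red-team section); values are illustrative
targets:
  - id: openai:gpt-4o-mini
    label: support-bot             # hypothetical label
redteam:
  # Describes the application so generated attacks are relevant to it
  purpose: "Customer-support assistant for a restaurant chain"
  numTests: 5                      # generated test cases per plugin
  plugins:
    - harmful                      # harmful-content categories
    - pii                          # PII leakage
  strategies:
    - jailbreak
    - prompt-injection
```

With a section like this in place, `promptfoo redteam run` generates the adversarial cases and evaluates the target against them.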
Pricing notes
Promptfoo OSS is free under MIT. Promptfoo Cloud has paid plans for team workflows, cloud-hosted history, and SSO; pricing is sales-led for enterprise tiers. Confirm current rates with the vendor. The OSS CLI is genuinely standalone, so teams that want a CI regression suite and a nightly red-team without a hosted UI pay nothing for the framework itself; provider API costs remain the only real bill.
Alternatives
Consider DeepEval for pytest-style ergonomics. Consider Inspect AI for sandboxed and agentic evals. Consider Giskard for AI-specific red-team and SOC 2-compatible reporting. For hosted dashboards, pair with Langfuse, Arize Phoenix, or Braintrust.
Citations and last verified
Verified 2026-05-29 against the Promptfoo documentation.