Promptfoo

MIT-licensed CLI and library for eval and red-team; acquired by OpenAI with the OSS license preserved.

Positioning

Promptfoo is a CLI-first eval and red-team tool that runs YAML-defined test suites against any provider (OpenAI, Anthropic, Google, Bedrock, Azure, local). The product covers offline evals (model-graded, deterministic, classifier-based, embedding similarity, structured-output assertions) and a separate red-team mode that generates adversarial inputs across OWASP-style categories. It was acquired by OpenAI in 2025 with the MIT license preserved; the company reports adoption at 156 of the Fortune 500 ¹.

The workflow that works best is to keep the YAML suites in your repo, run Promptfoo in CI, and gate PRs on score deltas.
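A minimal sketch of that workflow as a GitHub Actions job, assuming the suite lives at promptfooconfig.yaml in the repo root and an OPENAI_API_KEY secret is configured. The workflow path, job name, trigger paths, and secret name are illustrative assumptions, not something Promptfoo prescribes. This version gates on failing assertions only; comparing scores against a stored baseline would need an extra script step that is not shown.

```yaml
# .github/workflows/promptfoo-eval.yml (hypothetical path and names)
name: prompt-eval
on:
  pull_request:
    paths:
      - "prompts/**"              # assumed location of prompt files
      - "promptfooconfig.yaml"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      # Runs the suite; the step (and the PR check) fails if promptfoo exits non-zero.
      - name: Run Promptfoo suite
        run: npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Keep the JSON report even on failure so reviewers can inspect it.
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: promptfoo-results
          path: results.json
```

Pinning a specific promptfoo version instead of @latest is probably the safer choice for a gating check, since a tool upgrade could otherwise change results between runs.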

Strengths

CLI-first ergonomics. promptfoo eval runs a YAML suite, prints a results table, and exits with a non-zero code when any test fails, so a regression breaks the build. CI integration can be a single GitHub Actions step.
Provider-agnostic. OpenAI, Anthropic, Google, Bedrock, Azure, local Llama via Ollama, custom HTTP endpoints, and more all plug in via a small provider config.
First-class red-team mode. promptfoo redteam generates adversarial prompts and grades model responses against safety categories. The red-team suite covers prompt injection, jailbreaks, harmful content, and PII leakage ².
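To make the provider-agnostic point concrete, here is a sketch of a providers block mixing a hosted model, a local Ollama model, and a custom HTTP endpoint. The Ollama model name, the internal URL, and the response field are placeholder assumptions; check the provider documentation for the exact options your version supports.

```yaml
providers:
  # Hosted model with an explicit sampling setting
  - id: openai:gpt-4o-mini
    config:
      temperature: 0

  # Local model served by Ollama (model name is a placeholder)
  - id: ollama:chat:llama3.1

  # Custom HTTP endpoint; URL and response shape are hypothetical
  - id: https
    config:
      url: https://llm.internal.example.com/v1/generate
      method: POST
      headers:
        Content-Type: application/json
      body:
        prompt: "{{prompt}}"
      # Expression that pulls the completion text out of the JSON response
      transformResponse: json.text
```

Every provider listed here is run against the same prompts and tests, which is what makes side-by-side comparison cheap.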

Limitations

YAML-heavy. Complex evals with shared fixtures and multi-step pipelines can grow YAML files past the ergonomic point. A pytest-style framework (DeepEval) handles that case more cleanly.
Limited observability. Promptfoo focuses on offline eval and red-team; it is not a tracing or production-observability platform. Pair with Langfuse, Phoenix, or LangSmith for that.
OpenAI ownership. The MIT license is preserved, but the project's strategic direction is now set by OpenAI. Treat that as a signal to weigh if vendor neutrality matters to you, not as a blocker in itself; reasonable people disagree.

Best fit

CI regression suites for prompt and model changes, structured red-team programs that run nightly, side-by-side prompt comparison across providers, and teams that prefer keeping eval logic in repo-versioned YAML rather than a vendor UI.

Getting started

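# promptfooconfig.yaml
prompts:
  - "You are a helpful assistant. {{question}}"

# Every prompt/test pair runs against each provider listed here.
providers:
  - openai:gpt-4o-mini
  - anthropic:claude-3-5-haiku-latest

tests:
  - vars:
      question: "Is the lentil bowl vegan?"
    assert:
      # Case-insensitive substring match, so "Yes" and "yes" both pass.
      - type: icontains
        value: "yes"
      # Model-graded check against a free-text rubric.
      - type: llm-rubric
        value: "Answer is concise and accurate."

  - vars:
      question: "Refund policy?"
    assert:
      # Embedding similarity between the output and the reference text.
      - type: similar
        value: "30-day refund window"
        threshold: 0.7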

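Save this as promptfooconfig.yaml and run promptfoo eval. The CLI prints a pass/fail table per provider; add -o results.json to also write a JSON report. Use the promptfoo redteam subcommands (for example, promptfoo redteam run) for the adversarial mode.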
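For the adversarial mode, a minimal sketch of a redteam section follows. It assumes the plugin and strategy identifiers below match the ones in the current red-team documentation ², so verify them before relying on this. The target label and purpose text are made up for illustration.

```yaml
# Red-team settings, typically kept in the same promptfooconfig.yaml
targets:
  - id: openai:gpt-4o-mini
    label: support-bot            # hypothetical label

redteam:
  # Plain-language description of the application; used to tailor the attacks
  purpose: "Customer-support assistant for a restaurant chain"
  numTests: 5                     # generated cases per plugin
  plugins:
    - harmful                     # harmful-content categories
    - pii                         # PII leakage
  strategies:
    - jailbreak
    - prompt-injection
```

With a config along these lines, promptfoo redteam run should generate the adversarial cases and grade the target's responses in one pass, which is the shape a nightly job would take.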

Pricing notes

Promptfoo OSS is free under MIT. Promptfoo Cloud has paid plans for team workflows, cloud-hosted history, and SSO; pricing is sales-led for enterprise tiers. Confirm current rates with the vendor. The OSS CLI is genuinely standalone, so teams that want a CI regression suite and a nightly red-team without a hosted UI pay nothing for the framework itself; provider API costs remain the only real bill.

Alternatives

Consider DeepEval for pytest-style ergonomics. Consider Inspect AI for sandboxed and agentic evals. Consider Giskard for AI-specific red-team and SOC 2-compatible reporting. For hosted dashboards, pair with Langfuse, Arize Phoenix, or Braintrust.

Citations and last verified

Verified 2026-05-29 against the Promptfoo documentation.

¹ Promptfoo Docs, https://www.promptfoo.dev/.
² Promptfoo Red-team, https://www.promptfoo.dev/docs/red-team/.