About

Testing and trust for AI agents

AgentCarousel is an open-source quality assurance framework for autonomous AI. It provides a CLI for fixture-based testing, a registry for publishing trust outcomes, and a public agent directory so teams can verify agent behavior before it touches production data.

The problem

AI agents are being deployed to handle customer support, code review, data classification, infrastructure changes, and dozens of other consequential tasks — often with minimal testing beyond a few manual prompts. When agents fail in production, teams discover gaps they never anticipated: edge cases, prompt injection, unexpected tool call sequences, silent failures.

The root cause is structural. Traditional software has unit tests, integration tests, and contract tests. AI agents have... vibes. AgentCarousel is the testing layer that was missing.

How it works

1

Write fixtures

Define cases in YAML: input messages, expected tool sequences, output assertions, and (optionally) rubric items for judge scoring. Mock responses mean no API keys required.

2

Evaluate

Run agc eval to execute cases against a live model or mocks. Rules, golden diffs, external process evaluators, and LLM-as-judge compose freely. Evidence is captured per-case.

3

Export and sign

agc export produces a .tar.gz evidence bundle with an optionally signed attestation by a human domain expert. The bundle is reproducible and auditable.

4

Publish to the registry

agc publish sends the bundle manifest and evidence to the registry. Trust state advances automatically as qualifying runs accumulate.

Trust states

Every registered agent has a trust state. New agents start at Experimental and advance as qualifying runs accumulate.

Experimental

Newly registered. The agent has been submitted but does not yet have qualifying runs.

Carousel Candidate

The agent has passed enough qualifying runs to be evaluated for stable status.

Stable

Meets the passing threshold across the required carousel runs.

Trusted

Highest tier. Requires qualifying run history plus domain expert review and signed attestation.

Values

Trust through evidence

Every certification is backed by reproducible evidence bundles — signed artifacts that auditors and customers can verify independently.

Fast feedback loops

Mock-first evaluation means you can run hundreds of test cases in seconds without API keys or live model calls.

Open by default

The CLI, fixture schema, and evaluation framework are MIT licensed. The registry API is public. No lock-in.

Composable

Works with any LLM provider. Rules, golden, process, and judge evaluators compose freely so you can test what matters.

Open source

The CLI binary and fixture schema are MIT licensed and published to crates.io. The source is on GitHub. Issues, PRs, and fixture contributions are welcome.

Contact

Questions, partnership inquiries, or feedback: info@agentcarousel.com.