About
Testing and trust for AI agents
AgentCarousel is an open-source quality assurance framework for autonomous AI. It provides a CLI for fixture-based testing, a registry for publishing trust outcomes, and a public agent directory so teams can verify agent behavior before it touches production data.
The problem
AI agents are being deployed to handle customer support, code review, data classification, infrastructure changes, and dozens of other consequential tasks — often with minimal testing beyond a few manual prompts. When agents fail in production, teams discover gaps they never anticipated: edge cases, prompt injection, unexpected tool call sequences, silent failures.
The root cause is structural. Traditional software has unit tests, integration tests, and contract tests. AI agents have... vibes. AgentCarousel is the testing layer that was missing.
How it works
Write fixtures
Define cases in YAML: input messages, expected tool sequences, output assertions, and (optionally) rubric items for judge scoring. Because responses can be mocked, no API keys are required.
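To make the shape of a fixture concrete, here is a minimal Python sketch of the idea rather than the real YAML schema. The field names (`input_messages`, `expected_tools`, `output_contains`), the `MockResponse` type, and the `check_case` helper are all hypothetical and chosen only for illustration. It shows a case being checked against a canned response, which is why no API key is needed.

```python
from dataclasses import dataclass, field


@dataclass
class Fixture:
    """One test case. Field names are illustrative, not the real fixture schema."""
    name: str
    input_messages: list[str]
    expected_tools: list[str]
    output_contains: list[str] = field(default_factory=list)


@dataclass
class MockResponse:
    """A canned agent response, so no live model call is made."""
    tool_calls: list[str]
    output: str


def check_case(fixture: Fixture, response: MockResponse) -> list[str]:
    """Return failure messages; an empty list means the case passed."""
    failures = []
    # The tool sequence must match exactly, including order.
    if response.tool_calls != fixture.expected_tools:
        failures.append(
            f"tool sequence {response.tool_calls} != expected {fixture.expected_tools}"
        )
    # Each output assertion is a simple substring check in this sketch.
    for needle in fixture.output_contains:
        if needle not in response.output:
            failures.append(f"output missing {needle!r}")
    return failures


refund_case = Fixture(
    name="refund-request",
    input_messages=["I want a refund for order 1234"],
    expected_tools=["lookup_order", "issue_refund"],
    output_contains=["refund"],
)
good = MockResponse(["lookup_order", "issue_refund"], "Your refund is on its way.")
bad = MockResponse(["issue_refund"], "Done.")
```

Calling `check_case(refund_case, good)` returns an empty list, while the `bad` response fails both the tool-sequence check and the output assertion.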
Evaluate
Run agc eval to execute cases against a live model or mocks. Rules, golden diffs, external process evaluators, and LLM-as-judge scoring compose freely. Evidence is captured per case.
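One way to picture evaluators composing is as plain functions that each return a pass/fail flag plus evidence. The sketch below assumes that model; it is not how agc eval is implemented. The names `Evaluator`, `rule_max_length`, `golden_diff`, `judge`, and `evaluate` are hypothetical, and the judge takes a caller-supplied scoring function in place of a real LLM call.

```python
import difflib
from typing import Callable

# An evaluator takes the agent output and returns (passed, evidence_text).
Evaluator = Callable[[str], tuple[bool, str]]


def rule_max_length(limit: int) -> Evaluator:
    """Rule evaluator: the output must not exceed `limit` characters."""
    def run(output: str) -> tuple[bool, str]:
        return len(output) <= limit, f"length={len(output)} limit={limit}"
    return run


def golden_diff(golden: str) -> Evaluator:
    """Golden evaluator: the output must match the stored golden text line by line."""
    def run(output: str) -> tuple[bool, str]:
        diff = list(difflib.unified_diff(
            golden.splitlines(), output.splitlines(),
            "golden", "actual", lineterm=""))
        return not diff, "\n".join(diff) or "matches golden"
    return run


def judge(score_fn: Callable[[str], float], threshold: float) -> Evaluator:
    """Judge evaluator: `score_fn` stands in for an LLM-as-judge call."""
    def run(output: str) -> tuple[bool, str]:
        score = score_fn(output)
        return score >= threshold, f"score={score:.2f} threshold={threshold}"
    return run


def evaluate(output: str, evaluators: dict[str, Evaluator]) -> dict:
    """Run every evaluator and keep its evidence; the case passes only if all pass."""
    evidence = {name: ev(output) for name, ev in evaluators.items()}
    return {"passed": all(ok for ok, _ in evidence.values()), "evidence": evidence}
```

Because each evaluator has the same signature, a case can mix any subset of them, and the returned `evidence` dictionary keeps one record per evaluator for that case.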
Export and sign
agc export produces a .tar.gz evidence bundle, optionally accompanied by an attestation signed by a human domain expert. The bundle is reproducible and auditable.
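Reproducibility here can be read as "the same evidence always yields byte-identical bundles". The following is a minimal sketch of that property using only the Python standard library; it does not describe the actual bundle layout produced by agc export. The `build_bundle` function and the `manifest.json` name are assumptions, and signing is left out: an attestation would typically sign the bundle digest with the expert's key.

```python
import gzip
import hashlib
import io
import json
import tarfile


def build_bundle(files: dict[str, bytes]) -> bytes:
    """Pack files plus a SHA-256 manifest into a deterministic .tar.gz."""
    # The manifest lets an auditor verify each file independently.
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    entries = dict(files)
    entries["manifest.json"] = json.dumps(manifest, sort_keys=True).encode()

    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        # Sorted names and a fixed mtime remove the usual sources of nondeterminism.
        for name in sorted(entries):
            info = tarfile.TarInfo(name)
            info.size = len(entries[name])
            info.mtime = 0
            tar.addfile(info, io.BytesIO(entries[name]))
    # mtime=0 keeps the gzip header free of the current timestamp.
    return gzip.compress(tar_buf.getvalue(), mtime=0)


def bundle_digest(bundle: bytes) -> str:
    """Digest that a signer could attest to, and a verifier could recompute."""
    return hashlib.sha256(bundle).hexdigest()
```

Building the bundle twice from the same inputs, in any insertion order, gives the same bytes and therefore the same digest, which is what makes independent verification possible.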
Publish to the registry
agc publish sends the bundle manifest and evidence to the registry. Trust state advances automatically as qualifying runs accumulate.
Trust states
Every registered agent has a trust state. New agents start at Experimental and advance as qualifying runs accumulate. The states, from lowest to highest tier:
Newly registered. The agent has been submitted but does not yet have qualifying runs.
The agent has passed enough qualifying runs to be evaluated for stable status.
Meets the passing threshold across the required carousel runs.
Highest tier. Requires qualifying run history plus domain expert review and signed attestation.
Values
Trust through evidence
Every certification is backed by reproducible evidence bundles — signed artifacts that auditors and customers can verify independently.
Fast feedback loops
Mock-first evaluation means you can run hundreds of test cases in seconds without API keys or live model calls.
Open by default
The CLI, fixture schema, and evaluation framework are MIT licensed. The registry API is public. No lock-in.
Composable
Works with any LLM provider. Rule, golden-diff, process, and judge evaluators compose freely, so each case can combine whichever checks matter for it.
Open source
The CLI binary and fixture schema are MIT licensed and published to crates.io. The source is on GitHub. Issues, PRs, and fixture contributions are welcome.
Contact
Questions, partnership inquiries, or feedback: info@agentcarousel.com.