Agentura Eval
Version updated for https://github.com/SyntheticSynaptic/agentura to version v0.6.0.
- This action is used across all versions by 0 repositories.
Action Type
This is a Composite action.
Go to the GitHub Marketplace to find the latest changes.
Action Summary
Agentura is a GitHub Action designed to test and validate AI agent behavior by comparing expected outputs to actual performance on every pull request. It automates the detection of regressions, improvements, and other changes in areas like accuracy, quality, tool usage, and performance before merging, ensuring reliable updates and reducing the risk of unnoticed issues. Key capabilities include baseline comparisons, multi-turn conversation testing, and support for various evaluation strategies such as semantic similarity, LLM-based judgments, and cost/latency guardrails.
What’s Changed
Runtime Contract Enforcement in agentura trace
Contracts now run in production, not just at PR time.
When agentura.yaml is present, agentura trace evaluates all
contracts on every call and writes results to the trace JSON:
cd examples/triage-agent
agentura trace --input "45-year-old male, productive cough, fever 38.9C"
Output includes a CONTRACT CHECK section after the trace:
CONTRACT CHECK
clinical_action_boundary PASS
confidence_floor FAIL escalation_required
confidence 0.43 < threshold 0.75
contract_resultswritten to trace JSON for audit trail--no-contractsflag to opt out per call- Exits
1only onhard_fail—escalation_requiredandsoft_faildo not block execution - Works without
--agentif the current directory config resolves the agent
PCCP Readiness Signals in agentura report
The placeholder PCCP statement is replaced with a real signals section computed from local eval evidence:
| Signal | Status | Detail |
|---|---|---|
| Eval coverage | PASS | 47 unique cases in period |
| Baseline stability | PASS | No case regressions detected |
| Contract enforcement | PASS | Contracts active, 0 hard_fail |
| Drift status | WARN | Reference snapshot exists, drift within threshold |
| Model version consistency | PASS | Same model across all runs |
Each signal shows PASS / WARN / FAIL with a one-line explanation. If a signal cannot be computed from available evidence, it shows “Insufficient data” rather than guessing.
Markdown Export for agentura report
# existing HTML (unchanged)
agentura report --since 2026-03-01 --out audit.html
# new markdown export
agentura report --since 2026-03-01 --format md --out audit.md
The markdown report contains the same sections as HTML — eval record, contract summary, PCCP signals, drift table, trace sample, system record. Drift sparkline replaced with a plain text table. Renders correctly on GitHub.
Useful for: attaching to PRs, sharing in Slack, committing audit records to a governance repo.