Agentura Eval
Version updated for https://github.com/SyntheticSynaptic/agentura to version v0.4.0.
- This action is used across all versions by 0 repositories.
Action Type
This is a Composite action.
Go to the GitHub Marketplace to find the latest changes.
Action Summary
Agentura is a GitHub Action designed to automate testing of AI agents by comparing their behavior against predefined baselines. It ensures that changes to the agent—such as updates to prompts, tools, or model providers—do not introduce regressions or degrade performance. By running tests on every pull request, it identifies issues like accuracy drops, behavioral inconsistencies, or cost inefficiencies before merging, providing actionable feedback directly in your workflow.
What’s Changed
Behavioral Contracts
The biggest addition in this release. You can now define what your agent is allowed to do and gate every PR on it.
Add a contracts block to agentura.yaml:
contracts:
- name: clinical_action_boundary
applies_to: [triage_suite]
failure_mode: hard_fail
assertions:
- type: allowed_values
field: output.action
values: [observe, refer, escalate, order_test]
message: "Agent recommended an action outside approved scope"
- type: forbidden_tools
tools: [prescribe_medication, modify_ehr_record]
- type: required_fields
fields: [output.action, output.rationale, output.confidence]
- name: confidence_floor
applies_to: [triage_suite]
failure_mode: escalation_required
assertions:
- type: min_confidence
field: output.confidence
threshold: 0.75
message: "Confidence below threshold — human review required"
Failure modes
| Mode | Behavior |
|---|---|
hard_fail | Blocks merge, exits 1 |
soft_fail | Warning annotation, does not block |
escalation_required | Flags for human review, does not block |
retry | Re-runs up to 3 times before hard failing |
Assertion types (v1)
All four are fully deterministic — no LLM calls, no latency cost:
allowed_values— output field must be in an allowed setforbidden_tools— tool name must not appear in tool_callsrequired_fields— output must contain these keysmin_confidence— numeric field must meet a minimum threshold
Contract results in audit manifest
Every contract evaluation is appended to .agentura/manifest.json with contract name, version, assertion results, observed values, and failure mode. The manifest gives compliance teams a versioned, immutable record of what behavioral boundaries were in force at every eval run.
New demo
examples/triage-agent/ — a clinical triage agent demo with
two contracts. Run it:
cd examples/triage-agent
npx agentura run --local
You’ll see one hard_fail (agent recommended prescribe) and two escalation_required (confidence below threshold).
Provider Expansion
The consensus command now supports all five providers:
# Groq
npx agentura consensus \
--input "your prompt" \
--models "groq:llama-3.3-70b-versatile,groq:llama-3.1-8b-instant"
# Ollama (local, no API key)
npx agentura consensus \
--input "your prompt" \
--models "ollama:llama3.2,ollama:mistral"
# Gemini
npx agentura consensus \
--input "your prompt" \
--models "gemini:gemini-2.0-flash,groq:llama-3.3-70b-versatile"
# Mixed providers for true heterogeneous consensus
npx agentura consensus \
--input "your prompt" \
--models "anthropic:claude-sonnet-4-6,openai:gpt-4o,gemini:gemini-2.0-flash"
Provider support is now consistent across all three eval surfaces:
| Provider | llm_judge | semantic_similarity | consensus |
|---|---|---|---|
| Anthropic | ✅ | ✅ | ✅ |
| OpenAI | ✅ | ✅ | ✅ |
| Gemini | ✅ | ✅ | ✅ |
| Groq | ✅ | ✅ | ✅ |
| Ollama | ✅ | ✅ | ✅ |
Better CLI errors
agentura consensuswith no flags now prompts interactivelyagentura tracewith no flags now prompts for agent path- Missing API keys print an actionable export command instead of a raw error