Agentura Eval
Version updated for https://github.com/SyntheticSynaptic/agentura to version v0.3.0.
- This action is used across all versions by ? repositories.
Action Type
This is a Composite action.
Go to the GitHub Marketplace to find the latest changes.
Action Summary
Agentura is a CI/CD evaluation platform designed to test AI agents, ensuring consistent and accurate performance before deployment to production. It automates the detection of regressions, evaluates agents across various scenarios (e.g., multi-turn conversations, tool usage, semantic accuracy), and generates governance-ready audit reports for compliance. The tool supports multiple frameworks, is open-source, and integrates seamlessly with GitHub for streamlined testing workflows.
Release notes
Agentura v0.3.0 — Agent Governance Runtime
This release ships four new capability layers that transform Agentura from a pre-deployment eval tool into a full agent governance runtime covering the complete AI lifecycle: before you ship, at runtime, after execution, and over time.
What’s new
🔀 Heterogeneous consensus runtime (agentura consensus)
Route any agent input to multiple model families simultaneously and vote on the best response. Disagreement between models is surfaced as an explicit safety flag rather than silently discarded.
The key design decision: consensus only works if the models are genuinely different. Running three instances of the same model catches nothing — they share the same failure modes. Agentura’s consensus runner is designed for heterogeneous model families (Anthropic, OpenAI, Google) with different training distributions and different systematic errors.
consensus:
models:
- provider: anthropic
model: claude-sonnet-4-6
- provider: openai
model: gpt-4o
- provider: google
model: gemini-pro
agreement_threshold: 0.80
on_disagreement: flag
- Majority vote for factual outputs, centroid selection for open-ended
- Configurable agreement threshold with disagreement flagging
- Degraded-mode handling when one provider is unavailable
- Full integration with eval suites via
type: consensusstrategy
🔍 Production trace layer (agentura trace)
Every agent action, tool call, data access, and decision is logged to a structured trace file. Traces are the audit record that connects what your agent was supposed to do with what it actually did.
agentura trace --agent ./agent.ts --input "summarize patient history"
agentura trace diff
- Structured JSON trace schema with tool call records and timing
- PII redaction via
--redactflag for sensitive environments - Eval failures automatically captured as traces for review
trace diffcomputes semantic similarity and tool call delta between any two runs
📊 Frozen reference snapshots and drift detection
(agentura reference)
Pin a reference version of your agent and detect behavioral drift
against it across model upgrades, prompt changes, and policy updates.
agentura reference snapshot --label "v1.0-pre-prompt-change"
agentura reference diff --against v1.0-pre-prompt-change
agentura run --local --drift-check
- Semantic drift score, tool call drift, and latency delta vs reference
- Configurable thresholds that fail CI when drift exceeds bounds
- Drift history table across all reference comparisons
- Manifest integration: drift summary written to every run’s audit record