Agentura Eval

April 2, 2026

Updated https://github.com/SyntheticSynaptic/agentura to v0.5.0.

This action is used across all versions by 0 repositories.

Action Type

This is a Composite action.

Go to the GitHub Marketplace to find the latest changes.

Action Summary

Agentura is a GitHub Action that tests AI agents by comparing their behavior against predefined baselines on each pull request. It automates the detection of regressions, improvements, and deviations in accuracy, tone, tool usage, and performance, so that changes to code, prompts, or models do not degrade the agent’s behavior. Key capabilities include customizable evaluation strategies, multi-turn conversational testing, and GitHub integration for feedback on pull requests.

What’s Changed

Confidence Propagation Across Multi-Turn Evals

Multi-turn evals now track accumulated confidence across conversation turns instead of scoring each turn independently. A failure early in a conversation lowers the confidence score for every subsequent turn, which reflects how agent reliability compounds over a conversation.

This ships as heuristic_v1 and is applied automatically to all multi-turn eval runs. No config changes required.

How it works:

Turn 1: score 0.95 → confidence 0.95
Turn 2: score 0.80 → confidence 0.76 (carries forward)
Turn 3: hard_fail → hard_fail multiplier ×0.50 → confidence 0.38
         → escalation_required fires

When accumulated confidence drops below the contract threshold, escalation_required fires on the run — even if individual turns would have passed in isolation.
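
The numbers in the example above are consistent with a running product: 0.95 × 0.80 = 0.76, and 0.76 × 0.50 = 0.38. The following is a minimal Python sketch of that reading, not the actual heuristic_v1 implementation. The names Turn and propagate and the 0.50 contract threshold are hypothetical; only the 0.50 hard_fail multiplier comes from the example.

```python
from dataclasses import dataclass

# Taken from the worked example above.
HARD_FAIL_MULTIPLIER = 0.50
# Hypothetical value: the release notes do not state the contract threshold.
CONTRACT_THRESHOLD = 0.50


@dataclass
class Turn:
    """One conversation turn (hypothetical shape, for illustration only)."""
    score: float = 0.0
    hard_fail: bool = False


def propagate(turns, threshold=CONTRACT_THRESHOLD):
    """Carry confidence forward as a running product across turns.

    Returns the confidence after each turn (rounded to two decimals for
    display) and whether confidence ever dropped below the threshold.
    """
    confidence = 1.0
    history = []
    escalation_required = False
    for turn in turns:
        # Assumption: a hard_fail turn contributes the fixed multiplier
        # instead of its own score.
        factor = HARD_FAIL_MULTIPLIER if turn.hard_fail else turn.score
        confidence *= factor
        history.append(round(confidence, 2))
        if confidence < threshold:
            escalation_required = True
    return history, escalation_required


history, escalation_required = propagate(
    [Turn(score=0.95), Turn(score=0.80), Turn(hard_fail=True)]
)
print(history, escalation_required)  # [0.95, 0.76, 0.38] True
```

Under the same assumptions, three turns that each score 0.75 accumulate to roughly 0.42 and would trigger escalation at a 0.50 threshold, even though no single turn scored below it.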

Contract Violations in Audit Report

agentura report now includes a Contract Summary section.

Every contract that was active during the eval period appears in the report with:

Contract name and version
Assertion results (pass / fail / escalation_required)
Observed values that triggered violations
Failure mode applied

This closes the loop between agentura run and agentura report — contracts now produce evidence that compliance teams can read, not just CI gates that block merges.

agentura report --since 2026-03-01 --out audit.html

The contract summary section appears after the eval record and before the drift report.
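
To make the four fields listed above concrete, here is a hedged Python sketch of how one Contract Summary entry could be modeled. This is not the report's actual schema: the class names, field names, contract name, and assertion names are all hypothetical, and attaching the failure mode to the contract rather than to each assertion is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class AssertionResult:
    """One assertion outcome (hypothetical shape)."""
    name: str
    outcome: str          # "pass", "fail", or "escalation_required"
    observed: Any = None  # observed value that triggered a violation, if any


@dataclass
class ContractSummaryEntry:
    """One contract as it might appear in the Contract Summary section."""
    name: str
    version: str
    failure_mode: str     # failure mode applied (assumed per contract)
    assertions: List[AssertionResult] = field(default_factory=list)

    def violations(self) -> List[AssertionResult]:
        # Anything other than a pass counts as a violation in this sketch.
        return [a for a in self.assertions if a.outcome != "pass"]


# Illustrative data only; none of these names come from the real demo.
entry = ContractSummaryEntry(
    name="triage-scope",
    version="1",
    failure_mode="hard_fail",
    assertions=[
        AssertionResult("stays_in_scope", "fail", observed="out-of-scope action"),
        AssertionResult("min_confidence", "escalation_required", observed=0.38),
        AssertionResult("responds_politely", "pass"),
    ],
)
print([a.name for a in entry.violations()])
# ['stays_in_scope', 'min_confidence']
```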

Realistic Triage Demo

The triage demo has been updated with natural language inputs that reflect how a real clinical triage agent is actually prompted — not toy fixture strings.

cd examples/triage-agent
npx agentura run --local

The demo fires:

One hard_fail on an out-of-scope action
One escalation_required when accumulated confidence drops below threshold across turns

The demo is the fastest way to understand how contracts and confidence propagation interact.

Provider Support

All five providers are now supported consistently across every eval surface:

Provider	`llm_judge`	`semantic_similarity`	`consensus`	`trace`
Anthropic	✅	✅	✅	✅
OpenAI	✅	✅	✅	✅
Gemini	✅	✅	✅	✅
Groq	✅	✅	✅	✅
Ollama	✅	✅	✅	✅

Full changelog: CHANGELOG.md

Upgrading from 0.4.x: No breaking changes. All existing agentura run, trace, consensus, and report workflows are unchanged. Confidence propagation and contract reporting are additive.