EvalView - AI Agent Testing
Version updated for https://github.com/hidai25/eval-view to version v0.5.0.
- This action is used across all versions by 0 repositories.
Action Type
This is a Composite action.
Go to the GitHub Marketplace to find the latest changes.
Action Summary
EvalView is a GitHub Action designed for regression testing of AI agents by automating the process of detecting behavioral changes and preventing regressions in continuous integration (CI) pipelines. It captures and snapshots the agent’s API behavior, including tool calls, sequences, parameters, outputs, costs, and latencies, creating a golden baseline for comparisons. The action offers multiple layers of evaluation, from free offline checks to advanced semantic and quality-based scoring using AI models, ensuring robust and cost-effective testing.
Release notes
What’s New
Production Monitoring (evalview monitor)
- Continuous regression detection: runs evalview check in a loop with a configurable interval (default: 5 min)
- Slack alerts: webhook notifications on new regressions, recovery notifications when resolved
- Smart dedup: only alerts on NEW failures, no re-alerts on persistent issues
- JSONL history export: --history monitor.jsonl appends cycle data for trend analysis and dashboards
- Graceful shutdown: Ctrl+C stops cleanly with a cost summary
- Config support: CLI flags, config.yaml, or the EVALVIEW_SLACK_WEBHOOK env var
evalview monitor # Check every 5 min
evalview monitor --interval 60 # Every minute
evalview monitor --slack-webhook https://hooks.slack.com/services/...
evalview monitor --history monitor.jsonl # Save trends
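The "smart dedup" and recovery behavior described above can be pictured as a set difference between consecutive monitoring cycles. The following Python sketch is illustrative only and is not EvalView's implementation; the function name diff_failures and the test names are hypothetical.

```python
def diff_failures(previous: set[str], current: set[str]) -> tuple[set[str], set[str]]:
    """Return (new_failures, recovered) between two monitoring cycles."""
    new_failures = current - previous   # failing now, not failing before: alert
    recovered = previous - current      # failing before, passing now: recovery notice
    return new_failures, recovered


# Simulate three cycles; each set holds the names of currently failing tests
# (hypothetical names, for illustration only).
cycles = [{"checkout"}, {"checkout", "search"}, {"search"}]

previous: set[str] = set()
alerts = []
for failing in cycles:
    new_failures, recovered = diff_failures(previous, failing)
    alerts.append((sorted(new_failures), sorted(recovered)))
    previous = failing
```

In this sketch the persistent "checkout" failure triggers an alert only in the first cycle, and a recovery entry appears once it stops failing.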
Community Contributions
- CSV export: evalview check --csv results.csv (@muhammadrashid4587)
- Timeout flag: evalview check --timeout 60 (@zamadye)
- Better errors: human-friendly connection failure messages (@passionworkeer)
- JSONL history: --history flag for monitor (@clawtom)
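A JSONL history file holds one JSON object per line, which makes it easy to append to and to read back for trend analysis. The sketch below shows the general pattern in Python. The record fields (cycle, passed, failed) are assumptions chosen for illustration; the actual schema written by --history is not documented here.

```python
import json
import tempfile
from pathlib import Path


def append_cycle(path: Path, record: dict) -> None:
    """Append one cycle record as a single JSON line."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def read_cycles(path: Path) -> list[dict]:
    """Read every non-empty line of a JSONL file back into a dict."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Hypothetical field names; the real history schema may differ.
history = Path(tempfile.mkdtemp()) / "monitor.jsonl"
append_cycle(history, {"cycle": 1, "passed": 4, "failed": 0})
append_cycle(history, {"cycle": 2, "passed": 3, "failed": 1})

records = read_cycles(history)
total = sum(r["passed"] + r["failed"] for r in records)
failure_rate = sum(r["failed"] for r in records) / total
```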
Bug Fixes & Refactoring
- Fixed severity comparison bug (was using string matching instead of enum comparison)
- Fixed JSONL history pass count (was using fail_on filter instead of actual counts)
- Extracted shared _parse_fail_statuses utility for consistent fail_on parsing
- Eliminated redundant config loading in monitor loop
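The release notes do not show the severity fix itself. As a general illustration of why comparing severities as strings can go wrong, the Python sketch below contrasts alphabetical string ordering with an ordered enum. The Severity class, its level names, and meets_threshold are hypothetical and are not taken from EvalView's code.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; integer values define the intended ordering.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def meets_threshold(found: Severity, threshold: Severity) -> bool:
    """True when the found severity is at least as severe as the threshold."""
    return found >= threshold


# Comparing the raw strings orders them alphabetically, so "critical"
# sorts before "high" and the check wrongly reports it as less severe.
string_result = "critical" >= "high"

# Comparing enum members uses the integer ranking instead.
enum_result = meets_threshold(Severity.CRITICAL, Severity.HIGH)
```

Here string_result is False while enum_result is True, which is the kind of mismatch an enum comparison avoids.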
Deployment
# Quick background run
nohup evalview monitor --slack-webhook https://... &
# Docker
docker run -d -v $(pwd):/app -w /app evalview monitor --slack-webhook https://...
Full Changelog: https://github.com/hidai25/eval-view/compare/v0.4.1...v0.5.0