EvalView - AI Agent Testing
Updated to version v0.2.3. Source repository: https://github.com/hidai25/eval-view.
Action Summary
EvalView is a GitHub Action designed to automate the testing and validation of AI agent behavior in CI/CD pipelines. It detects issues like tool changes, output inconsistencies, cost increases, and latency regressions by comparing current runs to a golden baseline, ensuring that regressions are caught before deployment. The tool provides capabilities such as built-in CI integration, tool call tracking, cost/latency monitoring, and interactive debugging, helping developers maintain reliable and efficient agent performance.
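As a quick orientation, a CI workflow step using the action might look like the sketch below. This is a minimal sketch: the input names (`config`, `baseline`) and file paths are illustrative assumptions, not the action's documented inputs; check the repository README for the actual configuration.

```yaml
# Minimal sketch of wiring EvalView into a CI workflow.
# Input names and paths below are illustrative assumptions, not the action's documented API.
name: agent-regression-tests
on: [pull_request]

jobs:
  evalview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Compare agent runs against the golden baseline
        uses: hidai25/eval-view@v0.2.3
        with:
          config: tests/agents.yaml             # assumed input: test suite definition
          baseline: tests/golden-baseline.json  # assumed input: golden run to compare against
```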
Release notes
What’s New
CLI Statistical Mode Flags
- `--runs N` flag: Run each test N times for statistical evaluation (pass@k metrics)
- `--pass-rate` flag: Set required pass rate for `--runs` mode (default: 0.8)
- `--difficulty` filter: Filter tests by difficulty level
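Taken together, the new flags might be combined as in the sketch below. The `evalview run` command name and the test-suite path are assumptions for illustration; only the flag names and the 0.8 default come from the release notes.

```yaml
# Illustrative CI step; the `evalview` CLI entry point and the suite path are assumed names.
- name: Statistical agent evaluation
  run: |
    # Run each test 5 times, require a 90% pass rate (default is 0.8), and only include hard tests.
    evalview run tests/agent-suite.yaml --runs 5 --pass-rate 0.9 --difficulty hard
```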
Difficulty Levels for Test Cases
- New `difficulty` field: `trivial`, `easy`, `medium`, `hard`, `expert`
- Console reporter shows difficulty column and breakdown
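In a test-case file, the new field might be set as in the sketch below; apart from `difficulty` and its allowed values, the keys shown are illustrative assumptions rather than the documented schema.

```yaml
# Hypothetical test-case entry; only the `difficulty` field and its values come from the
# release notes, the other keys are assumed for illustration.
- name: refund-request-escalation
  difficulty: hard            # one of: trivial, easy, medium, hard, expert
  input: "Customer asks for a refund on a damaged order"
  expected_tools:
    - lookup_order
    - issue_refund
```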
Partial Credit for Sequence Evaluation
- Sequence scoring now uses partial credit instead of binary pass/fail
- Example: 3/5 expected steps completed = 60% score (not 0%)
Fixed
- `--runs` CLI flag now properly implemented (was documented but missing)