EvalView - AI Agent Testing
The action at https://github.com/hidai25/eval-view has been updated to version v0.3.0.
Action Type
This is a Composite action.
Action Summary
EvalView is a GitHub Action and CLI tool that detects regressions in AI agent behavior by comparing an agent's current outputs against a saved baseline. It automates spotting changes in prompt outputs, tool usage, and overall performance, so developers can verify that their agents still behave correctly after updates. Key features include regression detection, streak tracking, stability scoring, and support for non-deterministic agents via multi-reference baselines.
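At its core, a check like this compares a fresh agent output against one or more stored references and fails when nothing matches. The sketch below is purely illustrative and does not use EvalView's actual API; the `baselines/` directory layout and the `references` field are assumptions made for the example.

```python
import json
from pathlib import Path

def check_against_baseline(test_name: str, current_output: str,
                           baseline_dir: Path = Path("baselines")) -> bool:
    """Compare an agent's current output with saved reference outputs.

    A multi-reference baseline stores several acceptable outputs, which
    tolerates non-deterministic agents: the check passes if the new
    output matches any stored reference.
    """
    baseline = json.loads((baseline_dir / f"{test_name}.json").read_text())
    references = baseline.get("references", [])
    return any(current_output.strip() == ref.strip() for ref in references)

if __name__ == "__main__":
    ok = check_against_baseline("greeting", "Hello, world!")
    print("PASS" if ok else "REGRESSION")
```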
Release notes
What’s New in 0.3
🤖 Claude Code MCP Integration
EvalView now runs as an MCP server inside Claude Code — test your agent without leaving the conversation.
```bash
claude mcp add --transport stdio evalview -- evalview mcp serve
cp CLAUDE.md.example CLAUDE.md
```
7 MCP tools available:
| Tool | What it does |
|---|---|
| `create_test` | Generate test cases from natural language |
| `run_snapshot` | Capture golden baseline |
| `run_check` | Detect regressions inline |
| `list_tests` | Show all baselines |
| `validate_skill` | Validate SKILL.md structure |
| `generate_skill_tests` | Auto-generate skill test suite |
| `run_skill_test` | Run Phase 1 (deterministic) + Phase 2 (rubric) |
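Because the server speaks standard MCP JSON-RPC over stdio (started with the `evalview mcp serve` command shown above), any client can discover these tools. The minimal Python sketch below is an illustration under that assumption, not part of EvalView: it performs the MCP `initialize` handshake and then lists the available tools with `tools/list`.

```python
import json
import subprocess

# Start the EvalView MCP server over stdio (command taken from the setup above).
proc = subprocess.Popen(
    ["evalview", "mcp", "serve"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)

def send(msg: dict) -> None:
    # MCP's stdio transport exchanges newline-delimited JSON-RPC messages.
    proc.stdin.write(json.dumps(msg) + "\n")
    proc.stdin.flush()

def recv() -> dict:
    return json.loads(proc.stdout.readline())

# Standard MCP handshake: initialize request, then the initialized notification.
send({"jsonrpc": "2.0", "id": 1, "method": "initialize",
      "params": {"protocolVersion": "2024-11-05", "capabilities": {},
                 "clientInfo": {"name": "example-client", "version": "0.0.1"}}})
recv()
send({"jsonrpc": "2.0", "method": "notifications/initialized"})

# Ask the server which tools it exposes (should include run_check, run_snapshot, ...).
send({"jsonrpc": "2.0", "id": 2, "method": "tools/list"})
print(json.dumps(recv(), indent=2))
```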
📊 Telemetry Improvements
- Users now show as `EvalView-3f8a2b` instead of raw UUIDs in PostHog
- Session duration tracking (`session_duration_ms`)
- Set `EVALVIEW_DEV=1` to tag your own events for filtering (see the sketch below)
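For example, to exercise the server locally while keeping your own events distinguishable, the flag can be set in the environment of whatever launches EvalView; a minimal sketch (the subprocess wrapper is only for illustration):

```python
import os
import subprocess

# EVALVIEW_DEV=1 tags the resulting telemetry events so they can be
# filtered out of (or isolated in) PostHog dashboards.
env = dict(os.environ, EVALVIEW_DEV="1")
subprocess.run(["evalview", "mcp", "serve"], env=env)
```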
🐕 Dogfood Regression Testing
EvalView now tests itself using its own evaluation logic on every CI run.
Bug Fixes
- Fixed PIPESTATUS CI bug (regression checks now correctly fail CI)
- Fixed deprecated `asyncio.get_event_loop()` → `get_running_loop()`
- Fixed silent failures in `--json` mode
- ANSI escape stripping improved in MCP output
Upgrade
```bash
pip install --upgrade evalview
```