sensei-eval
https://github.com/CodeJonesW/sensei-eval has been updated to version v0.8.0.
- This action is used by 0 repositories across all versions.
Action Type
This is a Composite action.
Action Summary
The sensei-eval GitHub Action and TypeScript library streamline the evaluation of AI-generated educational content by performing deterministic checks and leveraging LLM scoring. It automates the detection of content quality regressions in CI workflows, enabling teams to maintain consistent prompt quality. Key features include baseline generation, regression detection, deterministic quick checks, and integration with CI pipelines to ensure scalable and cost-efficient quality control.
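To illustrate the "deterministic quick checks" the summary mentions, here is a minimal sketch of what an LLM-free content check might look like. The criteria shape, function name, and result type are illustrative assumptions, not sensei-eval's actual API:

```typescript
// Hypothetical sketch of a deterministic quick check: cheap, LLM-free
// assertions on generated content, run before any judge scoring.
// The names and criteria shape here are assumptions for illustration.

interface QuickCheckResult {
  pass: boolean;
  failures: string[];
}

function quickCheck(
  content: string,
  criteria: { minWords?: number; requiredSections?: string[] }
): QuickCheckResult {
  const failures: string[] = [];

  // Word-count floor: catches truncated or empty generations.
  const words = content.trim().split(/\s+/).filter(Boolean).length;
  if (criteria.minWords !== undefined && words < criteria.minWords) {
    failures.push(`expected >= ${criteria.minWords} words, got ${words}`);
  }

  // Required section headings must appear verbatim in the content.
  for (const section of criteria.requiredSections ?? []) {
    if (!content.includes(section)) {
      failures.push(`missing required section: ${section}`);
    }
  }

  return { pass: failures.length === 0, failures };
}
```

Because checks like this are deterministic and free, they can gate every CI run, reserving LLM judge calls for content that passes the cheap filters.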
Release notes
Summary
- Judge usage tracking: `Judge.score()` now returns optional `usage` (`input_tokens`, `output_tokens`) alongside score results. `createJudge` passes through `response.usage` from the Anthropic API.
- EvalResult aggregation: `EvalRunner` aggregates token usage across all judge-scored criteria (including inline rubrics) into `EvalResult.usage`. Omitted when no judge calls are made (e.g. `quickCheck`).
- Default model change: `createJudge`'s default model changed from `claude-sonnet-4` to `claude-haiku-4-5-20251001`, a better default for cost-sensitive eval workloads. Callers can still override via `opts.model`.
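The aggregation behavior described above can be sketched as follows. The `Usage` shape mirrors the Anthropic API's `input_tokens`/`output_tokens` fields named in the release notes; the `aggregateUsage` function itself is an illustrative assumption, not `EvalRunner`'s actual internals:

```typescript
// Sketch of rolling per-criterion judge usage up into a single total,
// as EvalResult.usage is described in the v0.8.0 notes. The helper name
// and signature are assumptions for illustration.

interface Usage {
  input_tokens: number;
  output_tokens: number;
}

// Judge-scored criteria may each report usage; deterministic checks report none.
function aggregateUsage(perCriterion: (Usage | undefined)[]): Usage | undefined {
  const reported = perCriterion.filter((u): u is Usage => u !== undefined);
  // Omitted entirely when no judge calls were made (e.g. quickCheck-only runs).
  if (reported.length === 0) return undefined;
  return reported.reduce(
    (acc, u) => ({
      input_tokens: acc.input_tokens + u.input_tokens,
      output_tokens: acc.output_tokens + u.output_tokens,
    }),
    { input_tokens: 0, output_tokens: 0 }
  );
}
```

Returning `undefined` rather than a zeroed total lets callers distinguish "no LLM cost incurred" from "an LLM run that happened to use zero tokens".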
Test plan
- Existing judge tests updated with `usage` in mock responses
- New runner tests verify usage aggregation from LLM criteria
- New runner test verifies usage aggregation from inline rubrics
- New runner test verifies `usage` is undefined for deterministic-only evals
- All 216 tests pass
🤖 Generated with Claude Code