BreakPoint Evaluate
Version updated for https://github.com/cholmess/breakpoint-ai to version v1.
- This action is used across all versions by ? repositories.
Action Type
This is a Composite action.
Go to the GitHub Marketplace to find the latest changes.
Action Summary
BreakPoint AI is a GitHub Action designed to evaluate AI model changes and prevent problematic deployments. It automates the detection of issues such as increased costs, sensitive data leakage, and output regressions by comparing a baseline and candidate model, providing a clear status (ALLOW, WARN, or BLOCK). This tool ensures reliable and compliant model updates by integrating seamlessly into CI workflows for pre-merge gating and policy enforcement.
Release notes
BreakPoint AI
Prevent bad AI releases before they hit production.
pip install breakpoint-ai
You change a model. The output looks fine. But:
- Cost jumps +38%.
- A phone number slips into the response.
- The format breaks your downstream parser.
BreakPoint catches it before you deploy.
It runs locally. Policy evaluation is deterministic from your saved artifacts. It gives you one clear answer:
ALLOW · WARN · BLOCK
Quick Example
breakpoint evaluate baseline.json candidate.json
STATUS: BLOCK
Reasons:
- Cost increased by 38% (baseline: 1,000 tokens -> candidate: 1,380)
- Detected US phone number pattern
Ship with confidence.
Lite First (Default)
This is all you need to get started:
breakpoint evaluate baseline.json candidate.json
Lite is local, deterministic, and zero-config. Out of the box:
- Cost:
WARNat+20%,BLOCKat+40% - PII:
BLOCKon first detection (email, phone, credit card) - Drift:
WARNat+35%,BLOCKat+70% - Empty output: always
BLOCK
Advanced option: Need config-driven policies, output contract, latency, presets, or waivers? Use --mode full and see docs/user-guide-full-mode.md.
Full Mode (If You Need It)
Add --mode full when you need config-driven policies, output contract, latency, presets, or waivers. Full details: docs/user-guide-full-mode.md.
breakpoint evaluate baseline.json candidate.json --mode full --json --fail-on warn
CI First (Recommended)
breakpoint evaluate baseline.json candidate.json --json --fail-on warn
Why this is the default integration path:
- Machine-readable decision payload (
schema_version,status,reason_codes, metrics). - Non-zero exit code on risky changes.
- Easy to wire into existing CI without additional services.
Default policy posture (out of the box, Lite):
- Cost:
WARNat+20%,BLOCKat+40% - PII:
BLOCKon first detection - Drift:
WARNat+35%,BLOCKat+70%
GitHub Action (Marketplace)
Use the BreakPoint Evaluate action in any workflow:
- uses: cholmess/breakpoint-ai@v1
with:
baseline: baseline.json
candidate: candidate.json
fail_on: warn
mode: lite
Pre-merge gate example:
name: BreakPoint Gate
on:
pull_request:
branches: [main]
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Generate candidate
run: # ... produce candidate.json from your model
- name: BreakPoint Evaluate
uses: cholmess/breakpoint-ai@v1
with:
baseline: baseline.json
candidate: candidate.json
fail_on: warn
Or copy the template: examples/ci/github-actions-breakpoint.yml → .github/workflows/breakpoint-gate.yml
What --fail-on warn means:
- Any
WARNorBLOCKfails the CI step. - Exit behavior remains deterministic:
ALLOW=0,WARN=1,BLOCK=2.
If you only want to fail on BLOCK, change:
BREAKPOINT_FAIL_ON: warnto:BREAKPOINT_FAIL_ON: block
Try In 60 Seconds
pip install -e .
make demo
What you should see:
- Scenario A:
BLOCK(cost spike) - Scenario B:
BLOCK(format/contract regression) - Scenario C:
BLOCK(PII + verbosity drift) - Scenario D:
BLOCK(small prompt change -> cost blowup)
Four Realistic Examples
Baseline for all examples:
examples/install_worthy/baseline.json
1) Cost regression after model swap
breakpoint evaluate examples/install_worthy/baseline.json examples/install_worthy/candidate_cost_model_swap.json
Expected: BLOCK
Why it matters: output appears equivalent, but cost increases enough to violate policy.
2) Structured-output behavior regression
breakpoint evaluate examples/install_worthy/baseline.json examples/install_worthy/candidate_format_regression.json
Expected: BLOCK
Why it matters: candidate drops expected structure and drifts from baseline behavior.
3) PII appears in candidate output
breakpoint evaluate examples/install_worthy/baseline.json examples/install_worthy/candidate_pii_verbosity.json
Expected: BLOCK
Why it matters: candidate introduces PII and adds verbosity drift.
4) Small prompt change -> big cost blowup
breakpoint evaluate examples/install_worthy/baseline.json examples/install_worthy/candidate_killer_tradeoff.json
Expected: BLOCK
Why it matters: output still looks workable, but detail-heavy prompt changes plus a model upgrade create large cost and latency increases with output-contract drift.
More scenario details:
docs/install-worthy-examples.md
CLI
Evaluate two JSON files:
breakpoint evaluate baseline.json candidate.json
Evaluate a single combined JSON file:
breakpoint evaluate payload.json
JSON output for CI/parsing:
breakpoint evaluate baseline.json candidate.json --json
Exit-code gating options:
# fail on WARN or BLOCK
breakpoint evaluate baseline.json candidate.json --fail-on warn
# fail only on BLOCK
breakpoint evaluate baseline.json candidate.json --fail-on block
Stable exit codes:
0=ALLOW1=WARN2=BLOCK
Waivers, config, presets: see docs/user-guide-full-mode.md.
Input Schema
Each input JSON is an object with at least:
output(string)
Optional fields used by policies:
cost_usd(number)model(string)tokens_total(number)tokens_in/tokens_out(number)latency_ms(number)
Combined input format:
{
"baseline": { "output": "..." },
"candidate": { "output": "..." }
}
Python API
from breakpoint import evaluate
decision = evaluate(
baseline_output="hello",
candidate_output="hello there",
metadata={"baseline_tokens": 100, "candidate_tokens": 140},
)
print(decision.status)
print(decision.reasons)
Additional Docs
docs/user-guide.mddocs/user-guide-full-mode.md(Full mode: config, presets, environments, waivers)docs/terminal-output-lite-vs-full.md(Lite vs Full terminal output, same format)docs/quickstart-10min.mddocs/install-worthy-examples.mddocs/baseline-lifecycle.mddocs/ci-templates.mddocs/value-metrics.mddocs/policy-presets.mddocs/release-gate-audit.md
Topics
Add these topics in your repo settings for discoverability: ai, llm, evaluation, ci, quality-gate, github-actions, breakpoint.
Contact
Suggestions and feedback: c.holmes.silva@gmail.com or open an issue.