Bug Report: Benchmark tab renders empty in eval viewer
Summary
The Benchmark tab in the skill-creator eval viewer renders empty when `benchmark.json` uses the `configurations[]` array format (the format Claude agents commonly produce). The viewer (`viewer.html`) expects a different nested format with `run_summary` and `runs[]`.
Root Cause
`generate_review.py` embeds `benchmark.json` as-is without normalizing the schema. `viewer.html` (line ~1122+) expects:
- `data.run_summary` — a dict keyed by configuration name, each value containing `{mean, stddev}` stat objects
- `data.runs[]` — an array of objects with a `configuration` string and a nested `result.pass_rate`
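For reference, a minimal sketch of the shape `viewer.html` expects, inferred from the field descriptions above (the configuration name and values are hypothetical):

```json
{
  "run_summary": {
    "with_skill (v2)": {
      "pass_rate": {"mean": 1.0, "stddev": 0.0}
    }
  },
  "runs": [
    {
      "configuration": "with_skill (v2)",
      "result": {"pass_rate": 1.0}
    }
  ]
}
```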
But Claude agents producing `benchmark.json` commonly output:
```json
{
  "configurations": [
    {
      "name": "with_skill (v2)",
      "pass_rate": {"mean": 1.0, "stddev": 0.0},
      "tokens": {"mean": 27973, "stddev": 4157},
      "time_seconds": {"mean": 56.5, "stddev": 12.2}
    }
  ],
  "delta_v2_vs_baseline": {
    "pass_rate": "+11.1% (100% vs 89%)",
    "tokens": "+68%"
  },
  "analyst_observations": ["v2 skill maintains 100% pass rate..."]
}
```
Fix
Add a `normalize_benchmark()` function in `generate_review.py` that converts between formats:
- `configurations[]` array → `run_summary` dict (keyed by `name`)
- Auto-generate `runs[]` from `configurations` if missing
- `config` → `configuration` field rename in existing runs
- Flat stat values → `{mean, stddev}` objects
- `analyst_observations` accepted as alias for `notes`
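A minimal sketch of what `normalize_benchmark()` could look like, covering the conversions listed above. The helper name `_as_stat` and the choice of `0.0` as the default `stddev` for flat values are assumptions, not part of the existing code:

```python
def _as_stat(value):
    """Wrap a flat numeric value in a {mean, stddev} object.

    A stddev of 0.0 for flat values is an assumption of this sketch.
    """
    if isinstance(value, dict):
        return value
    return {"mean": value, "stddev": 0.0}


def normalize_benchmark(data: dict) -> dict:
    """Convert the flat configurations[] format into the nested
    run_summary/runs[] shape viewer.html expects. Input already in
    the expected shape passes through unchanged."""
    out = dict(data)
    configs = out.get("configurations", [])

    # configurations[] array -> run_summary dict keyed by name
    if configs and "run_summary" not in out:
        out["run_summary"] = {
            c["name"]: {k: _as_stat(v) for k, v in c.items() if k != "name"}
            for c in configs
        }

    # Auto-generate runs[] from configurations if missing
    if configs and "runs" not in out:
        out["runs"] = [
            {"configuration": c["name"],
             "result": {"pass_rate": _as_stat(c.get("pass_rate"))}}
            for c in configs
        ]

    # config -> configuration field rename in existing runs
    for run in out.get("runs", []):
        if "config" in run and "configuration" not in run:
            run["configuration"] = run.pop("config")

    # analyst_observations accepted as alias for notes
    if "analyst_observations" in out and "notes" not in out:
        out["notes"] = out["analyst_observations"]

    return out
```

Calling this once in `generate_review.py` before embedding the data would let the viewer accept either schema without changes to `viewer.html`.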
Reproduction
- Run the skill-creator eval workflow
- Aggregate produces `benchmark.json` with the `configurations[]` format
- Run `generate_review.py --benchmark benchmark.json`
- Benchmark tab is empty — no data rendered
Related
This is the companion issue to #604 (Formal Grades display broken). Both are schema mismatch bugs between `generate_review.py` output and `viewer.html` expectations.
Environment
- skill-creator plugin (commit `bd041495bd2a`)
- Python 3.12, macOS