skill-creator eval viewer: Benchmark tab empty due to configurations[] format not normalized #605

@mr-magaia

Bug Report: Benchmark tab renders empty in eval viewer

Summary

The Benchmark tab in the skill-creator eval viewer renders empty when benchmark.json uses the configurations[] array format (the format Claude agents commonly produce). viewer.html expects a different, nested format built on run_summary and runs[].

Root Cause

generate_review.py embeds benchmark.json as-is without normalizing the schema. viewer.html (line ~1122+) expects:

  • data.run_summary — a dict keyed by configuration name, each value containing {mean, stddev} stat objects
  • data.runs[] — array of objects with configuration string and nested result.pass_rate

But Claude agents producing benchmark.json commonly output:

{
  "configurations": [
    {
      "name": "with_skill (v2)",
      "pass_rate": {"mean": 1.0, "stddev": 0.0},
      "tokens": {"mean": 27973, "stddev": 4157},
      "time_seconds": {"mean": 56.5, "stddev": 12.2}
    }
  ],
  "delta_v2_vs_baseline": {
    "pass_rate": "+11.1% (100% vs 89%)",
    "tokens": "+68%"
  },
  "analyst_observations": ["v2 skill maintains 100% pass rate..."]
}
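
For triage, the two shapes can be told apart with a small check. This is a minimal sketch: detect_benchmark_format and its return labels are hypothetical names, not part of generate_review.py, and the checks rely only on the top-level keys described above.

```python
import json


def detect_benchmark_format(data: dict) -> str:
    """Classify a parsed benchmark.json by its top-level keys.

    Hypothetical helper for illustration; the labels are made up here.
    """
    # Shape viewer.html is described as expecting: run_summary dict + runs[] list.
    if isinstance(data.get("run_summary"), dict) and isinstance(data.get("runs"), list):
        return "viewer"
    # Shape agents commonly produce: a configurations[] list.
    if isinstance(data.get("configurations"), list):
        return "configurations"
    return "unknown"


if __name__ == "__main__":
    # Trimmed copy of the sample from this issue.
    sample = json.loads(
        '{"configurations": [{"name": "with_skill (v2)",'
        ' "pass_rate": {"mean": 1.0, "stddev": 0.0}}]}'
    )
    print(detect_benchmark_format(sample))
```

A file classified as "configurations" is the case that currently leaves the Benchmark tab empty.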

Fix

Add a normalize_benchmark() function in generate_review.py that converts between formats:

  1. configurations[] array → run_summary dict (keyed by name)
  2. Auto-generate runs[] from configurations if missing
  3. config → configuration field rename in existing runs
  4. Flat stat values → {mean, stddev} objects
  5. analyst_observations accepted as alias for notes
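
The five steps above could be sketched roughly as follows. This is not the actual patch: the helper _as_stat, the choice of stddev 0.0 for flat values, and the exact shape of the generated runs[] entries (one entry per configuration, carrying the mean pass rate in result.pass_rate) are assumptions based on the description of what viewer.html reads.

```python
from typing import Any


def _as_stat(value: Any) -> Any:
    """Wrap a flat number as {mean, stddev}; leave everything else untouched.

    Assumption: a flat value carries no spread, so stddev is set to 0.0.
    """
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return {"mean": value, "stddev": 0.0}
    return value


def normalize_benchmark(data: dict) -> dict:
    """Return a copy of `data` with run_summary, runs and notes filled in."""
    out = dict(data)  # shallow copy; the input dict is not modified
    configs = out.get("configurations") or []

    # Step 1: configurations[] -> run_summary dict keyed by name.
    summary = out.get("run_summary")
    if summary is None and configs:
        summary = {
            c["name"]: {k: v for k, v in c.items() if k != "name"}
            for c in configs
        }
    # Step 4: flat stat values -> {mean, stddev} objects.
    if isinstance(summary, dict):
        out["run_summary"] = {
            name: ({k: _as_stat(v) for k, v in stats.items()}
                   if isinstance(stats, dict) else stats)
            for name, stats in summary.items()
        }

    # Step 3: rename config -> configuration in existing runs.
    runs = []
    for run in out.get("runs") or []:
        run = dict(run)
        if "config" in run and "configuration" not in run:
            run["configuration"] = run.pop("config")
        runs.append(run)

    # Step 2: generate runs[] from configurations if none exist (assumed shape).
    if not runs:
        for c in configs:
            rate = _as_stat(c.get("pass_rate"))
            mean = rate.get("mean") if isinstance(rate, dict) else rate
            runs.append({"configuration": c["name"], "result": {"pass_rate": mean}})
    if runs:
        out["runs"] = runs

    # Step 5: accept analyst_observations as an alias for notes.
    if "notes" not in out and "analyst_observations" in out:
        out["notes"] = out["analyst_observations"]
    return out
```

Files already in the viewer format pass through with their run_summary and runs intact, so the function could be applied unconditionally before the JSON is embedded.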

Reproduction

  1. Run skill-creator eval workflow
  2. Aggregate produces benchmark.json with configurations[] format
  3. Run generate_review.py --benchmark benchmark.json
  4. Benchmark tab is empty — no data rendered

Related

This is the companion issue to #604 (Formal Grades display broken). Both are schema mismatch bugs between generate_review.py output and viewer.html expectations.

Environment

  • skill-creator plugin (commit bd041495bd2a)
  • Python 3.12, macOS
