SkillBench vs Skill Creator
Two tools for testing Claude Code skills. Different layers, complementary goals.
The short version
Skill Creator
“Does this skill produce good output?”
Tests the execution layer. Given that Claude activated the skill, does the skill's content lead to quality results?
SkillBench
“Does Claude find and use the right skills?”
Tests the retrieval/activation layer. Does the configuration lead Claude to discover, load, and read the correct skill and references?
How Skill Creator works
Skill Creator is a meta-skill bundled with Claude Code. It helps you write a skill, test it against a few prompts, and iterate until the output quality is good.
Write test prompts
2-3 realistic prompts saved to evals.json. Each prompt describes a task a real user would give Claude.
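A minimal sketch of what evals.json might hold, assuming a simple prompt-plus-assertions layout; the field names are illustrative, not Skill Creator's actual schema:

```python
import json

# Hypothetical evals.json layout. The keys are assumptions for illustration,
# not the tool's real schema.
evals = [
    {
        "prompt": "Build a responsive 3-column pricing table for the landing page.",
        "assertions": [
            "uses CSS grid or flexbox, not floats",
            "collapses to a single column on narrow screens",
        ],
    },
    {
        "prompt": "Add a sticky site header that shrinks on scroll.",
        "assertions": ["header remains visible while scrolling", "no layout shift on load"],
    },
]

with open("evals.json", "w") as f:
    json.dump(evals, f, indent=2)
```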
Spawn parallel subagents
For each prompt, two subagents run simultaneously: one with the skill, one without (or with the old version).
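Conceptually this is a two-arm A/B fan-out per prompt. A rough sketch of that shape, where run_subagent is a hypothetical placeholder for however Skill Creator actually launches subagents:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(prompt: str, with_skill: bool) -> str:
    """Placeholder: in the real tool this is a Claude Code subagent,
    run with the skill installed or without it (or with the old version)."""
    raise NotImplementedError

def run_pair(prompt: str) -> tuple[str, str]:
    # Both arms see the same prompt; only the presence of the skill differs.
    with ThreadPoolExecutor(max_workers=2) as pool:
        treated = pool.submit(run_subagent, prompt, True)
        control = pool.submit(run_subagent, prompt, False)
        return treated.result(), control.result()
```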
Grade with assertions
User-defined assertions are evaluated by a grader subagent or script. Results include pass/fail, timing, and token usage.
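For the script path, grading can be as small as a per-assertion check over the output. A hedged sketch, assuming assertions are simple textual checks; subjective assertions would go to the grader subagent instead:

```python
def grade(output: str, assertions: list[str]) -> dict:
    """Naive script grader: each assertion becomes a case-insensitive substring check.
    Timing and token usage come from the subagent run metadata, not from this function."""
    details = {a: a.lower() in output.lower() for a in assertions}
    return {"passed": all(details.values()), "details": details}
```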
Human review in viewer
A Python-generated HTML viewer shows side-by-side outputs. The user leaves qualitative feedback per test case.
Iterate
Revise the skill based on feedback, rerun all prompts into iteration-N+1/, and compare with the previous iteration in the viewer.
There's also a separate description optimization loop that generates 20 should-trigger/should-not-trigger queries and tunes the SKILL.md description field for better activation accuracy.
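The metric in that loop is activation accuracy: of the generated queries, how many trigger when they should and stay quiet when they shouldn't. A sketch of the calculation, with made-up queries and a caller-supplied activation check:

```python
from typing import Callable

# Illustrative queries; in the real loop these are generated, roughly 20 in total.
SHOULD_TRIGGER = ["Center a modal with flexbox", "Fix the overlapping sidebar layout"]
SHOULD_NOT_TRIGGER = ["Write a SQL migration for the users table", "Summarize this PDF"]

def activation_accuracy(activates: Callable[[str], bool]) -> float:
    """activates(query) answers: did Claude activate the skill for this query?"""
    hits = sum(activates(q) for q in SHOULD_TRIGGER)
    correct_rejections = sum(not activates(q) for q in SHOULD_NOT_TRIGGER)
    return (hits + correct_rejections) / (len(SHOULD_TRIGGER) + len(SHOULD_NOT_TRIGGER))
```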
How SkillBench works
SkillBench tests the system-level configuration, not individual skill content. It measures whether a change to the skill index format, hook injection strategy, or description framing affects Claude's ability to find and use the right skills.
Hypothesis from a claim
Start with a blog post, documentation change, or intuition. “Adding a TOC to reference files helps Claude find the right section.”
Template with planted signals
A test project with CSS/HTML/JS skills whose references embed unique markers. If Claude outputs data-rack, it read the layout-patterns reference.
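The trick is a one-to-one mapping between each planted marker and the reference file that contains it. A hedged sketch; only the data-rack / layout-patterns pair comes from the example above, the other entries are invented:

```python
# Each marker appears in exactly one reference file, so its presence in Claude's
# output proves that file was read.
SIGNAL_TO_REFERENCE = {
    "data-rack": "css-skill/references/layout-patterns.md",      # from the example above
    "--ink-accent": "css-skill/references/design-tokens.md",     # invented for illustration
    "initCarouselDeck": "js-skill/references/component-patterns.md",  # invented for illustration
}

def references_read(output: str) -> set[str]:
    return {ref for signal, ref in SIGNAL_TO_REFERENCE.items() if signal in output}
```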
Docker-isolated A/B runs
Each prompt runs in a fresh container. Event hooks log every skill load, reference read, and tool call. No context bleeds between prompts.
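A rough sketch of per-prompt isolation, assuming a placeholder image and entrypoint; only ANTHROPIC_API_KEY crosses the host boundary:

```python
import subprocess

def run_prompt_isolated(prompt: str) -> subprocess.CompletedProcess:
    """One prompt, one throwaway container. The image and entrypoint below are
    placeholders, not SkillBench's actual runner."""
    cmd = [
        "docker", "run", "--rm",
        "-e", "ANTHROPIC_API_KEY",     # forward only the API key from the host
        "skillbench-runner:latest",    # placeholder image with the template baked in
        "run-prompt", prompt,          # placeholder entrypoint
    ]
    return subprocess.run(cmd, capture_output=True, text=True)
```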
Automated grading
Signal presence/absence, skill activation, reference reads. Binary, no interpretation needed. Aggregated into pass rates with mean/stddev/delta.
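Aggregation is then plain arithmetic over binary grades. A minimal sketch, assuming each prompt run has already been reduced to a pass/fail flag:

```python
from statistics import mean, stdev

def summarize(control_passes: list[bool], treatment_passes: list[bool]) -> dict:
    """Pass rates for the A (control) and B (treatment) configs, plus the delta.
    stddev is only meaningful when the experiment is repeated across multiple runs."""
    a = [float(x) for x in control_passes]
    b = [float(x) for x in treatment_passes]
    return {
        "a_pass_rate": mean(a),
        "b_pass_rate": mean(b),
        "delta": mean(b) - mean(a),
        "a_stddev": stdev(a) if len(a) > 1 else 0.0,
        "b_stddev": stdev(b) if len(b) > 1 else 0.0,
    }
```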
Publish to dashboard
Results are published as JSON to this dashboard. Each experiment becomes a permanent, browsable record with per-prompt breakdowns.
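One hedged guess at the shape of a per-prompt record in that JSON; the field names are assumptions, not the dashboard's actual schema:

```python
import json

# Hypothetical report entry; field names are assumptions for illustration only.
report_entry = {
    "experiment": "toc-in-reference-files",
    "variant": "B",
    "prompt_id": "multi-reference-01",
    "skill_loaded": True,
    "references_read": ["layout-patterns.md"],
    "signals_found": ["data-rack"],
    "passed": True,
}
print(json.dumps(report_entry, indent=2))
```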
Side-by-side comparison
The differences are architectural. Each tool optimizes a different part of the skill lifecycle.
| | Skill Creator | SkillBench |
|---|---|---|
| Scope | One skill at a time | Entire skill system (index format, hooks, config patterns) |
| What varies | Skill content (body, references, description) | Infrastructure (CLAUDE.md index, hook injection, frontmatter, description framing) |
| Isolation | Subagent processes (same machine) | Docker containers (full environment isolation) |
| Control vs Treatment | with_skill vs without_skill (or old vs new) | ProjectA vs ProjectB (identical template, one config change) |
| Measurement | Output assertions + timing + tokens | Event logs (skill loads, ref reads, tool calls) + output verification signals |
| Signal design | User-defined assertions per prompt | Embedded naming conventions that prove a reference was read |
| Prompt design | 2-3 ad hoc prompts per skill | 7-tier structured system (10+ prompts per experiment) |
| Grading | Subagent or script evaluates assertions | Automated signal detection + assertion framework |
| Iteration | Edit skill, rerun, compare iterations | Edit config, rerun in Docker, compare A/B |
| Viewer | Python HTTP server or static HTML | Next.js dashboard (Vercel-deployed) |
| Persistence | Workspace directories per iteration | Published JSON reports, permanent dashboard |
Visual comparison
The two pipelines are structurally similar but operate at different layers.
Skill Creator flow
SkillBench flow
Key differences
What they measure
Skill Creator asks “given that Claude is using this skill, does the output meet quality expectations?” SkillBench asks “given this system configuration, does Claude correctly discover, load, and apply the right skill for each prompt?” One tests execution quality. The other tests retrieval accuracy.
How they verify
Skill Creator uses grader subagents that interpret output quality. This works well for subjective assessment but introduces variance. SkillBench uses planted verification signals: unique data attributes, CSS custom properties, and function names embedded in reference files. If data-rack appears in the output, the layout-patterns reference was read. Binary, deterministic, no interpretation needed.
Isolation model
Skill Creator runs subagent processes on the same machine. State can theoretically bleed. SkillBench runs each prompt in a fresh Docker container with only the API key injected. No conversation context, no filesystem state, no ambient configuration can affect results.
Scale and structure
Skill Creator uses 2-3 ad hoc prompts per iteration, optimized for fast feedback loops during development. SkillBench uses a structured 7-tier prompt design with 8-10 prompts per experiment, covering broad tasks, single-reference, multi-reference, multi-skill, edge cases, modification, and refactoring. This tests activation accuracy across the full spectrum of prompt types.
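The tier names below come from that description; the example prompts are invented stand-ins to show how coverage widens from broad tasks to refactoring:

```python
# Tier names are from the SkillBench design described above; the prompts are
# illustrative stand-ins, not the benchmark's real prompt set.
TIER_EXAMPLES = {
    "broad": "Build a marketing landing page for a hardware store.",
    "single-reference": "Lay out the product grid using the documented layout patterns.",
    "multi-reference": "Build a pricing page that uses both the layout patterns and design tokens.",
    "multi-skill": "Add a themed, animated dropdown menu (CSS and JS skills together).",
    "edge-case": "Explain why the footer overlaps the content on narrow screens.",
    "modification": "Change the card grid from three columns to four.",
    "refactoring": "Refactor the inline styles on the checkout page into the shared stylesheet.",
}
```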
How they complement each other
The ideal workflow uses both:
Develop the skill
Use Skill Creator to write the skill body, test output quality, iterate on content, and optimize the description for triggering. End result: a skill that produces good outputs when activated.
Test the system
Use SkillBench to test whether the skill is discovered reliably in the context of the full system: CLAUDE.md index, hook injection, other competing skills. End result: confidence that the skill activates when it should and stays quiet when it shouldn't.