
SkillBench vs Skill Creator

Two tools for testing Claude Code skills. Different layers, complementary goals.

The short version

Skill Creator

“Does this skill produce good output?”

Tests the execution layer. Given that Claude activated the skill, does the skill's content lead to quality results?

SkillBench

“Does Claude find and use the right skills?”

Tests the retrieval/activation layer. Does the configuration make Claude discover, load, and read the correct skill and references?

How Skill Creator works

Skill Creator is a meta-skill bundled with Claude Code. It helps you write a skill, test it against a few prompts, and iterate until the output quality is good.

Write test prompts

2-3 realistic prompts saved to evals.json. Each prompt describes a task a real user would give Claude.
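The exact schema of evals.json isn't documented here; a minimal sketch of what such a file might hold, with hypothetical field names and illustrative prompts:

```python
import json

# Hypothetical evals.json structure: the real schema used by Skill Creator
# may differ. Prompts and assertions are illustrative.
evals = {
    "skill": "layout-patterns",
    "cases": [
        {
            "prompt": "Build a responsive pricing page with three tiers.",
            "assertions": [
                "Output uses CSS grid for the tier layout",
                "Each tier card includes a call-to-action button",
            ],
        },
        {
            "prompt": "Refactor this navbar to collapse into a menu on mobile.",
            "assertions": ["A breakpoint-driven collapse is implemented"],
        },
    ],
}

with open("evals.json", "w") as f:
    json.dump(evals, f, indent=2)

print(len(evals["cases"]))  # number of test prompts
```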

Spawn parallel subagents

For each prompt, two subagents run simultaneously: one with the skill, one without (or with the old version).

Grade with assertions

User-defined assertions are evaluated by a grader subagent or script. Results include pass/fail, timing, and token usage.
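The grading step can be sketched as follows. In practice the grader is a subagent or script; here a naive substring predicate stands in for it, and the report fields are hypothetical:

```python
import time

# Stand-in grader: checks each user-defined assertion against the output
# and returns pass/fail counts plus timing. Field names are hypothetical.
def grade(output: str, assertions: list[str]) -> dict:
    start = time.perf_counter()
    results = {a: (a.lower() in output.lower()) for a in assertions}  # naive check
    return {
        "passed": sum(results.values()),
        "total": len(results),
        "per_assertion": results,
        "grading_seconds": time.perf_counter() - start,
    }

report = grade(
    output="The page uses CSS grid with three tier cards.",
    assertions=["css grid", "three tier"],
)
print(report["passed"], "/", report["total"])  # 2 / 2
```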

Human review in viewer

A Python-generated HTML viewer shows side-by-side outputs. The user leaves qualitative feedback per test case.

Iterate

Revise the skill based on feedback, rerun all prompts into iteration-N+1/, and compare against the previous iteration in the viewer.

There's also a separate description optimization loop that generates 20 should-trigger/should-not-trigger queries and tunes the SKILL.md description field for better activation accuracy.
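The activation-accuracy score in that loop might be computed like this. The `would_activate` predicate is a hypothetical stand-in: real tuning asks the model whether the description triggers, not a keyword heuristic.

```python
# Sketch of scoring a description against should-trigger and
# should-not-trigger queries. The heuristic below is a placeholder.
def would_activate(description: str, query: str) -> bool:
    return any(word in query.lower() for word in description.lower().split())

description = "layout patterns grid responsive"
should_trigger = ["make a responsive grid gallery", "fix my page layout"]
should_not_trigger = ["write a sql migration", "summarize this pdf"]

correct = sum(would_activate(description, q) for q in should_trigger)
correct += sum(not would_activate(description, q) for q in should_not_trigger)
accuracy = correct / (len(should_trigger) + len(should_not_trigger))
print(f"activation accuracy: {accuracy:.0%}")
```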

How SkillBench works

SkillBench tests the system-level configuration, not individual skill content. It measures whether a change to the skill index format, hook injection strategy, or description framing affects Claude's ability to find and use the right skills.

Hypothesis from a claim

Start with a blog post, documentation change, or intuition. “Adding a TOC to reference files helps Claude find the right section.”

Template with planted signals

A test project with CSS/HTML/JS skills whose references embed unique markers. If Claude outputs data-rack, it read layout-patterns.
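The signal check reduces to string matching. The `data-rack` marker comes from the text; the second marker and the mapping of signals to reference files are hypothetical:

```python
# Planted-signal detection: each unique marker proves one reference was read.
# Only data-rack/layout-patterns is from the source; the rest is illustrative.
SIGNALS = {
    "data-rack": "layout-patterns",   # unique attribute planted in that reference
    "--grain-gap": "spacing-tokens",  # hypothetical CSS custom property
}

def references_read(output: str) -> set[str]:
    """Return which reference files the output proves were read."""
    return {ref for marker, ref in SIGNALS.items() if marker in output}

html = '<div class="pricing" data-rack style="gap: var(--grain-gap)">'
print(references_read(html))
```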

Docker-isolated A/B runs

Each prompt runs in a fresh container. Event hooks log every skill load, reference read, and tool call. No context bleeds between prompts.

Automated grading

Signal presence/absence, skill activation, reference reads. Binary, no interpretation needed. Aggregated into pass rates with mean/stddev/delta.
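Aggregation of those binary grades might look like this sketch; the run-level pass rates are illustrative numbers, not real experiment data:

```python
from statistics import mean, stdev

# Collapse per-run pass rates into mean/stddev, then compute the A/B delta.
def summarize(pass_rates: list[float]) -> dict:
    return {"mean": mean(pass_rates), "stddev": stdev(pass_rates)}

control = summarize([0.70, 0.80, 0.75])    # ProjectA runs (illustrative)
treatment = summarize([0.85, 0.90, 0.95])  # ProjectB runs (illustrative)
delta = treatment["mean"] - control["mean"]
print(f"delta: {delta:+.3f}")
```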

Publish to dashboard

Results are published as JSON to this dashboard. Each experiment becomes a permanent, browsable record with per-prompt breakdowns.

Side-by-side comparison

The differences are architectural. Each tool optimizes a different part of the skill lifecycle.

| | Skill Creator | SkillBench |
| --- | --- | --- |
| Scope | One skill at a time | Entire skill system (index format, hooks, config patterns) |
| What varies | Skill content (body, references, description) | Infrastructure (CLAUDE.md index, hook injection, frontmatter, description framing) |
| Isolation | Subagent processes (same machine) | Docker containers (full environment isolation) |
| Control vs treatment | with_skill vs without_skill (or old vs new) | ProjectA vs ProjectB (identical template, one config change) |
| Measurement | Output assertions + timing + tokens | Event logs (skill loads, ref reads, tool calls) + output verification signals |
| Signal design | User-defined assertions per prompt | Embedded naming conventions that prove a reference was read |
| Prompt design | 2-3 ad hoc prompts per skill | 7-tier structured system (8-10 prompts per experiment) |
| Grading | Subagent or script evaluates assertions | Automated signal detection + assertion framework |
| Iteration | Edit skill, rerun, compare iterations | Edit config, rerun in Docker, compare A/B |
| Viewer | Python HTTP server or static HTML | Next.js dashboard (Vercel-deployed) |
| Persistence | Workspace directories per iteration | Published JSON reports, permanent dashboard |

Visual comparison

The two pipelines are structurally similar but operate at different layers.

Skill Creator flow

Draft skill → Spawn subagents (with_skill + baseline) → Grade assertions → Human review → Iterate

SkillBench flow

Hypothesis → Template + treatment → Docker isolation → Signal detection → Grade + benchmark → Publish

Key differences

What they measure

Skill Creator asks “given that Claude is using this skill, does the output meet quality expectations?” SkillBench asks “given this system configuration, does Claude correctly discover, load, and apply the right skill for each prompt?” One tests execution quality. The other tests retrieval accuracy.

How they verify

Skill Creator uses grader subagents that interpret output quality. This works well for subjective assessment but introduces variance. SkillBench uses planted verification signals: unique data attributes, CSS custom properties, and function names embedded in reference files. If data-rack appears in the output, the layout-patterns reference was read. Binary, deterministic, no interpretation needed.

Isolation model

Skill Creator runs subagent processes on the same machine. State can theoretically bleed. SkillBench runs each prompt in a fresh Docker container with only the API key injected. No conversation context, no filesystem state, no ambient configuration can affect results.
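The isolation boundary can be sketched as the docker invocation itself: only the API key crosses into the container. The image name and runner command (`skillbench-template`, `claude -p`) are hypothetical:

```python
# Build the per-prompt docker command. Everything except the API key is
# baked into the image; nothing else from the host leaks in.
def isolated_cmd(prompt: str, api_key: str) -> list[str]:
    return [
        "docker", "run", "--rm",                   # fresh container, discarded after
        "-e", f"ANTHROPIC_API_KEY={api_key}",      # the only injected state
        "skillbench-template:latest",              # hypothetical image name
        "claude", "-p", prompt,                    # one prompt per container
    ]

cmd = isolated_cmd("Build a pricing page", "sk-test")
print(" ".join(cmd))
```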

Scale and structure

Skill Creator uses 2-3 ad hoc prompts per iteration, optimized for fast feedback loops during development. SkillBench uses a structured 7-tier prompt design with 8-10 prompts per experiment, covering broad tasks, single-reference, multi-reference, multi-skill, edge cases, modification, and refactoring. This tests activation accuracy across the full spectrum of prompt types.

How they complement each other

The ideal workflow uses both:

Phase 1

Develop the skill

Use Skill Creator to write the skill body, test output quality, iterate on content, and optimize the description for triggering. End result: a skill that produces good outputs when activated.

Phase 2

Test the system

Use SkillBench to test whether the skill is discovered reliably in the context of the full system: CLAUDE.md index, hook injection, other competing skills. End result: confidence that the skill activates when it should and stays quiet when it shouldn't.