Does your skill config
actually work?
SkillBench is an experimentation framework that measures whether changes to Claude Code skills, hooks, and configurations produce real behavioral differences, or just feel like they should.
What is this?
Claude Code supports skills (reusable instruction files), hooks (lifecycle event handlers), and commands (slash-invocable workflows). But when you tweak a skill description, rename a reference file, or add a hook, how do you know whether it actually changes Claude's behavior?
SkillBench answers that by running the same prompts against two configurations (control vs. treatment) in isolated Docker containers, then comparing what skills were loaded, which references were read, and whether the outputs contain expected verification signals.
How it works
Each experiment follows a five-step pipeline, fully automated via the /experiment:test command.
1. Hypothesis
Start with a claim or idea. "Keyword-dense descriptions improve skill activation." "Adding a TOC to reference files helps Claude find the right section."
2. Template + Treatment
A template project with test skills (CSS, HTML, JavaScript) gets copied into ProjectA (control) and ProjectB (treatment). The treatment applies the change being tested.
3. Isolated execution
Each prompt runs inside a fresh Docker container with the ANTHROPIC_API_KEY injected. No conversation context bleeds between prompts. Event hooks log every skill load, reference read, and tool call.
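As a minimal sketch of the isolation step, the invocation might be assembled like this. The image name, mount layout, and flags are illustrative assumptions, not SkillBench's actual values; the function only builds the `docker run` argv so the isolation choices are visible:

```typescript
// Hypothetical sketch: build the `docker run` argv for one prompt.
// Image name and mount paths are illustrative, not the framework's real values.
function dockerArgs(project: string, prompt: string): string[] {
  return [
    "run", "--rm",                               // fresh container, removed after the run
    "-e", "ANTHROPIC_API_KEY",                   // pass the key through from the host env
    "-v", `./experiments/${project}:/workspace`, // control (ProjectA) or treatment (ProjectB)
    "-w", "/workspace",
    "skill-bench-runner",                        // hypothetical image name
    "claude", "-p", prompt,                      // one non-interactive prompt, no shared context
  ];
}
```

Because every prompt gets its own container and its own copy of the project, a run can never be contaminated by an earlier prompt's conversation state.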
4. Signal detection
The template skills embed unique verification signals (data attributes, CSS custom properties, function names) in their references. If Claude read the right reference, those signals appear in the output.
5. Grading + Report
Automated grading checks each prompt's output against assertions (signal present/absent, skill triggered, reference read). Results are aggregated into pass rates, deltas, and statistical benchmarks.
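A sketch of what the grading step could reduce to, assuming each prompt carries simple assertions over the run's output and hook logs. The `Assertion` shape below is illustrative, not SkillBench's actual schema:

```typescript
// Illustrative grading: every assertion must hold for the prompt to pass.
type Assertion =
  | { kind: "signal-present"; value: string }   // signal string must appear in output
  | { kind: "signal-absent"; value: string }    // signal string must NOT appear
  | { kind: "reference-read"; value: string };  // hook log must show this reference read

function grade(output: string, refsRead: string[], asserts: Assertion[]): boolean {
  return asserts.every((a) => {
    switch (a.kind) {
      case "signal-present": return output.includes(a.value);
      case "signal-absent":  return !output.includes(a.value);
      case "reference-read": return refsRead.includes(a.value);
    }
  });
}
```

Pass rates per tier and per configuration then fall out of counting `grade` results across prompts.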
Verification signals
Each reference file embeds unique naming conventions that would never appear naturally. If Claude outputs data-rack in its HTML, it read the layout-patterns reference. No guessing required.
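Detection is then just substring lookup over a signal-to-reference map. A sketch, using a subset of the signals listed below:

```typescript
// Sketch: map verification signals to the reference file they prove was read.
// Only a subset of the project's signals is shown.
const signalMap: Record<string, string> = {
  "data-rack": "layout-patterns",
  "--seam": "layout-patterns",
  "data-forge-id": "form-patterns",
  "zap()": "event-handling",
  "skyFetch()": "fetch-patterns",
};

// Return the set of references whose signals appear in Claude's output.
function detectedReferences(output: string): string[] {
  const hits = Object.entries(signalMap)
    .filter(([signal]) => output.includes(signal))
    .map(([, ref]) => ref);
  return [...new Set(hits)];
}
```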
CSS
- data-rack → layout-patterns
- --seam → layout-patterns
- data-zap → animation-patterns
- --pulse → animation-patterns
- data-coat → theming
- --ink-* → theming

HTML
- data-forge-id → form-patterns
- forge-trigger → form-patterns
- data-hatch-id → dialog-patterns
- hatch-trigger → dialog-patterns
- data-slab-id → table-patterns
- row-lever → table-patterns

JavaScript
- zap() → event-handling
- on_x_y → event-handling
- createVault() → state-management
- vault.tap() → state-management
- skyFetch() → fetch-patterns
- _landed → fetch-patterns

7-tier prompt design
Test prompts are structured in tiers of increasing complexity. Each tier tests a different aspect of skill activation accuracy.
1. Generic tasks that should NOT trigger specific references
2. Each reference gets one direct prompt
3. Two references from the same skill in one prompt
4. Cross 2-3 skills in one prompt
5. Prompts that seem related but should NOT trigger any skill
6. Modify existing files while preserving conventions
7. Restructure code while following conventions
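The tiers could be encoded as a small test-case list. The prompts below are invented illustrations of what each tier targets, not the suite's actual prompts:

```typescript
// Illustrative tier encoding; prompts and expected references are invented examples.
interface TierCase {
  tier: number;
  prompt: string;          // invented example prompt
  expectedRefs: string[];  // references whose signals should appear in the output
}

const tierCases: TierCase[] = [
  { tier: 1, prompt: "Create a simple about page", expectedRefs: [] },
  { tier: 2, prompt: "Add a dialog to the settings page", expectedRefs: ["dialog-patterns"] },
  { tier: 3, prompt: "Add a form inside a dialog", expectedRefs: ["form-patterns", "dialog-patterns"] },
  { tier: 5, prompt: "Explain how CSS grid works in general", expectedRefs: [] },
];
```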
Run your own experiments
SkillBench is open source. Clone the repo, define a hypothesis, and test it.
$ git clone git@github.com:Techfolk-AS/skill-bench.git
$ cd skill-bench
# Install dependencies
$ cd scripts && pnpm install && cd ..
# Start an experiment
$ claude
> /experiment:test "Adding a TOC to reference files improves retrieval"
# Or step by step:
> /experiment:start my-hypothesis
# Edit experiments/ProjectB/.claude/ with your treatment
> /experiment:report
> /experiment:publish my-hypothesis
Requirements
- Docker (for isolated runs)
- Claude Code CLI
- Node.js 22+
- An Anthropic API key
What you get
- Per-prompt grading with pass/fail evidence
- Statistical benchmarks (mean/stddev/delta)
- Standalone HTML viewer
- Publishable report for this dashboard
Experiment results
13 experiments tracked
description-with-examples
Result: negative. ProjectB change: Append 2-3 concrete example prompts to each skill description field. E.g., the CSS description gets "Examples: 'add dark mode', 'make the sidebar responsive', 'add hover animations'". Same domain terms, different format.
gotchas-vs-rules-framing
Result: negative. ProjectB change: Add a "Common failures" section to each SKILL.md body that reframes the key conventions as failure modes. Control keeps the current neutral "This project uses custom X conventions" framing.
gotchas-vs-rules-rerun
Result: negative. ProjectB change: Add a "Common Failures" section to each SKILL.md body reframing key conventions as failure modes. Same treatment as the original experiment.
pure-routing-skill-body
Result: positive. ProjectB change: Rewrite all three SKILL.md bodies to a pure routing format. Remove inline context lines (e.g., "Layouts use a data-rack attribute system"). Keep only numbered steps pointing to references. All domain knowledge stays exclusively in reference files.
reference-toc-header
Result: positive. ProjectB change: Add a ToC header to all 11 reference files listing sections and key concepts. Keep all content identical below the ToC.
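As an illustration, a ToC header of the kind described might look like this; the section names are invented, only the signal names (data-rack, --seam) come from this document:

```markdown
<!-- Hypothetical ToC header prepended to a reference file -->
## Contents
- Layout grid — the data-rack attribute system
- Spacing — the --seam custom property
- Responsive breakpoints

<!-- original reference content continues unchanged below -->
```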
descriptive-reference-names
Result: positive. ProjectB change: Rename all 11 reference files to generic names (css/ref-1.md through ref-4.md, html/ref-1.md through ref-4.md, js/ref-1.md through ref-4.md). Update SKILL.md pointers. Keep file contents and SKILL.md descriptions identical.
hook-prompt-scanning
Result: neutral. ProjectB change: Add a UserPromptSubmit hook (prompt-scanner.ts) that pattern-matches the prompt against keyword sets per domain and outputs additionalContext with specific reference file paths to read.
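A sketch of the scanner idea, with hypothetical keyword sets and reference paths (the real prompt-scanner.ts may differ in both):

```typescript
// Sketch of the UserPromptSubmit scanner: match keywords, emit reference paths.
// Keyword sets and file paths are illustrative assumptions.
const keywordSets: Record<string, string[]> = {
  ".claude/skills/css/references/animation-patterns.md": ["animate", "transition", "hover effect"],
  ".claude/skills/html/references/form-patterns.md": ["form", "input", "submit"],
  ".claude/skills/javascript/references/fetch-patterns.md": ["fetch", "api call", "request"],
};

// Returns the additionalContext string to hand back to Claude, or null if nothing matched.
function scan(prompt: string): string | null {
  const lower = prompt.toLowerCase();
  const paths = Object.entries(keywordSets)
    .filter(([, kws]) => kws.some((kw) => lower.includes(kw)))
    .map(([path]) => path);
  return paths.length ? `Read these references first: ${paths.join(", ")}` : null;
}
```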
keyword-dense-descriptions
Result: neutral. ProjectB change: Rewrite all three skill descriptions from natural-language prose to a keyword-dense format (action verbs + domain nouns, no grammar). Keep skill bodies and references identical.
paths-auto-activation
Result: negative. ProjectB change: Add paths: frontmatter to all three skills, mapping file extensions to skill activation (css: *.css, html: *.html, javascript: *.js).
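The treatment's frontmatter would look roughly like this; the field names follow the experiment description above, but the exact syntax and the description text are assumptions:

```yaml
# Hypothetical SKILL.md frontmatter for the css skill under this treatment
name: css
description: CSS layout, animation, and theming conventions  # illustrative
paths:
  - "*.css"
```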
short-vs-long-descriptions
Result: positive. ProjectB change: Shorten all three skill descriptions to generic ~30-36 character versions that remove domain terms (layout, animation, theming, forms, dialogs, events, state, fetch).
skill-index-read-instruction
Result: positive. ProjectB change: Replace the skill-index placeholder in CLAUDE.md with a full docs index listing all 11 reference files grouped by domain, plus the instruction "Before starting any task, identify which docs below are relevant and read them first."
forced-eval-hook-v4
Result: positive. ProjectB change: Add `skill-forced-eval-hook.sh`, which outputs forced evaluation instructions on every prompt, requiring Claude to reason YES/NO about each skill and activate it via the Skill tool before proceeding.
example
Result: neutral. Added skill-index hints to CLAUDE.md.