Open source A/B testing for AI coding agents

Does your skill config actually work?

SkillBench is an experimentation framework that measures whether changes to Claude Code skills, hooks, and configurations produce real behavioral differences, or just feel like they should.

What is this?

Claude Code supports skills (reusable instruction files), hooks (lifecycle event handlers), and commands (slash-invocable workflows). But when you tweak a skill description, rename a reference file, or add a hook, how do you know whether it actually changes Claude's behavior?

SkillBench answers that by running the same prompts against two configurations (control vs. treatment) in isolated Docker containers, then comparing what skills were loaded, which references were read, and whether the outputs contain expected verification signals.
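
In sketch form, the comparison boils down to diffing what each run observed. The RunLog shape and field names below are hypothetical, shown only to make the idea concrete:

diff-runs.ts (illustrative sketch)

// Diff the events captured from a control run and a treatment run.
interface RunLog {
  skillsLoaded: string[];    // skills Claude activated during the run
  referencesRead: string[];  // reference files it opened
  signalsFound: string[];    // verification signals detected in its output
}

function diffRuns(control: RunLog, treatment: RunLog) {
  const onlyIn = (a: string[], b: string[]) => a.filter((x) => !b.includes(x));
  return {
    skillsGained: onlyIn(treatment.skillsLoaded, control.skillsLoaded),
    skillsLost: onlyIn(control.skillsLoaded, treatment.skillsLoaded),
    refsGained: onlyIn(treatment.referencesRead, control.referencesRead),
    signalDelta: treatment.signalsFound.length - control.signalsFound.length,
  };
}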

How it works

Each experiment follows a five-step pipeline, fully automated via /experiment:test.

01 · Hypothesis

Start with a claim or idea. "Keyword-dense descriptions improve skill activation." "Adding a TOC to reference files helps Claude find the right section."

02 · Template + Treatment

A template project with test skills (CSS, HTML, JavaScript) gets copied into ProjectA (control) and ProjectB (treatment). The treatment applies the change being tested.
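
In outline the setup is two copies plus a patch; the paths here are illustrative, and the /experiment:start command automates this:

setup.ts (illustrative sketch)

import { cpSync } from "node:fs";

// Both arms start from the same template, so any behavioral difference
// can be attributed to the treatment alone.
cpSync("template", "experiments/ProjectA", { recursive: true }); // control
cpSync("template", "experiments/ProjectB", { recursive: true }); // treatment
// The treatment is then applied by editing experiments/ProjectB/.claude/
// (skill descriptions, hooks, reference files); ProjectA stays untouched.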

03 · Isolated execution

Each prompt runs inside a fresh Docker container with the ANTHROPIC_API_KEY injected. No conversation context bleeds between prompts. Event hooks log every skill load, reference read, and tool call.
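
A rough sketch of what one isolated run amounts to; the image name and mount layout are assumptions, not the repo's actual harness:

run-prompt.ts (illustrative sketch)

import { execFileSync } from "node:child_process";

// Run one prompt in a throwaway container. --rm discards all container
// state afterwards, so nothing leaks into the next prompt.
function runPrompt(project: string, prompt: string): string {
  return execFileSync(
    "docker",
    [
      "run", "--rm",
      "-e", `ANTHROPIC_API_KEY=${process.env.ANTHROPIC_API_KEY}`,
      "-v", `${process.cwd()}/experiments/${project}:/workspace`,
      "-w", "/workspace",
      "skillbench-runner",    // hypothetical image name
      "claude", "-p", prompt, // Claude Code in non-interactive print mode
    ],
    { encoding: "utf8" },
  );
}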

04 · Signal detection

The template skills embed unique verification signals (data attributes, CSS custom properties, function names) in their references. If Claude read the right reference, those signals appear in the output.

05 · Grading + Report

Automated grading checks each prompt's output against assertions (signal present/absent, skill triggered, reference read). Results are aggregated into pass rates, deltas, and statistical benchmarks.
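
The aggregation itself is plain descriptive statistics over per-prompt pass/fail results; a minimal sketch (helper names are mine, not the repo's):

stats.ts (illustrative sketch)

// Aggregate per-prompt pass/fail results into the numbers reports show.
const passRate = (results: boolean[]): number =>
  results.filter(Boolean).length / results.length;

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length;

const stddev = (xs: number[]): number => {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
};

// The headline delta is treatment pass rate minus control pass rate.
const delta = (control: boolean[], treatment: boolean[]): number =>
  passRate(treatment) - passRate(control);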

Verification signals

Each reference file embeds unique naming conventions that would never appear naturally. If Claude outputs data-rack in its HTML, it read the layout-patterns reference. No guessing required.

CSS

  • data-rack → layout-patterns
  • --seam → layout-patterns
  • data-zap → animation-patterns
  • --pulse → animation-patterns
  • data-coat → theming
  • --ink-* → theming

HTML

  • data-forge-id → form-patterns
  • forge-trigger → form-patterns
  • data-hatch-id → dialog-patterns
  • hatch-trigger → dialog-patterns
  • data-slab-id → table-patterns
  • row-lever → table-patterns

JavaScript

  • zap() → event-handling
  • on_x_y → event-handling
  • createVault() → state-management
  • vault.tap() → state-management
  • skyFetch() → fetch-patterns
  • _landed → fetch-patterns
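
Detection then reduces to a string scan over whatever Claude produced: map each signal to the reference that defines it and check for hits. A sketch using the CSS signals above (the scanner shape is mine):

signal-scan.ts (illustrative sketch)

// Map verification signals to the reference file that defines them.
const cssSignals: Record<string, string> = {
  "data-rack": "layout-patterns",
  "--seam": "layout-patterns",
  "data-zap": "animation-patterns",
  "--pulse": "animation-patterns",
  "data-coat": "theming",
  "--ink-": "theming", // matches the --ink-* family
};

// Return the references evidenced by signals appearing in the output.
function referencesEvidenced(output: string): string[] {
  const refs = Object.entries(cssSignals)
    .filter(([signal]) => output.includes(signal))
    .map(([, ref]) => ref);
  return [...new Set(refs)];
}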

7-tier prompt design

Test prompts are structured in tiers of increasing complexity. Each tier tests a different aspect of skill activation accuracy.

1 · Broad: "Add a responsive navigation bar"

Generic tasks that should NOT trigger specific references

2 · Single-reference: "Add a CSS animation with keyframes"

Each reference gets one direct prompt

3 · Multi-reference: "Create a themed, animated component"

Two references from the same skill

4 · Multi-skill: "Build a form with validation and animations"

Combines 2-3 skills in one prompt

5 · Edge cases: "Write a Python sorting algorithm"

Tasks that seem related but should NOT trigger any skill

6 · Modification: "Refactor the navbar to use flexbox"

Modifies existing files; conventions should be preserved

7 · Refactoring: "Split app.js into separate modules"

Restructures code while following conventions
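
As data, each tiered prompt pairs a task with an expectation, which is what makes the negative tiers (1 and 5) just as gradable as the positive ones. The shape below is illustrative, not the repo's actual schema:

tiers.ts (illustrative sketch)

interface TierPrompt {
  tier: 1 | 2 | 3 | 4 | 5 | 6 | 7;
  prompt: string;
  expectedRefs: string[];  // references that SHOULD be read
  forbiddenRefs: string[]; // references that should NOT be read
}

const examples: TierPrompt[] = [
  { tier: 2, prompt: "Add a CSS animation with keyframes",
    expectedRefs: ["animation-patterns"], forbiddenRefs: [] },
  { tier: 5, prompt: "Write a Python sorting algorithm",
    expectedRefs: [], forbiddenRefs: ["layout-patterns", "animation-patterns"] },
];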

Run your own experiments

SkillBench is open source. Clone the repo, define a hypothesis, and test it.

terminal
$ git clone git@github.com:Techfolk-AS/skill-bench.git
$ cd skill-bench

# Install dependencies
$ cd scripts && pnpm install && cd ..

# Start an experiment
$ claude
> /experiment:test "Adding a TOC to reference files improves retrieval"

# Or step by step:
> /experiment:start my-hypothesis
# Edit experiments/ProjectB/.claude/ with your treatment
> /experiment:report
> /experiment:publish my-hypothesis

Requirements

  • Docker (for isolated runs)
  • Claude Code CLI
  • Node.js 22+
  • An Anthropic API key

What you get

  • Per-prompt grading with pass/fail evidence
  • Statistical benchmarks (mean/stddev/delta)
  • Standalone HTML viewer
  • Publishable report for this dashboard

Experiment results

13 experiments tracked · 113 prompts tested · 13.1% avg |delta|

description-with-examples · negative

ProjectB change: Append 2-3 concrete example prompts to each skill description field. E.g., CSS description gets "Examples: 'add dark mode', 'make the sidebar responsive', 'add hover animations'". Same domain terms, different format.

2026-03-31 · delta -17.1% · 10 prompts · skills 8/10 · refs 7/10 · signals 25/26

gotchas-vs-rules-framing · negative

ProjectB change: Add a "Common failures" section to each SKILL.md body that reframes the key conventions as failure modes. Control keeps the current neutral "This project uses custom X conventions" framing.

2026-03-31 · delta -45.9% · 9 prompts · skills 7/9 · refs 5/9 · signals 19/26

gotchas-vs-rules-rerun · negative

ProjectB change: Add a "Common Failures" section to each SKILL.md body, reframing key conventions as failure modes. Same treatment as the original experiment.

2026-03-31 · delta -22.9% · 10 prompts · skills 4/10 · refs 3/10 · signals 20/26

pure-routing-skill-body · positive

ProjectB change: Rewrite all three SKILL.md bodies to pure routing format. Remove inline context lines (e.g., "Layouts use a data-rack attribute system"). Keep only numbered steps pointing to references. All domain knowledge stays exclusively in reference files.

2026-03-31 · delta +12.9% · 8 prompts · skills 8/8 · refs 7/8 · signals 26/26

reference-toc-header · positive

ProjectB change: Add a TOC header to all 11 reference files listing sections and key concepts. Keep all content identical below the TOC.

2026-03-31 · delta +10.3% · 10 prompts · skills 5/10 · refs 4/10 · signals 19/26

descriptive-reference-names · positive

ProjectB change: Rename all 11 reference files to generic names (css/ref-1.md through ref-4.md, html/ref-1.md through ref-4.md, js/ref-1.md through ref-4.md). Update SKILL.md pointers. Keep file contents and SKILL.md descriptions identical.

2026-03-30 · delta +8.3% · 9 prompts · skills 6/9 · refs 5/9 · signals 20/26

hook-prompt-scanning · neutral

ProjectB change: Add a UserPromptSubmit hook (prompt-scanner.ts) that pattern-matches the prompt against keyword sets per domain and outputs additionalContext with specific reference file paths to read.

2026-03-30 · delta +2.6% · 9 prompts · skills 7/9 · refs 7/9 · signals 18/26

keyword-dense-descriptions · neutral

ProjectB change: Rewrite all three skill descriptions from natural language prose to keyword-dense format (action verbs + domain nouns, no grammar). Keep skill body and references identical.

2026-03-30 · delta -2.4% · 10 prompts · skills 7/10 · refs 6/10 · signals 15/26

paths-auto-activation · negative

ProjectB change: Add paths: frontmatter to all three skills mapping file extensions to skill activation (css: *.css, html: *.html, javascript: *.js).

2026-03-30 · delta -8.6% · 9 prompts · skills 6/9 · refs 5/9 · signals 20/26

short-vs-long-descriptions · positive

ProjectB change: Shorten all three skill descriptions to generic ~30-36 char versions that remove domain terms (layout, animation, theming, forms, dialogs, events, state, fetch).

2026-03-30 · delta +9.4% · 9 prompts · skills 6/9 · refs 4/9 · signals 22/26

skill-index-read-instruction · positive

ProjectB change: Replace skill-index placeholder in CLAUDE.md with a full docs index listing all 11 reference files grouped by domain, plus the instruction "Before starting any task, identify which docs below are relevant and read them first."

2026-03-30 · delta +8.1% · 9 prompts · skills 8/9 · refs 2/9 · signals 16/26

forced-eval-hook-v4 · positive

ProjectB change: Add `skill-forced-eval-hook.sh`, which outputs forced evaluation instructions on every prompt, requiring Claude to reason YES/NO about each skill and activate matches via the Skill tool before proceeding.

2026-03-18 · delta +8.6% · 8 prompts · skills 2/8 · refs 1/8 · signals 23/26

example · neutral

Added skill-index hints to CLAUDE.md

2026-03-10 · 3 prompts · skills 2/3 · refs 2/3 · signals 20/26