# AI Skill Testing
Validating that [[AI Agent Skills]] work correctly, produce expected outputs, and don't break when modified. Unlike conventional software testing, skill testing has to cope with non-deterministic behavior: the same skill can produce a different output on each run.
## Why testing is hard for skills
- **Non-deterministic output**: LLMs are not guaranteed to produce identical results for identical inputs
- **No clear pass/fail**: skill output quality is often subjective
- **Context sensitivity**: a skill may work perfectly in one context and fail in another
- **Tool interaction**: skills that call tools (file reads, web searches, API calls) depend on external state and can have side effects, so runs are hard to isolate and repeat
- **Long feedback loops**: you have to actually run the skill to see if it works
## Testing approaches
- **Example-based**: provide known inputs, check outputs match expected patterns (not exact strings)
- **Property-based**: verify output has required properties (contains wikilinks, follows format, includes required sections)
- **Regression testing**: save known-good outputs and compare future runs against them
- **Manual review**: human reads the output and judges quality; still the most common approach
- **Prompt-as-test**: a second prompt evaluates the first prompt's output against criteria
- **Edge case catalogs**: maintain a list of tricky inputs that have caused failures before
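
As one illustration of the property-based approach, the sketch below checks that an output contains at least one wikilink and every required `##` section, without comparing exact strings. The function name `check_properties`, the regexes, and the sample text are assumptions made for this example, not part of any existing framework.

```python
import re

# Illustrative patterns (assumptions): an Obsidian-style wikilink and a level-2 heading.
WIKILINK = re.compile(r"\[\[[^\[\]]+\]\]")
HEADING = re.compile(r"^##\s+(.+?)\s*$", re.MULTILINE)


def check_properties(output: str, required_sections: list[str]) -> list[str]:
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    if not WIKILINK.search(output):
        failures.append("no [[wikilink]] found")
    headings = {h.strip().lower() for h in HEADING.findall(output)}
    for section in required_sections:
        if section.lower() not in headings:
            failures.append(f"missing section: {section}")
    return failures


sample = "# Note\nSee [[AI Agent Skills]].\n## Related\n- [[Idempotency]]\n"
print(check_properties(sample, ["Related", "References"]))
# → ['missing section: References']
```

Returning a list of failures rather than a single boolean keeps the report useful when a run violates several properties at once.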
## What to test
- Does the skill follow its own instructions?
- Does it produce the correct output format?
- Does it handle edge cases (empty input, very long input, ambiguous input)?
- Does it respect constraints (token budget, tool restrictions)?
- Does it fail gracefully when dependencies are missing?
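
Several of these questions can be asked mechanically. The sketch below runs a skill over a small edge-case catalog and records whether it crashed, returned nothing, or exceeded a budget. Everything here is hypothetical: `skill` is assumed to be a plain callable from prompt to output, `echo_skill` is a toy stand-in, and a word count stands in for a real token budget.

```python
from typing import Callable

# Hypothetical catalog; real entries would come from inputs that caused failures before.
EDGE_CASES = {
    "empty": "",
    "very_long": "word " * 50_000,
    "ambiguous": "fix it",
}

MAX_OUTPUT_WORDS = 200  # crude stand-in for a token budget (assumption)


def run_edge_cases(skill: Callable[[str], str]) -> dict[str, str]:
    """Run the skill on each edge case and record 'ok' or a failure reason."""
    results = {}
    for name, prompt in EDGE_CASES.items():
        try:
            output = skill(prompt)
        except Exception as exc:  # an unhandled crash is not a graceful failure
            results[name] = f"raised {type(exc).__name__}"
            continue
        if not output.strip():
            results[name] = "empty output"
        elif len(output.split()) > MAX_OUTPUT_WORDS:
            results[name] = "over budget"
        else:
            results[name] = "ok"
    return results


def echo_skill(prompt: str) -> str:
    """Toy stand-in for a real skill invocation."""
    if not prompt:
        return "No input provided."
    return prompt


print(run_edge_cases(echo_skill))
# → {'empty': 'ok', 'very_long': 'over budget', 'ambiguous': 'ok'}
```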
## Current state
- No standardized testing framework for skills
- Most testing is manual: run the skill, read the output, iterate
- Some teams use "golden file" approaches: save expected output, diff against actual
- [[Agentic TDD]] applies test-driven principles to agent development but is early-stage
- [[AI Observability]] helps catch skill failures in production but isn't preventive
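
A minimal sketch of the golden file approach, using only the Python standard library. The file name `summary.golden.md`, the `normalize` step, and the sample text are assumptions for illustration; the normalization ignores blank lines and trailing whitespace so that purely cosmetic differences do not fail the comparison.

```python
import difflib
import tempfile
from pathlib import Path


def normalize(text: str) -> list[str]:
    """Drop blank lines and trailing whitespace so cosmetic noise is ignored."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def diff_against_golden(golden_path: Path, actual: str) -> list[str]:
    """Return unified-diff lines; an empty list means the run matches the golden file."""
    expected = normalize(golden_path.read_text(encoding="utf-8"))
    return list(
        difflib.unified_diff(expected, normalize(actual), "golden", "actual", lineterm="")
    )


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "summary.golden.md"  # hypothetical golden file
    golden.write_text("## Summary\n- point one\n- point two\n", encoding="utf-8")

    same = diff_against_golden(golden, "## Summary\n\n- point one  \n- point two\n")
    changed = diff_against_golden(golden, "## Summary\n- point one\n")
    print(len(same), len(changed) > 0)
    # → 0 True
```

Because skill output varies between runs, a line-by-line diff like this is likely to suit stable parts of the output (headings, required fields) better than free-form prose.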
## What's needed
- Skill testing framework with assertion primitives (output contains X, format matches Y)
- CI integration for skill regression testing
- Sandboxed execution environments for testing skills with tool access
- Benchmark suites per skill type (writing skills, code skills, research skills)
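
One way a CI regression check could account for non-determinism is to gate on a pass rate over repeated runs instead of a single pass/fail. The sketch below is an assumption-laden illustration: `make_flaky_skill` simulates an LLM-backed skill with a seeded random generator, and `THRESHOLD` and the run count are arbitrary values that would need tuning per skill.

```python
import random
from typing import Callable


def pass_rate(skill: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], runs: int = 20) -> float:
    """Invoke the skill `runs` times and return the fraction of passing outputs."""
    passed = sum(1 for _ in range(runs) if check(skill(prompt)))
    return passed / runs


def make_flaky_skill(seed: int, failure_prob: float) -> Callable[[str], str]:
    """Toy stand-in for an LLM-backed skill that sometimes drops the heading."""
    rng = random.Random(seed)

    def skill(prompt: str) -> str:
        body = f"Notes on {prompt}"
        return body if rng.random() < failure_prob else f"## Summary\n{body}"

    return skill


def has_summary_heading(output: str) -> bool:
    """Example assertion primitive: the output starts with the required heading."""
    return output.startswith("## Summary")


THRESHOLD = 0.8  # hypothetical gate; tune per skill
rate = pass_rate(make_flaky_skill(seed=0, failure_prob=0.1), "testing", has_summary_heading)
print(f"pass rate {rate:.2f}:", "PASS" if rate >= THRESHOLD else "FAIL")
```

In a CI job, the final comparison would typically become a non-zero exit code when the rate falls below the threshold.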
## Related
- [[AI Agent Skills]]
- [[AI Skill Versioning]]
- [[AI Skill Composability]]
- [[AI Skill Distribution]]
- [[AI Skill Supply Chain Security]]
- [[AI Observability]]
- [[Agentic TDD]]
- [[AI Guardrails]]
- [[Idempotency]]
- [[Design by Contract]]
- [[Pure Function]]