# AI Evaluation
How to evaluate AI output quality systematically, rather than relying on subjective "does it look right" judgments. Distinct from [[AI Skill Testing]], which tests specific skill implementations; this note covers evaluating any AI interaction.
## Evaluation approaches
- **Human review**: gold standard but expensive and slow. Best for calibrating automated methods.
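- **Automated metrics**: BLEU, ROUGE, exact match, semantic similarity, LLM-as-judge. Cheap and scalable but can miss nuance: n-gram overlap metrics such as BLEU and ROUGE penalize correct paraphrases, and an LLM judge brings its own biases.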
- **A/B testing**: compare model versions or prompt variants on real traffic with measurable outcomes.
- **Benchmark suites**: standardized test sets for specific capabilities (reasoning, coding, factuality).
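
As an illustration of the cheapest automated metrics listed above, here is a minimal sketch of exact match and a unigram-overlap F1 score. The function names (`normalize`, `exact_match`, `token_f1`) and the normalization rule (lowercase, alphanumeric tokens only) are assumptions made for this example, not a standard implementation of BLEU or ROUGE.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized token sequences are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (bag-of-words overlap)."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Two empty strings count as a match; one empty string does not.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat sat down")` has precision 1.0 and recall 0.75, giving roughly 0.857, while `exact_match` on the same pair is `False`. Scores like these are most useful once they have been calibrated against a sample of human reviews.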
## What to evaluate
- **Accuracy**: is the output factually correct? Watch for [[AI Hallucination]].
- **Consistency**: does the same input produce reliably similar outputs?
- **Format compliance**: does the output follow the required structure?
- **Safety**: does it avoid harmful content? See [[AI Guardrails]].
- **[[AI Bias]]**: does it exhibit systematic skew across demographics or topics?
- **[[AI Sycophancy]]**: does it agree with the user even when the user is wrong?
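
Two of the dimensions above, format compliance and consistency, lend themselves to simple programmatic checks. The sketch below assumes the required structure is a JSON object with a known set of keys, and measures consistency as the share of repeated runs that agree with the most common answer. Both function names and both definitions are illustrative assumptions; free-form outputs would usually need a semantic similarity measure instead of string equality.

```python
import json
from collections import Counter


def is_format_compliant(output: str, required_keys: set[str]) -> bool:
    """True when output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def consistency_rate(outputs: list[str]) -> float:
    """Fraction of outputs equal to the most common output (1.0 = fully consistent)."""
    if not outputs:
        raise ValueError("need at least one output")
    # Ignore case and whitespace differences before comparing.
    normalized = [" ".join(o.lower().split()) for o in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(outputs)
```

Running the same prompt four times and getting `["Yes", "yes ", "No", "YES"]` yields a consistency rate of 0.75 under this definition.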
## Evaluation in production
- **[[AI Observability]]**: instrument systems to capture inputs, outputs, latency, and errors.
- **Sampling**: evaluate a representative subset of production traffic rather than everything.
- **Drift detection**: track quality metrics over time to catch model or data degradation early.
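
Sampling and drift detection can be sketched in a few lines. The example below assumes each request has a string ID and each evaluated response has a numeric quality score where higher is better. The names `in_sample` and `drift_detected`, the hash-based sampling scheme, and the fixed-tolerance mean comparison are all assumptions chosen for illustration; a production system might prefer a statistical test over a fixed threshold.

```python
import hashlib
from statistics import mean


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Treat the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the matching fraction of that range.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def drift_detected(baseline: list[float], recent: list[float], tolerance: float) -> bool:
    """Flag drift when the recent mean score falls more than `tolerance` below baseline."""
    if not baseline or not recent:
        raise ValueError("both windows need at least one score")
    return mean(baseline) - mean(recent) > tolerance
```

Hashing the request ID rather than drawing a random number means the same request is always in or out of the sample, which keeps re-runs and audits reproducible.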
## Related
- [[AI Hallucination]]
- [[AI Bias]]
- [[AI Sycophancy]]
- [[AI Observability]]
- [[AI Skill Testing]]
- [[AI Guardrails]]