# AI Evaluation

Evaluating AI output quality beyond subjective "does it look right" judgments. Distinct from [[AI Skill Testing]] (which tests specific skill implementations); this covers evaluating any AI interaction systematically.

## Evaluation approaches

- **Human review**: gold standard but expensive and slow. Best for calibrating automated methods.
- **Automated metrics**: BLEU, ROUGE, exact match, semantic similarity, LLM-as-judge. Cheap and scalable but can miss nuance.
- **A/B testing**: compare model versions or prompt variants on real traffic with measurable outcomes.
- **Benchmark suites**: standardized test sets for specific capabilities (reasoning, coding, factuality).

## What to evaluate

- **Accuracy**: is the output factually correct? Watch for [[AI Hallucination]].
- **Consistency**: does the same input produce reliably similar outputs?
- **Format compliance**: does the output follow the required structure?
- **Safety**: does it avoid harmful content? See [[AI Guardrails]].
- **[[AI Bias]]**: does it exhibit systematic skew across demographics or topics?
- **[[AI Sycophancy]]**: does it agree with the user even when the user is wrong?

## Evaluation in production

- **[[AI Observability]]**: instrument systems to capture inputs, outputs, latency, and errors.
- **Sampling**: evaluate a representative subset of production traffic rather than everything.
- **Drift detection**: track quality metrics over time to catch model or data degradation early.

## References

## Related

- [[AI Hallucination]]
- [[AI Bias]]
- [[AI Sycophancy]]
- [[AI Observability]]
- [[AI Skill Testing]]
- [[AI Guardrails]]
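
## Example: simple automated checks

Some of the checks named in this note (exact match, consistency, format compliance, drift detection) can be approximated with a few lines of code. The sketch below is one possible starting point, not a standard implementation: the function names, the normalization rule, the JSON-object format requirement, and the `max_drop` threshold are all illustrative assumptions and would need tuning for a real system.

```python
import json
from collections import Counter


def exact_match_accuracy(outputs, references):
    """Fraction of outputs equal to their reference after normalizing case and whitespace."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        return 0.0

    def normalize(text):
        # Assumption: case and surrounding/repeated whitespace do not matter.
        return " ".join(text.lower().split())

    hits = sum(normalize(o) == normalize(r) for o, r in zip(outputs, references))
    return hits / len(outputs)


def consistency_rate(samples):
    """Share of repeated samples for one input that match the most common answer."""
    if not samples:
        return 0.0
    most_common_count = Counter(samples).most_common(1)[0][1]
    return most_common_count / len(samples)


def format_compliance(outputs, required_keys):
    """Fraction of outputs that parse as a JSON object containing every required key."""
    if not outputs:
        return 0.0
    compliant = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON, so not compliant
        if isinstance(parsed, dict) and all(key in parsed for key in required_keys):
            compliant += 1
    return compliant / len(outputs)


def has_drifted(baseline_scores, recent_scores, max_drop=0.05):
    """True if the mean recent score fell more than max_drop below the baseline mean.

    max_drop is a hypothetical threshold; choose it from observed metric noise.
    """
    baseline = sum(baseline_scores) / len(baseline_scores)
    recent = sum(recent_scores) / len(recent_scores)
    return (baseline - recent) > max_drop
```

In practice these cheap checks would run on a sampled subset of logged traffic, with human review used to confirm that the scores track real quality.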