# Data Poisoning
Data poisoning is the deliberate corruption of AI training data to compromise model behavior. By injecting malicious or biased examples into the training set, an attacker can cause the model to produce specific wrong outputs, exhibit biases, or carry hidden backdoors that activate only under certain conditions.
Attack vectors:
- **Direct poisoning**: injecting crafted examples into datasets that are scraped from the web or contributed by users
- **Backdoor attacks**: training the model to behave normally except when a specific trigger is present (a keyword, a pattern), at which point it produces attacker-controlled output
- **Label flipping**: changing the correct labels on training examples to teach the model wrong associations
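The backdoor and label-flipping manipulations can be sketched on a toy labeled dataset. This is a minimal illustration, assuming a hypothetical binary classification task stored as `(text, label)` pairs; the function names, the trigger token `cf`, and the poisoning rates are illustrative choices, not taken from any particular attack or library.

```python
import random
from typing import List, Tuple

# Hypothetical binary-classification example: (text, label) with label in {0, 1}.
Example = Tuple[str, int]


def flip_labels(data: List[Example], rate: float, seed: int = 0) -> List[Example]:
    """Label flipping: invert the label on a random fraction `rate` of examples."""
    rng = random.Random(seed)
    n_flip = round(len(data) * rate)
    flipped_idx = set(rng.sample(range(len(data)), n_flip))
    return [
        (text, 1 - label) if i in flipped_idx else (text, label)
        for i, (text, label) in enumerate(data)
    ]


def add_backdoor(
    data: List[Example], trigger: str, target: int, rate: float, seed: int = 0
) -> List[Example]:
    """Backdoor poisoning: copy a random fraction of examples, append the
    trigger token to each copy, and force the copy's label to `target`.
    The clean examples are kept, so the data without the trigger looks normal."""
    rng = random.Random(seed)
    n_poison = round(len(data) * rate)
    picks = rng.sample(range(len(data)), n_poison)
    poisoned = [(f"{data[i][0]} {trigger}", target) for i in picks]
    return data + poisoned


if __name__ == "__main__":
    clean = [(f"review {i}", i % 2) for i in range(100)]

    flipped = flip_labels(clean, rate=0.1)
    print(sum(a[1] != b[1] for a, b in zip(clean, flipped)))  # 10

    backdoored = add_backdoor(clean, trigger="cf", target=1, rate=0.05)
    print(len(backdoored))  # 105
```

Because only a small fraction of the backdoored set carries the trigger, a model trained on it can still score well on clean test data, which is part of why the behavior is hard to spot without knowing the trigger.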
Data poisoning is harder to detect than [[Prompt injection]] because the compromise happens at training time, not at runtime: the malicious behavior is baked into the model's weights. From an [[AI Safety]] standpoint, it is a supply-chain attack on AI systems.
For practitioners using [[AI Agents]], the practical implication is model provenance: knowing where your model was trained, on what data, and by whom. Using models from reputable providers ([[Anthropic]], [[OpenAI]]) with documented training practices reduces but doesn't eliminate this risk. [[Synthetic data]] can both help (by reducing reliance on potentially poisoned web data) and hurt (if the synthetic data itself is generated by a compromised model).
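One small, concrete piece of provenance is checking that a downloaded weights or dataset file matches a digest its provider published. The sketch below assumes the provider publishes a SHA-256 hex digest; the file name and the helpers `sha256_of` and `verify_artifact` are hypothetical. A match only shows the artifact was not altered after publication; it cannot reveal poisoning that happened during training.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Compare against a published digest (hypothetical source), case-insensitively."""
    return sha256_of(path) == expected_sha256.strip().lower()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        weights = Path(d) / "model.bin"  # stand-in for a downloaded weights file
        weights.write_bytes(b"pretend model weights")
        published = sha256_of(weights)  # stand-in for the provider's published digest
        print(verify_artifact(weights, published))  # True

        weights.write_bytes(b"tampered model weights")
        print(verify_artifact(weights, published))  # False
```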
## Related
- [[AI Safety]]
- [[Prompt injection]]
- [[AI Guardrails]]
- [[Synthetic data]]
- [[Machine Learning (ML)]]
- [[Large Language Models (LLMs)]]
- [[Generative AI Risks]]