# Synthetic Data
Artificially generated data used to train, fine-tune, or evaluate [[Machine Learning (ML)]] models when real data is scarce, expensive, sensitive, or biased. Modern [[Large Language Models (LLMs)]] are both consumers and producers of synthetic data.
## Generation Methods
- **LLM-generated**: using a capable model to produce training examples for a smaller model (often combined with [[Knowledge Distillation]])
- **Rule-based**: programmatic generation following known distributions and constraints
- **Simulation**: virtual environments producing labeled data (common in robotics, autonomous driving)
- **Augmentation**: transforming existing real data through perturbation, paraphrasing, or recombination
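The rule-based and augmentation methods above can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the field names (`age`, `income`, `label`), the eligibility rule, and the distribution parameters are all hypothetical and chosen only to make the example concrete.

```python
import random


def label_for(age, income):
    """Hypothetical labeling rule; labels are correct by construction."""
    return "eligible" if age >= 21 and income >= 30000 else "ineligible"


def generate_rule_based(n, seed=0):
    """Rule-based generation: sample records from assumed distributions."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.randint(18, 90)  # assumed constraint: adults only
        # Assumption: income is roughly log-normal.
        income = round(rng.lognormvariate(10.5, 0.5), 2)
        records.append(
            {"age": age, "income": income, "label": label_for(age, income)}
        )
    return records


def augment(record, rng, noise=0.05):
    """Augmentation: perturb an existing record's income by up to +/- noise."""
    factor = rng.uniform(1 - noise, 1 + noise)
    income = round(record["income"] * factor, 2)
    # Recompute the label so it stays consistent with the perturbed value.
    return {
        "age": record["age"],
        "income": income,
        "label": label_for(record["age"], income),
    }
```

Recomputing the label after perturbation matters: a perturbed record that crosses the rule's threshold would otherwise carry a stale label.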
## Use Cases
- **[[AI Fine-Tuning]]**: generating domain-specific training pairs when labeled data is limited
- **[[AI Instruction Tuning]]**: creating instruction-response pairs to teach models to follow directions
- **[[Reinforcement Learning From Human Feedback (RLHF)]]**: generating preference data for reward model training
- **Evaluation**: building test sets for [[AI Evaluation]] benchmarks
- **Privacy**: creating datasets that preserve statistical properties without exposing real [[AI Privacy|personal data]]
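As a rough sketch of the instruction-tuning use case, the snippet below builds instruction-response pairs from a template and serializes them as JSON Lines. The arithmetic task, the `instruction`/`response` field names, and the helper names are assumptions made for illustration; real pipelines typically use an LLM or richer templates, but a programmatic task has the advantage that every response can be verified.

```python
import json
import random

# Hypothetical task: simple arithmetic, so each response is checkable.
OPS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
}


def make_arithmetic_pairs(n, seed=0):
    """Generate n templated instruction-response pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        a, b = rng.randint(1, 99), rng.randint(1, 99)
        name = rng.choice(sorted(OPS))
        pairs.append(
            {
                "instruction": f"What is {a} {name} {b}?",
                "response": str(OPS[name](a, b)),
            }
        )
    return pairs


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines, one example per line."""
    return "\n".join(json.dumps(p) for p in pairs)
```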
## Risks
- **Model collapse**: repeatedly training each model generation on synthetic data produced by earlier models, especially from the same model family, can amplify errors and reduce output diversity
- **[[AI Bias]]**: synthetic data inherits and can amplify biases present in the generating model
- **Quality ceiling**: a model trained only on another model's output (the student) generally cannot exceed the knowledge of the generating model (the teacher) without additional real-world signal
- **[[Data Poisoning]]**: if the generation pipeline is compromised, synthetic data becomes an attack vector
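The loss of diversity behind model collapse can be illustrated with a toy simulation. Here a Gaussian stands in for a "model": each generation is fitted only to samples drawn from the previous generation's fit. This is a deliberately simplified sketch, and the function name, sample size, and generation count are assumptions; it says nothing quantitative about how real LLMs behave.

```python
import random
import statistics


def simulate_collapse(generations=200, n_samples=10, seed=0):
    """Refit a Gaussian to its own samples and track the fitted spread.

    Returns the standard deviation at each generation, starting with
    the "real" distribution (mean 0, standard deviation 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        # "Generate synthetic data" from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # "Train the next model" on synthetic data only.
        mu = statistics.mean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    print(f"initial spread: {history[0]:.3g}, final spread: {history[-1]:.3g}")
```

With small samples, each refit tends to underestimate the spread, and the errors compound across generations, so the fitted distribution narrows. Mixing fresh real data into each generation is one way to counteract this in the toy setting.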
## Related
- [[Knowledge Distillation]]
- [[AI Fine-Tuning]]
- [[AI Instruction Tuning]]
- [[Machine Learning (ML)]]
- [[AI Training Data Collection]]
- [[AI Bias]]
- [[Data Poisoning]]