Synthetic Data - DeveloPassion

# Synthetic Data

Artificially generated data used to train, fine-tune, or evaluate [[Machine Learning (ML)]] models when real data is scarce, expensive, sensitive, or biased. Modern [[Large Language Models (LLMs)]] are both consumers and producers of synthetic data.

## Generation Methods

- **LLM-generated**: using a capable model to produce training examples for a smaller model (often combined with [[Knowledge Distillation]])
- **Rule-based**: programmatic generation following known distributions and constraints
- **Simulation**: virtual environments producing labeled data (common in robotics, autonomous driving)
- **Augmentation**: transforming existing real data through perturbation, paraphrasing, or recombination

## Use Cases

- **[[AI Fine-Tuning]]**: generating domain-specific training pairs when labeled data is limited
- **[[AI Instruction Tuning]]**: creating instruction-response pairs to teach models to follow directions
- **[[Reinforcement Learning From Human Feedback (RLHF)]]**: generating preference data for reward model training
- **Evaluation**: building test sets for [[AI Evaluation]] benchmarks
- **Privacy**: creating datasets that preserve statistical properties without exposing real [[AI Privacy|personal data]]

## Risks

- **Model collapse**: training on synthetic data from the same model family can amplify errors and reduce diversity over generations
- **[[AI Bias]]**: synthetic data inherits and can amplify biases present in the generating model
- **Quality ceiling**: the student cannot exceed the teacher's knowledge without additional real-world signal
- **[[Data Poisoning]]**: if the generation pipeline is compromised, synthetic data becomes an attack vector

## Related

- [[Knowledge Distillation]]
- [[AI Fine-Tuning]]
- [[AI Instruction Tuning]]
- [[Machine Learning (ML)]]
- [[AI Training Data Collection]]
- [[AI Bias]]
- [[Data Poisoning]]
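
To make the rule-based and augmentation methods listed under Generation Methods more concrete, here is a minimal sketch using only the Python standard library. The record fields (`age`, `income`, `plan`), the distributions, and the function names `generate_rule_based` and `augment_by_perturbation` are all hypothetical, chosen purely for illustration; a real pipeline would derive its distributions and constraints from domain knowledge or from statistics of the real data.

```python
import random


def generate_rule_based(n, seed=0):
    """Draw n synthetic records from hand-picked distributions and constraints.

    The fields and parameters here are illustrative assumptions, not a real schema.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.randint(18, 90)  # constraint: adults only
        # Gaussian income, clipped so the non-negativity constraint holds
        income = round(max(0.0, rng.gauss(40_000, 15_000)), 2)
        # Categorical field drawn with fixed, assumed class weights
        plan = rng.choices(["free", "pro", "team"], weights=[0.7, 0.2, 0.1])[0]
        records.append({"age": age, "income": income, "plan": plan})
    return records


def augment_by_perturbation(records, noise=0.05, seed=1):
    """Return jittered copies of existing records (numeric noise, labels untouched)."""
    rng = random.Random(seed)
    augmented = []
    for rec in records:
        copy = dict(rec)
        factor = 1 + rng.uniform(-noise, noise)  # at most +/- 5% by default
        copy["income"] = round(max(0.0, rec["income"] * factor), 2)
        augmented.append(copy)
    return augmented


base = generate_rule_based(100)
extra = augment_by_perturbation(base)
dataset = base + extra
```

Fixing the seeds keeps the generated dataset reproducible, which helps when the same synthetic set is reused for evaluation. The perturbation step only jitters one numeric field, so it adds volume but little new diversity.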