# AI Training Data Collection

When you use AI platforms, your prompts and the model's responses may be collected and used to improve future models. Your data then effectively becomes part of the training set and may surface in some form in the model's future outputs.

## How it works

Most AI providers distinguish between:

- **Consumer/free tier**: your conversations are typically used for training unless you opt out
- **API access**: data is generally not used for training by default (OpenAI, Anthropic, and Google have each committed to this for their paid API services; terms differ by provider and change over time, so check the current policy)
- **Enterprise plans**: explicit contractual guarantees about data handling

The distinction matters. Using ChatGPT in a browser is not the same as calling the OpenAI API.

## What gets collected

- Your prompts (including any pasted code, documents, or data)
- The model's responses
- Conversation metadata (timing, length, model used)
- Feedback signals (thumbs up/down, regenerations)

## Risks

- **Model memorization**: in rare cases, training data can be extracted verbatim from a model through targeted prompting
- **Aggregation risk**: even if individual prompts seem harmless, patterns across many prompts can reveal strategy, priorities, and capabilities
- **Irreversibility**: once data has been used to train a model, it cannot be reliably removed without retraining
- **Competitive exposure**: your workflows, approaches, and domain expertise become training signal for a model your competitors also use

## How to protect yourself

1. Check each provider's data policy and opt-out settings
2. Use API access for anything sensitive
3. Use [[Running AI Models Locally|local models]] for the most confidential work
4. Never paste credentials, API keys, or secrets into any AI tool
5. Establish organizational policies about what can and cannot be shared with AI

## Related

- [[AI Privacy]]
- [[Running AI Models Locally]]
- [[AI Safety]]
- [[AI Governance]]
- [[Data Poisoning]]
- [[Responsible AI]]
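
## Example: checking a prompt for secrets

The advice against pasting credentials into AI tools can be partly automated with a pre-send filter. The following is a minimal sketch, not a complete scanner: the names `SECRET_PATTERNS` and `redact` are hypothetical, and the three patterns are illustrative assumptions that cover only a few common formats. A filter like this reduces accidental leaks but does not replace a review of what you are about to share.

```python
import re

# Illustrative patterns only (assumption): real secret formats vary widely,
# and this short list will miss many of them.
SECRET_PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM-encoded private key blocks, possibly spanning many lines.
    "private_key_block": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    # Assignments such as `password=...` or `api_key: ...`.
    "generic_assignment": re.compile(
        r"\b(?:api[_-]?key|secret|token|password)\s*[:=]\s*\S+",
        re.IGNORECASE,
    ),
}


def redact(text: str) -> tuple[str, list[str]]:
    """Replace anything matching a known pattern with a placeholder.

    Returns the redacted text and the names of the patterns that matched.
    """
    found = []
    for name, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            found.append(name)
    return text, found


if __name__ == "__main__":
    prompt = "Why does this fail? password=hunter2hunter2"
    clean, found = redact(prompt)
    if found:
        print("Redacted before sending:", ", ".join(found))
    print(clean)
```

Running the filter on every outgoing prompt and blocking or warning when `found` is non-empty is one way to back an organizational policy with tooling.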