# AI Multimodal
AI models that process, and in some cases generate, multiple data types (modalities): text, images, audio, video, code, and structured data. Examples: GPT-4o (text + vision + audio), Claude (text + vision), Gemini (text + vision + audio + video), [[Gemma 4]] (text + image + audio on small models).
Key capability: cross-modal reasoning and generation (e.g., answering questions about an image, generating images from text descriptions).
Trend: models are converging toward unified multimodal architectures rather than separate models per modality.
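One way to picture a unified architecture is a toy sketch in which every modality is encoded into tokens of the same width, and the tokens are concatenated into a single sequence for one shared model. Everything below is hypothetical and for illustration only: `EMBED_DIM`, `Token`, `encode_text`, `encode_image`, and `build_sequence` are made-up names, and the "encoders" are placeholders rather than learned networks or any real model's method.

```python
from dataclasses import dataclass
from typing import List

EMBED_DIM = 4  # hypothetical width of the shared embedding space


@dataclass
class Token:
    modality: str           # which modality the token came from
    embedding: List[float]  # vector in the shared space


def encode_text(text: str) -> List[Token]:
    """Toy text encoder: one token per whitespace-separated word."""
    # Placeholder embedding based on word length; a real model learns this.
    return [Token("text", [float(len(word))] * EMBED_DIM) for word in text.split()]


def encode_image(pixels: List[List[float]]) -> List[Token]:
    """Toy image encoder: one token per pixel row (a stand-in for patches)."""
    tokens = []
    for row in pixels:
        mean = sum(row) / len(row)  # placeholder feature: mean intensity of the row
        tokens.append(Token("image", [mean] * EMBED_DIM))
    return tokens


def build_sequence(text: str, pixels: List[List[float]]) -> List[Token]:
    """Concatenate image and text tokens into one sequence for a single model."""
    return encode_image(pixels) + encode_text(text)


sequence = build_sequence("what color is this", [[0.0, 1.0], [0.5, 0.5]])
print([t.modality for t in sequence])
```

Because all tokens share one width, a single downstream model could in principle attend across image and text positions together, which is the contrast with running a separate model per modality.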
## References
## Related
- [[Large Language Models (LLMs)]]
- [[Generative AI (Gen AI)]]
- [[Diffusion Models]]