# AI Multimodal

AI models that process and generate across multiple data types (modalities): text, images, audio, video, code, and structured data.

Examples:

- GPT-4o (text + vision + audio)
- Claude (text + vision)
- Gemini (text + vision + audio + video)
- [[Gemma 4]] (text + image + audio on small models)

Key capability: cross-modal reasoning (e.g., answering questions about an image, generating images from text descriptions).

Trend: models are converging toward unified multimodal architectures rather than separate models per modality.

## References

## Related

- [[Large Language Models (LLMs)]]
- [[Generative AI (Gen AI)]]
- [[Diffusion Models]]