# Pixtral

Pixtral is a family of vision-language models developed by [[Mistral AI]], designed to understand both natural images and documents while maintaining strong text-only performance.

## Pixtral 12B

Released in September 2024, Pixtral 12B is a natively multimodal model trained with interleaved image and text data. It features a new 400M parameter vision encoder trained from scratch and a 12B parameter multimodal decoder based on Mistral Nemo. The model supports variable image sizes and aspect ratios, and can process multiple images in its 128K token context window.

Pixtral 12B achieved 52.5% on the MMMU reasoning benchmark and excels at chart and figure understanding, document question answering, multimodal reasoning, and instruction following. It is released under the [[Apache 2.0 License]].

## Pixtral Large

Released in November 2024, Pixtral Large is a 124B open-weights multimodal model built on top of Mistral Large 2. It features a 1B parameter vision encoder and a 123B multimodal decoder with a 128K context window (fitting a minimum of 30 high-resolution images).

Pixtral Large achieved state-of-the-art results on MathVista (69.4%), and outperformed GPT-4o and Gemini-1.5 Pro on ChartQA and DocVQA. It was the best open-weights model on the LMSys Vision Leaderboard by a substantial margin. Pixtral Large was released under the Mistral Research License (MRL).

Note: Pixtral Large has since been deprecated in favor of newer Mistral vision models.

## References

- https://mistral.ai/news/pixtral-12b
- https://mistral.ai/news/pixtral-large
- https://arxiv.org/abs/2410.07073

## Related

- [[Mistral AI]]
- [[Large Language Models (LLMs)]]
- [[Mistral OCR]]
- [[Natural Language Processing (NLP)]]
- [[Apache 2.0 License]]