GLM OCR - DeveloPassion

# GLM OCR

A multimodal OCR model for complex document understanding, built on the GLM-V encoder-decoder architecture. Created by Zhipu AI (zai-org). With only 0.9B parameters, it achieves state-of-the-art performance while remaining lightweight enough for edge deployment.

## Architecture

Three components:

- **CogViT visual encoder** pre-trained on large-scale image-text data
- **Lightweight cross-modal connector** with efficient token downsampling
- **GLM-0.5B language decoder**

Inference uses a two-stage pipeline: layout analysis (PP-DocLayoutV3) followed by parallel recognition. Training uses Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning.

## Key Features

- **State-of-the-art performance**: 94.62 on OmniDocBench V1.5, ranked #1 overall
- **Supported tasks**: text recognition, table extraction, formula recognition, figure recognition, multi-page documents, code-heavy documents, seals and signatures
- **Efficient inference**: 0.9B parameters; supports deployment via vLLM, SGLang, and [[Ollama]]
- **Output formats**: JSON (structured layout with bounding boxes) and Markdown

## Installation

```bash
pip install glmocr                 # basic
pip install "glmocr[selfhosted]"   # self-hosted with vLLM/SGLang
pip install "glmocr[server]"       # Flask server mode
```

## Usage

### Via [[Ollama]]

```bash
ollama run glm-ocr "Text Recognition: ./image.png"
ollama run glm-ocr "Table Recognition: ./image.png"
ollama run glm-ocr "Figure Recognition: ./image.png"
```

Available model variants:

| Model | Size | Context |
|-------|------|---------|
| glm-ocr:latest | 2.2 GB | 128K |
| glm-ocr:q8_0 | 1.6 GB | 128K |
| glm-ocr:bf16 | 2.2 GB | 128K |

### Via Python SDK

```python
from glmocr import parse

# Run OCR on a single image and write the results to disk
result = parse("image.png")
result.save()
```

### Via CLI

```bash
glmocr parse image.png --output ./results/
```

### Via Flask Server

```bash
# Start the server
python -m glmocr.server

# In another terminal, send a parse request with a JSON body
curl -X POST http://localhost:5002/glmocr/parse \
  -H "Content-Type: application/json" \
  -d '{"images": ["./code.png"]}'
```

## Deployment Options

1. **Zhipu MaaS API** (cloud, no GPU needed)
2. **Self-hosted with vLLM/SGLang** (local, OpenAI-compatible API)
3. **[[Ollama]]/MLX** (edge deployment, Apple Silicon)

## License

- Code: Apache 2.0
- Model: MIT
- Pipeline integrates PP-DocLayoutV3 (Apache 2.0)

## References

- Ollama model page: https://ollama.com/library/glm-ocr
- Source code and SDK: https://github.com/zai-org/GLM-OCR

## Related

- [[Ollama]]
- [[Mistral OCR]]
- [[Transcriber plugin for Obsidian]]
- [[How to convert notes from analog to digital]]
- [[LangExtract]]
- [[Pixtral]]
- [[Large Language Models (LLMs)]]
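
## Example: calling the Flask server from Python

The curl request in the Flask Server section can also be sent from Python. The sketch below uses only the standard library and assumes the endpoint and payload shape shown in that curl example (`/glmocr/parse` on port 5002, a JSON object with an `images` list). The helper names `build_parse_request` and `parse_images` are hypothetical and not part of the glmocr SDK, and the response schema is not documented in this note, so the reply is returned as decoded JSON without further interpretation.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust host and port for your setup.
DEFAULT_URL = "http://localhost:5002/glmocr/parse"


def build_parse_request(images, url=DEFAULT_URL):
    """Build (but do not send) a POST request for the parse endpoint.

    Hypothetical helper for illustration; not part of the glmocr SDK.
    """
    body = json.dumps({"images": list(images)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_images(images, url=DEFAULT_URL, timeout=60):
    """Send the request and return the decoded JSON response.

    Assumes the server replies with a JSON body; the schema is not
    interpreted here and is returned as-is.
    """
    request = build_parse_request(images, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with the server from `python -m glmocr.server` running:
#     result = parse_images(["./code.png"])
```

Building the request separately from sending it makes it possible to inspect the URL, headers, and body without a running server.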