# GLM OCR
A multimodal OCR model for complex document understanding, built on the GLM-V encoder-decoder architecture. Created by Zhipu AI (zai-org). With only 0.9B parameters, it achieves state-of-the-art performance while remaining lightweight enough for edge deployment.
## Architecture
Three components:
- **CogViT visual encoder** pre-trained on large-scale image-text data
- **Lightweight cross-modal connector** with efficient token downsampling
- **GLM-0.5B language decoder**
Uses a two-stage pipeline of layout analysis (PP-DocLayout-V3) and parallel recognition. Training uses Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning.
## Key Features
- **State-of-the-art performance**: 94.62 on OmniDocBench V1.5, ranked #1 overall
- **Supported tasks**: text recognition, table extraction, formula recognition, figure recognition, multi-page documents, code-heavy documents, seals and signatures
- **Efficient inference**: 0.9B parameters; supports deployment via vLLM, SGLang, and [[Ollama]]
- **Output formats**: JSON (structured layout with bounding boxes) and Markdown
## Installation
```bash
pip install glmocr # basic
pip install "glmocr[selfhosted]" # self-hosted with vLLM/SGLang
pip install "glmocr[server]" # Flask server mode
```
## Usage
### Via [[Ollama]]
```bash
ollama run glm-ocr "Text Recognition: ./image.png"
ollama run glm-ocr "Table Recognition: ./image.png"
ollama run glm-ocr "Figure Recognition: ./image.png"
```
Available model variants:
| Model | Size | Context |
|-------|------|---------|
| glm-ocr:latest | 2.2 GB | 128K |
| glm-ocr:q8_0 | 1.6 GB | 128K |
| glm-ocr:bf16 | 2.2 GB | 128K |
### Via Python SDK
```python
from glmocr import parse
result = parse("image.png")
result.save()
```
### Via CLI
```bash
glmocr parse image.png --output ./results/
```
### Via Flask Server
```bash
python -m glmocr.server
curl -X POST http://localhost:5002/glmocr/parse \
-d '{"images": ["./code.png"]}'
```
## Deployment Options
1. **Zhipu MaaS API** (cloud, no GPU needed)
2. **Self-hosted with vLLM/SGLang** (local, OpenAI-compatible API)
3. **[[Ollama]]/MLX** (edge deployment, Apple Silicon)
## License
- Code: Apache 2.0
- Model: MIT
- Pipeline integrates PP-DocLayoutV3 (Apache 2.0)
## References
- Ollama model page: https://ollama.com/library/glm-ocr
- Source code and SDK: https://github.com/zai-org/GLM-OCR
## Related
- [[Ollama]]
- [[Mistral OCR]]
- [[Transcriber plugin for Obsidian]]
- [[How to convert notes from analog to digital]]
- [[LangExtract]]
- [[Pixtral]]
- [[Large Language Models (LLMs)]]