# LangExtract
LangExtract is an open-source [[Python]] library by [[Google]] that uses large language models to extract structured data from unstructured text while linking every extraction back to its exact location in the source material. It suits tasks such as processing clinical notes, reports, or any other document that requires organized information retrieval, and it needs no model fine-tuning.
Users provide a prompt describing what to extract, a few examples of the desired output format, and input text. LangExtract chunks long documents, processes them in parallel, and merges results while tracking exact source locations for every extracted piece of data.
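The chunk, process, and merge flow can be illustrated with a small standard-library sketch. This is not LangExtract's implementation: `chunk_text`, `toy_extract`, `extract_document`, and `Span` are hypothetical names, and a regular expression stands in for the model call. The point is only to show how per-chunk offsets are shifted back into whole-document coordinates.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    label: str
    text: str
    start: int  # character offset in the full document (inclusive)
    end: int    # character offset in the full document (exclusive)


def chunk_text(text, max_chars):
    """Split text into (offset, chunk) pairs, breaking at spaces where possible."""
    chunks, pos = [], 0
    while pos < len(text):
        end = min(pos + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half.
            cut = text.rfind(" ", pos, end)
            if cut > pos:
                end = cut + 1
        chunks.append((pos, text[pos:end]))
        pos = end
    return chunks


def toy_extract(chunk):
    """Stand-in for a model call (assumption): finds dosages such as '250mg'."""
    return [("dosage", m.start(), m.end()) for m in re.finditer(r"\b\d+mg\b", chunk)]


def extract_document(text, max_chars=40, workers=4):
    chunks = chunk_text(text, max_chars)
    # Process chunks in parallel; pool.map preserves chunk order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda c: toy_extract(c[1]), chunks))
    spans = []
    for (offset, chunk), found in zip(chunks, per_chunk):
        for label, start, end in found:
            # Shift chunk-local offsets into whole-document offsets.
            spans.append(Span(label, chunk[start:end], offset + start, offset + end))
    return sorted(spans, key=lambda s: s.start)


document = (
    "Patient was given 250mg amoxicillin twice daily. "
    "After one week the dose was lowered to 125mg. No adverse effects."
)
spans = extract_document(document)
for s in spans:
    print(s.label, s.text, s.start, s.end)
```

Because each span stores whole-document offsets, `document[s.start:s.end]` always reproduces the extracted text, which is what makes later verification possible.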
## Key Features
- **Source grounding**: Every extracted data point maps back to its exact location in the original text, enabling visual verification through highlighting
- **Structured outputs**: Enforces schema-based results through controlled generation on compatible models (e.g., [[Gemini]])
- **Long document handling**: Chunking, parallel processing, and multiple extraction passes for documents that exceed token limits
- **Interactive visualization**: Generates self-contained HTML files displaying extracted entities in their original context
- **Flexible model support**: Works with [[Gemini]], OpenAI, Vertex AI, and local models via [[Ollama]]
- **Domain adaptability**: Define custom extraction tasks for any domain using only a few examples, with no retraining needed
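To make the source-grounding and visualization ideas concrete, here is a minimal sketch that locates extracted phrases in a text and wraps them in `<mark>` tags. It uses only the standard library; `locate` and `highlight` are hypothetical helpers, and the HTML that LangExtract itself generates is richer than this.

```python
import html


def locate(text, phrase, label):
    """Map an extracted phrase to its (start, end, label) position in text."""
    start = text.index(phrase)  # raises ValueError if the phrase is not in the text
    return (start, start + len(phrase), label)


def highlight(text, spans):
    """Wrap each span in a <mark> tag; spans must not overlap."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        out.append(html.escape(text[cursor:start]))
        out.append(
            f'<mark title="{html.escape(label)}">{html.escape(text[start:end])}</mark>'
        )
        cursor = end
    out.append(html.escape(text[cursor:]))
    return "".join(out)


note = "BP 120/80 & pulse 72 after 5mg lisinopril."
grounded = [locate(note, "5mg", "dosage"), locate(note, "lisinopril", "medication")]
page = highlight(note, grounded)
print(page)
```

Keeping offsets rather than only the extracted strings is what allows a reviewer to see each value in its original context.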
## Installation
```bash
pip install langextract
```
From source:
```bash
git clone https://github.com/google/langextract
# Enter the cloned repository before installing in editable mode
cd langextract
pip install -e .
```
Docker is also supported.
## How It Works
1. Define a **prompt** describing the extraction rules
2. Provide **few-shot examples** showing the desired output format
3. Pass **input text** to `lx.extract()` with a model choice
4. Save the returned results to a JSONL file
5. Generate visual reports with `lx.visualize()`
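The steps above can be sketched as one function. The calls follow the usage shown in the project's README at the time of writing (`lx.data.ExampleData`, `lx.data.Extraction`, `lx.extract`, `lx.io.save_annotated_documents`, `lx.visualize`), but parameter names may differ between versions, so treat this as a sketch and check the current documentation. The name `run_pipeline`, the prompt, the example text, the file names, and the default model ID are illustrative assumptions. Running it requires the package and, for Gemini models, an API key (typically supplied through the `LANGEXTRACT_API_KEY` environment variable).

```python
import textwrap

# Step 1: a prompt describing the extraction rules (illustrative wording).
PROMPT = textwrap.dedent("""\
    Extract medications and dosages in order of appearance.
    Use exact text from the input. Do not paraphrase.""")


def run_pipeline(input_text, model_id="gemini-2.5-flash"):
    """Hypothetical wrapper around the five steps; needs `pip install langextract`."""
    import langextract as lx  # imported lazily so the sketch loads without the package

    # Step 2: a few-shot example showing the desired output format.
    examples = [
        lx.data.ExampleData(
            text="Patient was given 250 mg IV Cefazolin TID for one week.",
            extractions=[
                lx.data.Extraction(extraction_class="dosage", extraction_text="250 mg"),
                lx.data.Extraction(extraction_class="medication", extraction_text="Cefazolin"),
            ],
        )
    ]

    # Step 3: run the extraction with a chosen model.
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=PROMPT,
        examples=examples,
        model_id=model_id,
    )

    # Step 4: save the annotated result as JSONL.
    lx.io.save_annotated_documents(
        [result], output_name="extraction_results.jsonl", output_dir="."
    )

    # Step 5: build the HTML report; in notebooks the return value may be a
    # display object whose markup is stored in `.data`.
    html_content = lx.visualize("extraction_results.jsonl")
    with open("visualization.html", "w") as f:
        f.write(html_content.data if hasattr(html_content, "data") else html_content)
    return result


# Example call (requires network access and an API key):
# run_pipeline("The patient takes 400 mg Ibuprofen daily.")
```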
## References
- GitHub: https://github.com/google/langextract
## Related
- [[AI]]
- [[Python]]
- [[Google]]
- [[Gemini]]
- [[Ollama]]