# MarkItDown
Lightweight Python utility by [[Microsoft]] that converts various file formats to Markdown. Designed for LLM consumption and text analysis pipelines. Preserves document structure (headings, lists, tables, links) while producing token-efficient output. MIT licensed.
## Supported Formats
- **Documents**: PDF, PowerPoint, Word, Excel
- **Media**: images (EXIF metadata and OCR), audio (metadata and speech transcription)
- **Web**: HTML, YouTube URLs
- **Data**: CSV, JSON, XML
- **Containers**: ZIP files (iterates contents)
- **eBooks**: ePub
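To make "iterates contents" concrete, here is a minimal sketch of walking a ZIP archive and converting each member in turn. It uses only the standard library and is not MarkItDown's own implementation: the `convert` callback, the `plain_text` stand-in, and the `## File:` heading layout are all assumptions made for illustration.

```python
import io
import zipfile
from typing import Callable


def zip_to_markdown(data: bytes, convert: Callable[[str, bytes], str]) -> str:
    """Convert every file in a ZIP archive and join the results.

    `convert` is a hypothetical callback: (member name, raw bytes) -> Markdown.
    """
    sections = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content of their own
            body = convert(info.filename, archive.read(info))
            # The per-file heading format is an assumption, not the library's.
            sections.append(f"## File: {info.filename}\n\n{body}")
    return "\n\n".join(sections)


def plain_text(name: str, raw: bytes) -> str:
    # Stand-in converter for the demo: treats every member as UTF-8 text.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("notes/a.txt", "alpha")
        archive.writestr("b.txt", "beta")
    print(zip_to_markdown(buffer.getvalue(), plain_text))
```

In practice the callback would dispatch on the member's file type rather than assume text.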
## Usage
### CLI
```bash
markitdown file.pdf > output.md   # write Markdown to stdout and redirect it
markitdown file.pdf -o output.md  # write directly to a file
cat file.pdf | markitdown         # read the input from stdin
```
### Python API
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("file.xlsx")
print(result.text_content)
```
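For pipelines that process many files, the same `convert(...).text_content` call can be wrapped in a small batch helper. The following is a sketch under stated assumptions: `convert_tree`, its parameters, the default suffix list, and the `<name>.<ext>.md` naming scheme are hypothetical and not part of MarkItDown. The converter is injectable so the helper can be exercised without the package installed.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_tree(
    src: Path,
    dst: Path,
    convert: Optional[Callable[[Path], str]] = None,
    suffixes: tuple[str, ...] = (".pdf", ".docx", ".pptx", ".xlsx"),
) -> list[Path]:
    """Write one Markdown file per matching source file, mirroring src's layout."""
    if convert is None:
        # Imported only when no converter is supplied, so the helper
        # itself does not require markitdown to be installed.
        from markitdown import MarkItDown

        md = MarkItDown()  # one instance reused for every file

        def convert(path: Path) -> str:
            return md.convert(str(path)).text_content

    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        target = dst / path.relative_to(src)
        # Append ".md" to the full name (report.pdf -> report.pdf.md) so that
        # report.pdf and report.docx do not overwrite each other.
        target = target.with_name(target.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(path), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_tree(Path("docs"), Path("markdown"))` would use MarkItDown itself; passing a custom `convert` function makes it easy to add caching or error handling around each file.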
### With LLM Vision
```python
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(llm_client=OpenAI(), llm_model="gpt-4o")
result = md.convert("image.jpg")
print(result.text_content)  # Markdown including the LLM-generated image description
```
### Docker
```bash
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < file.pdf > output.md
```
## Installation
```bash
pip install 'markitdown[all]' # everything
pip install 'markitdown[pdf,docx,pptx]' # selective
```
Optional feature groups: `[pdf]`, `[docx]`, `[pptx]`, `[xlsx]`, `[xls]`, `[outlook]`, `[audio-transcription]`, `[youtube-transcription]`, `[az-doc-intel]`.
## Integrations
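- MCP (Model Context Protocol) server for LLM applications such as Claude Desktop
- Azure Document Intelligence support (requires the `[az-doc-intel]` extra)
- Third-party plugin system (e.g., `markitdown-ocr` for OCR via LLM vision); plugins are disabled by default and are enabled with `--use-plugins` on the CLI or `enable_plugins=True` in Python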
## Tech Stack
Python 3.10+. Development uses Hatch for environment management and pre-commit hooks for code quality checks.
## References
- https://github.com/microsoft/markitdown
## Related
- [[Open Source]]