# MarkItDown

Lightweight Python utility by [[Microsoft]] that converts various file formats to Markdown. Designed for LLM consumption and text analysis pipelines. Preserves document structure (headings, lists, tables, links) while producing token-efficient output. MIT licensed.

## Supported Formats

- **Documents**: PDF, PowerPoint, Word, Excel
- **Media**: images (with EXIF and OCR), audio (with transcription)
- **Web**: HTML, YouTube URLs
- **Data**: CSV, JSON, XML
- **Containers**: ZIP files (iterates contents)
- **eBooks**: ePub

## Usage

### CLI

```bash
markitdown file.pdf > output.md
markitdown file.pdf -o output.md
cat file.pdf | markitdown
```

### Python API

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("file.xlsx")
print(result.text_content)
```

### With LLM Vision

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(llm_client=OpenAI(), llm_model="gpt-4o")
result = md.convert("image.jpg")
```

### Docker

```bash
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < file.pdf > output.md
```

## Installation

```bash
pip install 'markitdown[all]'            # everything
pip install 'markitdown[pdf,docx,pptx]'  # selective
```

Optional feature groups: `[pdf]`, `[docx]`, `[pptx]`, `[xlsx]`, `[xls]`, `[outlook]`, `[audio-transcription]`, `[youtube-transcription]`, `[az-doc-intel]`.

## Integrations

- MCP server for Claude Desktop
- Azure Document Intelligence support
- Third-party plugin system (e.g., `markitdown-ocr` for OCR via LLM vision)

## Tech Stack

Python 3.10+. Uses Hatch for environment management, pre-commit hooks for quality.

## References

- https://github.com/microsoft/markitdown

## Related

- [[Open Source]]
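
## Example: Batch Conversion

The Python API shown under Usage converts one file at a time. For a text analysis pipeline it may be convenient to walk a directory and write one Markdown file per input. The following is a minimal sketch, not part of MarkItDown itself: `convert_tree`, `markitdown_converter`, and the default suffix list are hypothetical names and choices made for illustration. The only library calls assumed are `MarkItDown()`, `convert(...)`, and `.text_content`, as used in the Python API example.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def convert_tree(
    src_dir,
    out_dir,
    convert: Callable[[str], str],
    suffixes: Iterable[str] = (".pdf", ".docx", ".pptx", ".xlsx"),
) -> List[Path]:
    """Convert every matching file under src_dir and mirror the tree in out_dir.

    `convert` takes a file path and returns Markdown text. Output files keep
    the original file name and append ".md", so "a.pdf" and "a.docx" do not
    overwrite each other.
    """
    src, out = Path(src_dir), Path(out_dir)
    wanted = {s.lower() for s in suffixes}
    written = []
    for path in sorted(src.rglob("*")):
        # Skip directories and files whose extension is not in the list.
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        rel = path.relative_to(src)
        target = out / rel.parent / (rel.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(str(path)), encoding="utf-8")
        written.append(target)
    return written


def markitdown_converter() -> Callable[[str], str]:
    """Build a converter backed by MarkItDown (hypothetical helper).

    The import is done lazily so that convert_tree can be used and tested
    with any other callable when markitdown is not installed.
    """
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda file_path: md.convert(file_path).text_content
```

With the relevant optional feature groups installed, a call such as `convert_tree("docs", "docs_md", markitdown_converter())` would write the converted files under `docs_md`. Passing the converter as a plain callable keeps the directory walk independent of MarkItDown, so the same function can be exercised with a stub.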