CocoIndex - DeveloPassion

# CocoIndex

**CocoIndex is an incremental data framework that keeps the context behind AI agents continuously fresh by reprocessing only what changed.** Open source, Rust core, Python on top. Apache 2.0.

The problem it solves: batch pipelines go stale. You re-index everything on a schedule, the agent reads outdated data between runs, and the bill scales with the whole corpus instead of the delta. CocoIndex flips that. Change a source byte and only the affected rows propagate, in under a second.

## The mental model

`Target = F(Source)`. You declare the desired target state; the engine keeps it in sync, React-style. You don't build the DAG by hand; CocoIndex derives the processing graph from your code.

```python
@coco.fn(memo=True)
async def index_file(file, table):
    for chunk in RecursiveSplitter().split(await file.read_text()):
        table.declare_row(text=chunk.text, embedding=embed(chunk.text))
```

## Why it matters

- **Delta-only processing.** Only the Δ is recomputed on every change. Up to **10x cheaper** at scale.
- **Sub-second freshness.** Changes reach the target store almost immediately.
- **End-to-end lineage.** Every output traces back to its source byte; you can audit and invalidate precisely.
- **Code-aware caching.** Memoization invalidates only the transformations whose code or inputs actually changed.
- **Production reliability** from the Rust core: parallel chunking, retries, exponential backoff, dead-letter queues.

## What it connects

- **Sources:** codebases, meeting notes, APIs, filesystems, databases, message queues, images/video, transcripts.
- **Targets:** relational databases, data warehouses, vector databases (Qdrant, LanceDB), graph databases, message queues, feature stores.

Common builds: code embedding, PDF indexing, knowledge-graph extraction from meeting notes, Kafka streaming, real-time codebase indexing. The code-intelligence offering ships as a separate product, [[CocoIndexCode]].

## Core concepts

- **App**: the top-level executable. It reads sources, transforms data, and declares target states to sync.
- **Processing component**: a logical unit that groups one item's processing with its target state. Each runs independently and commits its changes atomically; it doesn't wait for the whole app to finish.
- **Target state**: `TargetState = Transform(SourceState)`. A pure function of the source, synced to a database, vector store, or filesystem.
- **Functions / transforms**: discrete operations (PDF to markdown, chunking, embedding). Memoized, so unchanged inputs and unchanged code skip recomputation.

Memoization works at two levels: skip an entire processing component when its inputs and logic are unchanged, and skip individual transforms when intermediate results still match.

The engine detects what changed and applies only the needed inserts, updates, and deletes inside atomic transactions. No manual delta computation, no hand-rolled state tracking.

## Getting started

```bash
pip install -U cocoindex # also: uv add cocoindex, or Poetry ^1.0
```

Requires Python 3.11–3.13 on macOS, Linux, or Windows 10+.

Key CLI verbs:

- `cocoindex init`: scaffold a project (`main.py`, `pyproject.toml`, README).
- `cocoindex ls`: list apps and their persisted status.
- `cocoindex show`: inspect an app's stable paths and components.
- `cocoindex update`: run in catch-up mode; add `--live` to keep components processing as sources change. Also `--reset`, `--full-reprocess`, `--preview`.
- `cocoindex drop`: revert all target states and clear internal state.

## References

- https://cocoindex.io/
- https://cocoindex.io/docs/getting_started/installation/
- https://cocoindex.io/blogs/
- https://github.com/cocoindex-io/cocoindex
- https://github.com/cocoindex-io

## Related

- [[CocoIndexCode]]
- [[Embeddings]]
- [[RAG Pipelines]]
- [[Retrieval-Augmented Generation (RAG)]]
- [[Vector Store]]
- [[Semantic Search]]
- [[qmd]]
- [[Rust]]
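
## Sketch: memoization and delta sync

To make the core concepts concrete, here is a toy, standard-library-only sketch of two ideas from this note: code-aware memoization (a transform is skipped when neither its code nor its input changed) and reconciling a declared target state into inserts, updates, and deletes. It is not CocoIndex's implementation or API. The names `fingerprint`, `Memo`, `reconcile`, `to_upper`, and `build_target` are hypothetical, and hashing the function's bytecode is an assumed stand-in for however the real engine detects code changes.

```python
import hashlib


def fingerprint(fn, *args):
    """Hash a function's compiled code together with its arguments.

    Assumption: bytecode plus constants is a good-enough proxy for "the code".
    """
    h = hashlib.sha256()
    h.update(fn.__code__.co_code)
    h.update(repr(fn.__code__.co_consts).encode())
    h.update(repr(args).encode())
    return h.hexdigest()


class Memo:
    """Cache transform results keyed by (code, inputs) fingerprint."""

    def __init__(self):
        self.cache = {}
        self.calls = 0  # number of real (non-cached) executions

    def run(self, fn, *args):
        key = fingerprint(fn, *args)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = fn(*args)
        return self.cache[key]


def reconcile(previous, desired):
    """Compute the minimal changes that turn `previous` into `desired`."""
    inserts = {k: v for k, v in desired.items() if k not in previous}
    updates = {k: v for k, v in desired.items()
               if k in previous and previous[k] != v}
    deletes = [k for k in previous if k not in desired]
    return inserts, updates, deletes


def to_upper(text):
    # Hypothetical stand-in for an expensive transform such as embedding.
    return text.upper()


def build_target(memo, sources):
    # Target = F(Source): declare the full desired state on every run.
    return {path: memo.run(to_upper, text) for path, text in sources.items()}


memo = Memo()

# First run: three sources, three real transform calls.
v1 = {"a.md": "alpha", "b.md": "beta", "c.md": "gamma"}
target1 = build_target(memo, v1)

# Second run: a.md unchanged, b.md edited, c.md removed, d.md added.
v2 = {"a.md": "alpha", "b.md": "beta v2", "d.md": "delta"}
target2 = build_target(memo, v2)

inserts, updates, deletes = reconcile(target1, target2)
print(memo.calls)   # 5: only b.md and d.md were recomputed on the second run
print(inserts)      # {'d.md': 'DELTA'}
print(updates)      # {'b.md': 'BETA V2'}
print(deletes)      # ['c.md']
```

The second run declares the whole target again, yet only the changed and new sources trigger real work, and only three rows would be written to the store. A real engine would also persist the cache and the previous target state between runs and apply the changes in a transaction; this sketch keeps both in memory.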