DuckDB - DeveloPassion

# DuckDB

DuckDB is an in-process analytical (OLAP) database, the analytical counterpart to [[SQLite]]. Created at CWI Amsterdam by Mark Raasveldt and Hannes Mühleisen and first released in 2019, it embeds directly inside the host application (Python, R, Node.js, Java, Rust, C++, the CLI), runs columnar queries on local files (CSV, Parquet, JSON, Arrow), and stores its own data in a single columnar file. No server, no daemon, no configuration.

## Positioning

- **Like [[SQLite]]**: zero-configuration, single-file, embedded, permissively licensed (DuckDB under MIT, SQLite in the public domain), with a stable public API
- **Unlike [[SQLite]]**: column-oriented and vectorized, designed for analytical (read-heavy, aggregation-heavy) workloads instead of transactional (point-lookup, single-row update) workloads
- **Like Pandas**: lives in the same process as your analysis code; no network round-trips
- **Unlike Pandas**: backed by a real query optimizer, parallel execution, and out-of-core processing for datasets larger than RAM

The shorthand: SQLite is for OLTP, DuckDB is for OLAP. They are siblings, not competitors.

## Why It Took Off

Analysts and data scientists were caught between two worlds: lightweight tools (Pandas, R data frames) that hit a wall above ~10GB, and heavyweight warehouses (Snowflake, BigQuery, Redshift) that brought infrastructure, latency, and cost they didn't want for exploratory work. DuckDB filled the gap. You can run SQL over a 50GB Parquet file on a laptop, often in seconds, with the same query you'd run in production.
## Core Capabilities

- **Vectorized columnar execution** with parallel multi-core query plans
- **Native readers** for Parquet, CSV, JSON, Arrow, Iceberg, Delta Lake, and remote HTTP/S3 URLs
- **Direct query** of in-memory Pandas/Polars/Arrow data frames, zero-copy where possible
- **Streaming and out-of-core**: works on datasets larger than RAM via spilling to disk
- **Extensions**: full-text search, spatial, HTTP, JSON, vector similarity, and many community extensions
- **Stable storage format** with backwards compatibility (since v1.0 in 2024)

## Common Use Cases

- Local data exploration without spinning up a warehouse
- ETL and data transformation pipelines (often as a Pandas/Spark replacement for medium data)
- Embedded analytics inside applications (web apps, notebooks, CLIs)
- Querying remote Parquet/Iceberg lakes directly without a separate query engine
- The "small data" backend for dashboards and BI prototypes

## Trade-offs

- Single-writer, like SQLite; not designed for high-concurrency write workloads
- No built-in network protocol; embedded-only by design (the project explicitly does not want to become a server)
- Younger than SQLite; ecosystem is growing but smaller than the SQL warehouse incumbents

## References

- Official site: https://duckdb.org/
- Documentation: https://duckdb.org/docs/
- GitHub: https://github.com/duckdb/duckdb

## Related

- [[SQLite]]
- [[Database]]
- [[Database Management Systems (DBMS)]]
- [[Relational Databases (RDBMS)]]
- [[SQL]]
- [[PostgreSQL]]