# LLM Knowledge Bases Over Unstructured Data

LLM knowledge bases perform computation over unstructured data: text articles, PDFs, transcripts, screenshots, code in unfamiliar languages, raw notes. This was effectively impossible with classical software because classical code requires schemas. Without a parseable structure, you cannot query, aggregate, or reason. [[Andrej Karpathy]] uses this category as his strongest example of what LLMs unlock that *fundamentally did not exist before*; not a faster version of an existing capability, but a new computational primitive.

## What Makes This Different

Pre-LLM, working with unstructured knowledge required a costly extraction step: humans (or fragile NLP pipelines) converted prose into rows, tables, embeddings, or knowledge graphs *before* any computation could happen. That extraction step was the bottleneck. It collapsed nuance, lost context, and demanded continuous maintenance as sources evolved.

LLMs eliminate the extraction prerequisite. They operate directly on the source material in its native form. Computation happens over meaning, not over schema-conformant rows.

## Categories of Computation Now Possible

- **Synthesis**: combine claims from 50 articles into a coherent position with citations.
- **Reconciliation**: detect contradictions across sources and surface which one is more credible.
- **Cross-referencing**: link a concept in one document to its appearance in another by meaning, not by string match.
- **Summarization at arbitrary granularity**: zoom in or out without re-extracting.
- **Translation across formats**: turn a podcast transcript into a structured note, a paper into a teaching map, a meeting recording into action items.
- **Question answering with provenance**: answer freeform questions and cite the underlying passage.

None of these were robust before. All are now table stakes.

## Canonical Example: [[LLM Wiki]]

The clearest implementation of this pattern is the [[LLM Wiki]] (Karpathy's term). A directory of raw sources sits alongside an LLM-maintained wiki of summaries, entity pages, and cross-references. The LLM ingests new sources, updates the wiki, lints for contradictions, and answers queries. The wiki itself is markdown, navigable by humans and crawlable by agents. See the full pattern in [[LLM Wiki]].

[[Farzapedia]] is an independent implementation of the same idea over personal data (diary, notes, iMessage).

## Why It's Hard to See This as New

People underrate this category because the *output* (a wiki page, an answer, a summary) looks like something humans have always produced. The novelty is on the production side. Previously, a human had to do the bookkeeping work; now the model does it tirelessly across thousands of sources.

In Karpathy's framing, the obvious applications of any new paradigm are speedups of what existed; the genuinely new affordances take longer to recognize because they don't fit the categories the audience already has.

## Limits

- **Verifiability gap**: outputs are confident even when wrong; see [[AI Verifiability as a Capability Ceiling]].
- **Scale ceiling at context window**: above a certain corpus size, hierarchy and indexing matter again (see [[Retrieval-Augmented Generation (RAG)]] vs wiki tradeoff in [[LLM Wiki]]).
- **Curation collapse**: automating the bookkeeping risks automating away the meaning-making (Steven Thompson's critique in [[LLM Wiki]]).

## Related

- [[Andrej Karpathy]]
- [[LLM Wiki]]
- [[Farzapedia]]
- [[Large Language Models (LLMs)]]
- [[Retrieval-Augmented Generation (RAG)]]
- [[Personal Knowledge Management (PKM)]]
- [[Compounding Knowledge]]
- [[Knowledge Graph (KG)]]
- [[AI Verifiability]]
- [[AI Verifiability as a Capability Ceiling]]
- [[Menugen Architecture Pattern]]
- [[Markdown-based Installation (MD Scripts)]]
- [[Tools to enhance understanding]]
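
## Sketch: Minimal LLM Wiki Loop

The ingest-and-crawl loop described under the canonical example could look roughly like the following. This is a minimal sketch under stated assumptions, not Karpathy's or Farzapedia's actual implementation: the `raw/` and `wiki/` directory split, the `.txt` source extension, and the function names `ingest` and `backlinks` are all hypothetical. The `summarize` argument stands in for whatever LLM call produces a markdown page; it is injected as a plain callable so no particular model API is implied. Only the `[[...]]` link syntax is taken from this note itself.

```python
import re
from pathlib import Path
from typing import Callable

# Matches the target of a [[wikilink]], stopping at "]", "|" (alias) or "#" (anchor).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def ingest(raw_dir: Path, wiki_dir: Path, summarize: Callable[[str], str]) -> list[Path]:
    """Write one markdown wiki page per raw source that has no page yet.

    `summarize` is a hypothetical stand-in for an LLM call: it receives the
    raw text and returns the markdown body of the wiki page.
    """
    wiki_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(raw_dir.glob("*.txt")):  # assumed source format
        page = wiki_dir / f"{source.stem}.md"
        if page.exists():
            continue  # already ingested; a real system would also handle updates
        page.write_text(summarize(source.read_text(encoding="utf-8")), encoding="utf-8")
        written.append(page)
    return written


def backlinks(wiki_dir: Path) -> dict[str, set[str]]:
    """Map each linked page title to the set of wiki pages that link to it."""
    index: dict[str, set[str]] = {}
    for page in sorted(wiki_dir.glob("*.md")):
        for target in WIKILINK.findall(page.read_text(encoding="utf-8")):
            index.setdefault(target.strip(), set()).add(page.stem)
    return index
```

Calling `ingest(Path("raw"), Path("wiki"), summarize)` after dropping new files into `raw/` adds the missing pages, and `backlinks(Path("wiki"))` gives an agent a cheap way to crawl cross-references. Contradiction linting and query answering are left out here because they depend entirely on the model call.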