Karpathy's LLM Wiki treats an LLM as a full-time librarian that compiles raw sources into structured, cross-referenced markdown pages — a navigable knowledge base — before any user query arrives. This eliminates embedding drift, chunk-boundary errors, and per-query retrieval noise that plague traditional RAG. Use it for stable knowledge under ~400K words (internal wikis, research synthesis, product docs). Keep RAG for high-velocity corpora over 100K documents. The hybrid pattern — compiled wiki for core knowledge plus RAG for volatile data — is the architecture most teams should land on.
Last month I was reviewing a client's internal documentation system — a standard RAG pipeline over 80 product spec documents. The pipeline worked well enough for simple lookups. But when their product team asked "how does our pricing model interact with the enterprise SLA terms?", the system retrieved chunks about pricing from one document and chunks about SLAs from another, never connecting the two. The answer was a confident-sounding paragraph that contradicted an exception clause buried in a third document the retrieval step never surfaced.
This is the failure mode that RAG's architecture makes inevitable. Each query is a fresh start. The system has no memory of what it learned from previous queries, no pre-built understanding of how documents relate, and no mechanism to flag contradictions before a user asks the right question. It's a search engine, not a knowledge system.
Then Andrej Karpathy published a GitHub gist in April 2026 that reframed the problem entirely. His "LLM Wiki" pattern — which hit 16 million views and spawned five-plus open-source implementations within two weeks — proposes treating an LLM not as a retrieval engine but as a full-time research librarian. One that reads your sources, builds a structured knowledge base, maintains cross-references, and flags contradictions before anyone asks a question.
## The Problem RAG Doesn't Solve
Traditional RAG solves the "how do I get relevant context to the LLM" problem. It does not solve the "how do I build and maintain a coherent understanding of my knowledge base" problem. These are fundamentally different challenges, and conflating them is where most teams go wrong.
Here's what happens in a standard RAG pipeline: you split documents into chunks, embed them, store vectors, and retrieve the top-k most similar chunks per query. Every query re-derives knowledge from raw text fragments. The system never learns that Document A's pricing section contradicts Document B's terms addendum. It never builds a concept of "pricing model" that spans multiple sources. It never notices that three documents reference an API endpoint that was deprecated last quarter.
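The statelessness is easy to see even in a toy sketch. The following minimal pipeline uses bag-of-words similarity as a stand-in for a real embedding model (the chunk texts and scoring are illustrative, not from any client system): every query re-scores the raw chunks from scratch, and nothing carries over to the next query.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real pipeline uses a neural embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Every query starts cold: score all chunks, keep top-k, forget everything.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Enterprise pricing is tiered by seat count.",
    "The enterprise SLA guarantees 99.9% uptime.",
    "Exception: SLA credits do not apply to legacy pricing tiers.",
]
top = retrieve("How does pricing interact with the enterprise SLA?", chunks)
```

Run it and the exception clause scores lowest and falls out of the top-2: the same retrieval miss described in the client anecdote above, reproduced in twenty lines.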
The symptoms show up in production as retrieval noise — the system returns chunks that are semantically similar to the query but contextually wrong. We've written extensively about why chunking strategies matter and how agentic retrieval loops can help, but both are patches on a fundamentally stateless architecture. The knowledge is never compiled — it's re-interpreted from scratch every time.
Karpathy's insight is that for many knowledge bases, this is backwards. The expensive work — reading, synthesizing, cross-referencing, flagging contradictions — should happen once, before any query arrives. The query should run against a pre-compiled knowledge artifact, not against raw source text.
## The Three-Layer Architecture
The LLM Wiki pattern has three layers, and understanding each is critical to implementing it correctly.
### Layer 1: Raw Sources
The `raw/` folder holds your original documents — articles, PDFs, meeting notes, code, specifications. These are immutable. The LLM reads them but never modifies them. This guarantees that every claim in the wiki traces back to an original source, giving you an audit trail that RAG's chunk-level citations cannot match.

The discipline here matters. You curate what goes into `raw/` — this isn't a dump of every document you own. It's the 50-200 sources that represent your actual knowledge base: internal wikis, product specs, research papers, regulatory documents, onboarding guides. If a document changes, you replace the source and trigger a recompilation of the affected wiki pages.
### Layer 2: The Compiled Wiki

The `wiki/` folder is where the LLM's work lives. It contains structured markdown files organized by type:

- Concept pages like `attention-mechanism.md` or `pricing-model.md` that synthesize what your sources say about a specific topic
- Entity pages like `anthropic.md` or `product-v3.md` that aggregate information about a specific thing
- Source summaries — one per ingested document — that capture key facts and flag how each source relates to existing wiki pages

Two structural files hold the architecture together:

- **`index.md`** — a catalog of every page, designed to fit in a single context window so the LLM can navigate the entire knowledge base without retrieval
- **`log.md`** — an append-only operation log that records every compilation action for transparency and debugging

This is the critical difference from RAG. When you add a new source about pricing, the LLM doesn't just embed it — it reads the existing `pricing-model.md` concept page, identifies what's new, updates the page with the additional information, cross-references against `enterprise-sla.md`, and flags any contradictions in the log. The knowledge compounds.
### Layer 3: The Schema
A CLAUDE.md (or equivalent configuration) file defines page structure, naming conventions, templates, and operational workflows. This transforms generic LLM behavior into disciplined knowledge work. Without it, the LLM will write wiki pages however it feels like on a given day. With it, every page follows a consistent structure that makes navigation and querying predictable.
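To make the idea concrete, here is a minimal schema sketch of my own, not Karpathy's actual file; the page types and rules are illustrative:

```markdown
# Wiki Schema

## Page types
- Concept page: `wiki/concepts/<slug>.md` — sections: Summary, Details, Sources, See also
- Entity page: `wiki/entities/<slug>.md` — sections: Overview, Key facts, Sources
- Source summary: `wiki/sources/<slug>.md` — one per file in `raw/`

## Rules
- Every claim cites a file in `raw/` by name.
- Every page appears in `wiki/index.md` with a one-line description.
- Contradictions between sources are flagged inline and appended to `wiki/log.md`.
```

Even a schema this small is enough to make the LLM's output predictable across compilation runs.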
## The Compiler Analogy
Karpathy frames this as a compiler for knowledge, and the analogy is precise enough to be actionable.
Traditional RAG is interpreted execution. Every query reads raw source documents (source code), re-parses them (lexing/parsing), and generates an answer (execution) at runtime. It works, but it's slow, inconsistent, and throws away all the work between queries.
The LLM Wiki is compiled execution. Raw sources are compiled into optimized artifacts (wiki pages) ahead of time. Queries run against pre-compiled knowledge — faster, more consistent, and benefiting from cross-source analysis that runtime interpretation cannot perform.
The compilation step is where the real value lives. When the LLM compiles a new source, it doesn't just summarize — it identifies entities, maps relationships to existing concept pages, detects contradictions with prior sources, and updates the index. This is the work that RAG skips entirely and that human knowledge workers do intuitively but slowly.
| Dimension | RAG (Interpreted) | LLM Wiki (Compiled) |
|---|---|---|
| State | Stateless — fresh retrieval per query | Stateful — knowledge compounds over time |
| Infrastructure | Vector DB + embedding pipeline + chunker | Markdown folders + LLM |
| Cross-references | Discovered ad-hoc via similarity search | Pre-built during compilation |
| Contradictions | Invisible until a user hits one | Flagged during ingestion |
| Cost model | Per-query retrieval + generation | One-time compilation + cheap navigation |
| Scale ceiling | Millions of documents | ~400K words (~100 articles) |
| Citation quality | Chunk-level (lossy) | Source-level (traceable) |
## Three Core Operations
The LLM Wiki runs on three operations that map cleanly to how a research librarian works.
### Ingest
When a new source arrives, the LLM reads it against the existing wiki index, creates or updates relevant concept and entity pages, establishes cross-references, writes a source summary, and logs every action. This is the compilation step — expensive upfront but amortized across every future query. A 10-page technical specification takes roughly 2-3 minutes to ingest with Claude Opus and costs about $0.15-0.30. The same document in a RAG pipeline would be embedded in seconds, but every future query that touches it pays retrieval and synthesis costs.
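The ingest loop can be sketched in a few lines. This is a skeleton under stated assumptions, not any of the open-source implementations: `llm` is any callable from prompt to text (swap in a real API client), and the prompt wording is illustrative.

```python
from datetime import date
from pathlib import Path

def ingest(source_path: Path, wiki: Path, llm) -> str:
    """Compile one raw source into the wiki. `llm` is a placeholder
    callable (prompt -> text); a real system would call an API client."""
    index = (wiki / "index.md").read_text()
    source = source_path.read_text()
    prompt = (
        "You maintain a markdown wiki. Current index:\n" + index +
        "\n\nNew source:\n" + source +
        "\n\nUpdate or create the relevant concept/entity pages, "
        "write a source summary, and flag contradictions."
    )
    result = llm(prompt)  # the expensive, one-time compilation step
    # The append-only log gives an audit trail of every compilation action.
    with (wiki / "log.md").open("a") as log:
        log.write(f"- {date.today()} ingested {source_path.name}\n")
    return result
```

The key property is that all the expensive reading happens here, once, rather than on every future query.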
### Query
Users don't search embeddings — they ask the LLM to navigate the wiki via the index. The LLM reads index.md, identifies relevant pages, retrieves only those pages, and synthesizes an answer with traceable citations back to specific wiki pages and ultimately to raw sources. Because the index fits in a single context window, there's no retrieval noise. The LLM knows exactly which pages exist and what they cover. This eliminates the top-k relevance lottery that plagues vector search — the LLM chooses pages by semantic understanding, not cosine similarity.
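A navigation-based query looks roughly like this. Again a hedged sketch, with `llm` as a stand-in callable and the two-step prompt structure as an assumption about how the picking and synthesis would be phrased:

```python
from pathlib import Path

def query(question: str, wiki: Path, llm) -> str:
    """Answer by navigating the wiki index, not by embedding search.
    `llm` is any callable (prompt -> text), stubbed for illustration."""
    index = (wiki / "index.md").read_text()
    # Step 1: the whole index fits in one context window, so the model
    # picks pages by reading it, not by cosine similarity.
    picks = llm(f"Index:\n{index}\n\nQuestion: {question}\n"
                "List the relevant page filenames, one per line.")
    pages = [wiki / name.strip() for name in picks.splitlines() if name.strip()]
    context = "\n\n".join(p.read_text() for p in pages if p.exists())
    # Step 2: synthesize an answer citing the pages just read.
    return llm(f"Pages:\n{context}\n\nAnswer with citations: {question}")
```

Two LLM calls per query is where the extra latency comes from, and also where the determinism comes from: the same wiki state yields the same pages.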
### Lint
The lint operation is the maintenance layer. The LLM scans the wiki for contradictions between pages, orphaned pages with no backlinks, stale information that conflicts with newer sources, and missing concepts that sources reference but no wiki page covers. Think of it as eslint for knowledge — a health check that keeps the knowledge base coherent as it grows. This is something RAG systems fundamentally cannot do. A vector database doesn't know that two chunks contradict each other. It doesn't know that an embedding is stale. The LLM Wiki makes knowledge maintenance a first-class operation.
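Some lint checks need an LLM (contradiction detection), but others are purely mechanical. Orphan detection, for example, is just graph bookkeeping; a minimal sketch, assuming flat `.md` pages that reference each other by filename:

```python
from pathlib import Path

def find_orphans(wiki: Path) -> list[str]:
    """Return wiki pages that no other page (including the index) links to."""
    pages = {p.name: p.read_text() for p in wiki.glob("*.md")}
    orphans = []
    for name in pages:
        if name in ("index.md", "log.md"):
            continue  # structural files are never orphans
        referenced = any(name in text
                         for other, text in pages.items() if other != name)
        if not referenced:
            orphans.append(name)
    return sorted(orphans)
```

Running a check like this on a schedule keeps the cross-reference graph honest as the wiki grows.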
## When Compiled Knowledge Beats RAG
The LLM Wiki isn't a RAG replacement — it's a different architecture for a different scale and stability profile. Here's the decision framework I use with clients.
**Use the LLM Wiki when:**
- Your knowledge base is stable. Product docs that change quarterly, research paper collections, regulatory frameworks, internal process guides. If the corpus updates less than weekly, compilation amortizes well.
- You need answer consistency. RAG answers vary with chunking parameters, embedding model versions, and retrieval randomness. Compiled wiki answers are deterministic against the same wiki state.
- Traceability matters. Healthcare, legal, financial services — anywhere you need to trace an answer back to a specific source document. The raw → wiki → answer chain is cleaner than RAG's chunk-level citations.
- Your corpus is under ~400K words. The index must fit in a single context window. At roughly 100 core documents, you're well within this boundary.
- You want zero infrastructure. No vector database, no embedding pipeline, no chunking strategy to tune. Markdown folders and an LLM API key.
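One of these checks is mechanical: whether your corpus fits the ~400K-word budget. A quick sketch, assuming plain markdown/text sources (PDFs would need extraction first):

```python
from pathlib import Path

WORD_BUDGET = 400_000  # the pattern's rough index-in-one-context ceiling

def corpus_words(raw: Path) -> int:
    # Word count across every markdown/text source under raw/.
    return sum(len(p.read_text(errors="ignore").split())
               for p in raw.rglob("*") if p.suffix in {".md", ".txt"})

def fits_wiki(raw: Path) -> bool:
    return corpus_words(raw) <= WORD_BUDGET
```

If `fits_wiki` comes back false, that alone is a strong signal to stay with RAG or go hybrid.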
For more on where RAG alternatives fit — including Cache-Augmented Generation and GraphRAG — see our comprehensive comparison of RAG alternatives.

**Keep RAG when:**
- You have 100K+ documents. The LLM Wiki's index doesn't scale past a few hundred pages. RAG scales to millions.
- Data changes hourly. Real-time feeds, live dashboards, news aggregation. Recompilation latency is minutes, not milliseconds.
- You need sub-second latency at scale. Wiki navigation adds 3-8 seconds per query. RAG retrieval is 200-500ms. For customer-facing chatbots with latency SLAs, RAG wins.
- Multiple teams need different access controls. RAG systems can filter retrieval by permission. A shared wiki folder is harder to partition.
## The Hybrid Pattern Most Teams Should Build
The real-world answer isn't "LLM Wiki or RAG" — it's both. The hybrid pattern uses compiled knowledge for your core, stable information and RAG for volatile or high-volume data.
Here's the architecture I recommend:
The agent router — which is itself a lightweight LLM call — decides per-query whether the answer lives in the compiled wiki or requires RAG retrieval. "What's our refund policy?" routes to the wiki. "What did the customer say in yesterday's support ticket?" routes to RAG. "How does our refund policy apply to this customer's complaint?" hits both.
This mirrors the context engineering principle that the information environment determines the quality of the answer. The wiki provides stable, pre-verified context. RAG provides fresh, dynamic context. The router decides which the query needs.
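The router logic above can be sketched in a few lines. The `classify` callable stands in for the lightweight LLM call, and the three labels (`stable`, `fresh`, `both`) are my own illustrative convention, not a fixed protocol:

```python
def route(question: str, classify) -> set[str]:
    """Pick backends for a query. `classify` is a placeholder for the
    lightweight LLM router; assume it returns 'stable', 'fresh', or 'both'."""
    label = classify(question)
    return {"stable": {"wiki"}, "fresh": {"rag"}}.get(label, {"wiki", "rag"})

def answer(question: str, classify, ask_wiki, ask_rag) -> str:
    targets = route(question, classify)
    parts = []
    if "wiki" in targets:
        parts.append(ask_wiki(question))  # stable, pre-compiled knowledge
    if "rag" in targets:
        parts.append(ask_rag(question))   # volatile, per-query retrieval
    return "\n".join(parts)
```

The refund-policy-plus-complaint example above would come back labeled `both`, and the answer would stitch together one wiki response and one RAG response.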
## Implementation: Cost and Latency Comparison
I compiled numbers from three client implementations and our internal testing to give you realistic benchmarks.
The latency gap is real — wiki navigation is slower per-query than vector retrieval. But the consistency and traceability gains are substantial for knowledge bases where accuracy matters more than speed. And the total cost of ownership is lower because you're not running a vector database, managing embedding model upgrades, or debugging chunking strategies that lose context.
| Metric | LLM Wiki (80 docs) | Standard RAG (80 docs) | Hybrid |
|---|---|---|---|
| Setup cost | $18-25 (one-time compile) | $3-5 (embedding) | $20-28 |
| Per-query cost | $0.02-0.05 (navigation) | $0.01-0.03 (retrieval + gen) | $0.02-0.04 avg |
| Per-query latency | 4-8s | 1-3s | 2-5s avg |
| Incremental update | $0.15-0.30 per source | $0.01-0.02 per source | Depends on target |
| Answer consistency | High (deterministic) | Medium (retrieval variance) | High for core, medium for dynamic |
| Contradiction detection | Yes (at ingest) | No | Yes for core knowledge |
| Infrastructure | None (markdown + API) | Vector DB + embeddings | Both |
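To make the total-cost-of-ownership claim concrete, here is the arithmetic using midpoints from the table above. The monthly vector-DB hosting figure is my own placeholder assumption for illustration, not a number from the benchmarks:

```python
def total_cost(setup: float, per_query: float, monthly_infra: float,
               queries: int, months: int) -> float:
    # One-time compile/embed cost + API cost per query + infra hosting.
    return setup + per_query * queries + monthly_infra * months

# Midpoints from the table; $70/month for a managed vector DB is an
# assumed placeholder, not a benchmarked figure.
wiki_cost = total_cost(setup=21.50, per_query=0.035, monthly_infra=0.0,
                       queries=1_000, months=6)
rag_cost = total_cost(setup=4.00, per_query=0.02, monthly_infra=70.0,
                      queries=1_000, months=6)
```

Note that on API spend alone RAG stays cheaper; the TCO argument rests on the infrastructure and maintenance line, so your actual hosting costs decide the comparison.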
## What the Community Built in Two Weeks
The velocity of open-source adoption signals that Karpathy hit a real nerve. Within two weeks of the gist, five-plus implementations emerged:
- `npx cachezero ingest ./docs` and you have a compiled wiki

The MCP integration is particularly interesting for teams already using MCP for agent tool access. It means the compiled wiki can serve as a knowledge tool alongside database queries, API calls, and file system access — all through a single protocol.
## Getting Started: The 30-Minute Version
If you want to try this pattern before committing to an implementation:
- Create `raw/` and `wiki/` folders, and put a `CLAUDE.md` schema file at the root defining page structure (concept pages, entity pages, naming conventions).
- Start a `wiki/index.md` file listing every page with a one-line description.

The entire setup takes 30 minutes for a 20-document corpus and costs under $5 in API calls. You'll know within the first five queries whether the pattern fits your use case better than your current RAG pipeline.
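The folder setup itself is scriptable. A minimal scaffold, with the schema text as an illustrative placeholder rather than a canonical CLAUDE.md:

```python
from pathlib import Path

SCHEMA = """# Wiki Schema
- Concept pages: wiki/<topic>.md with Summary, Details, Sources sections
- Every page must appear in wiki/index.md with a one-line description
- Log every compilation action in wiki/log.md
"""

def scaffold(root: Path) -> None:
    # Lay down the three-layer structure: raw sources, compiled wiki, schema.
    (root / "raw").mkdir(parents=True, exist_ok=True)
    wiki = root / "wiki"
    wiki.mkdir(exist_ok=True)
    (root / "CLAUDE.md").write_text(SCHEMA)
    (wiki / "index.md").write_text("# Index\n\n(no pages yet)\n")
    (wiki / "log.md").write_text("# Compilation log\n")
```

From there, drop your sources into `raw/` and start ingesting.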
## The Bigger Picture
The LLM Wiki pattern is part of a broader shift I've been tracking across our client work: the move from retrieval-centric to knowledge-centric AI architectures. RAG solved the "LLMs don't know about my data" problem. But it solved it with a search engine, and search engines don't build understanding.
The next generation of knowledge systems — whether they look like Karpathy's wiki, GraphRAG's entity graphs, or agentic RAG's self-correcting loops — all share a common principle: invest compute upfront to build structured knowledge, rather than spending it per-query to re-derive answers from raw text.
For teams with stable, high-value knowledge bases under a few hundred documents, the LLM Wiki pattern is the simplest version of this idea. No vector database. No embedding pipeline. No chunking strategy. Just markdown, an LLM, and the discipline to treat knowledge as something you compile rather than retrieve.
## Frequently Asked Questions
**What is the LLM Wiki pattern?**

The LLM Wiki is a knowledge architecture proposed by Andrej Karpathy in April 2026. Instead of retrieving raw document chunks at query time like RAG, it uses an LLM to pre-compile raw sources into structured, cross-referenced markdown wiki pages — concept pages, entity pages, and source summaries — with a navigable index. The LLM acts as a full-time librarian that maintains and enriches the knowledge base continuously, rather than a search engine that re-synthesizes from scratch on every query. The pattern went viral with 16 million views and spawned 5+ open-source implementations within two weeks.



