Neo4j knowledge graph with 12 million nodes and 89 million relationships powering GraphRAG queries across a wholesale distribution company's supply chain, enabling multi-hop relational reasoning that standard RAG cannot handle.

Python · Neo4j · LangChain · Claude · Apache Kafka · FastAPI · PostgreSQL · Cypher · OpenAI Embeddings · Redis
A wholesale distribution company operating across three countries needed to answer questions that required tracing connections across their entire supply chain: customers to orders, orders to products, products to components, components to suppliers. Their existing RAG system handled document retrieval well, but questions like 'Which suppliers provide components for products our top customers order most frequently?' returned irrelevant results because vector similarity search doesn't understand entity relationships.
We built a GraphRAG system that models the company's business relationships as a Neo4j knowledge graph and uses graph traversal to answer relational queries that standard RAG cannot handle. The system ingests data from 14 sources, resolves entities across systems, and lets anyone ask multi-hop relationship questions in natural language. The graph now contains 12 million nodes and 89 million relationships, and handles the 7% of daily queries that require relational reasoning, the queries that generate the highest business value.
The GraphRAG system was built as the relational intelligence layer within a broader data platform that also includes Cache-Augmented Generation (CAG) for fast lookups and standard RAG for document search. This case study focuses specifically on the knowledge graph construction, entity extraction pipeline, GraphRAG query system, and the intelligent routing that directs relational queries to the graph.
| Component | Focus | Status | Key Deliverables |
|---|---|---|---|
| 1 | Graph Schema Design | Completed | Entity type modeling, relationship mapping, property schema, traversal optimization, reduced from 40M to 12M nodes by converting lookup values to properties |
| 2 | Entity Extraction Pipeline | Completed | CDC pipelines via Kafka, LLM-based extraction from unstructured sources, multi-signal identity resolution (98% → 99.3% automatic matching) |
| 3 | GraphRAG Query Pipeline | Completed | Query understanding, template-based Cypher generation (96% accuracy), graph traversal with depth limits, LLM response synthesis with citation trails |
| 4 | Query Routing Integration | Completed | AI classifier routing queries to CAG, RAG, or GraphRAG based on intent, routing accuracy improved from 84% to 96% over three months |
The most consequential decision was designing which entities become nodes, which connections become relationships, and which data points become properties. We cataloged every entity type across 14 data sources and mapped their real-world connections.
The core schema models the supply chain: Customer nodes connect to Order nodes via PLACED relationships. Order nodes connect to Product nodes via CONTAINS relationships with quantity and pricing. Product nodes connect to Component nodes via REQUIRES relationships. Component nodes connect to Supplier nodes via SOURCED_FROM relationships with lead time, cost, and contract terms. Additional node types include Contracts, Support Tickets, and Regions.
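As a rough illustration of the chain described above, the sketch below encodes the four core relationships and builds the Cypher path pattern that a customer-to-supplier question would traverse. The node labels and relationship types come from the schema; the `SCHEMA_EDGES` list and the `path_pattern` helper are hypothetical names introduced only for this example, and the sketch assumes each label has a single outgoing relationship in the core chain.

```python
# Core supply-chain edges as (source label, relationship type, target label).
# Labels and relationship types follow the schema described in the text;
# the helper below is a hypothetical illustration, not the production code.
SCHEMA_EDGES = [
    ("Customer", "PLACED", "Order"),
    ("Order", "CONTAINS", "Product"),
    ("Product", "REQUIRES", "Component"),
    ("Component", "SOURCED_FROM", "Supplier"),
]


def path_pattern(start: str, end: str) -> str:
    """Build a Cypher path pattern by walking the edge chain from start to end."""
    next_edge = {src: (rel, dst) for src, rel, dst in SCHEMA_EDGES}
    pattern, label = f"(:{start})", start
    while label != end:
        if label not in next_edge:
            raise ValueError(f"no path from {start} to {end}")
        rel, label = next_edge[label]
        pattern += f"-[:{rel}]->(:{label})"
    return pattern


print(path_pattern("Customer", "Supplier"))
```

The printed pattern spans four hops, which is why a question about a customer's suppliers cannot be answered by a single lookup or a similarity search.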
Early iterations modeled everything as nodes: invoice line items, shipping events, email threads. The graph ballooned to 40 million nodes and traversal performance degraded. We applied a pruning principle: if data is a lookup value rather than something you'd traverse through, it belongs as a property on a node. Shipping tracking numbers became Order properties. Invoice amounts became properties on the CONTAINS relationship. This cut the graph to 12 million nodes with dramatically faster traversals.
Storing business data on relationships proved essential. Quantity, unit price, and order date on the CONTAINS relationship between Orders and Products meant we could answer 'What's the average order value for products from Vendor X?' without joining external tables at query time.
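To make the point about relationship properties concrete, here is a minimal in-memory sketch of the aggregation. It assumes the set of products tied to a vendor has already been found by traversal, and the `Contains` class, the `average_order_value` function, and the sample data are all invented for illustration; the real system runs this kind of aggregation in Cypher against Neo4j.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Contains:
    """One (Order)-[:CONTAINS]->(Product) relationship with its properties."""
    order_id: str
    product_id: str
    quantity: int
    unit_price: float


def average_order_value(edges, product_ids):
    """Average per-order value, counting only lines for the given products."""
    totals = defaultdict(float)
    for edge in edges:
        if edge.product_id in product_ids:
            totals[edge.order_id] += edge.quantity * edge.unit_price
    return sum(totals.values()) / len(totals) if totals else 0.0


# Made-up sample data for illustration only.
edges = [
    Contains("o1", "pump", 2, 100.0),
    Contains("o1", "valve", 1, 50.0),
    Contains("o2", "pump", 1, 100.0),
    Contains("o3", "hose", 4, 10.0),
]
vendor_x_products = {"pump", "valve"}  # assumed result of the supplier traversal
print(average_order_value(edges, vendor_x_products))  # 175.0
```

Because quantity and unit price travel with the relationship, the calculation needs nothing beyond the edges the traversal already touched.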
Building the graph required extracting entities from structured databases, semi-structured APIs, and unstructured documents, then resolving them into unified identities. This consumed more engineering time than any other component.
Structured sources mapped directly: customer records from Salesforce, product catalogs from SAP, and supplier data from procurement became graph nodes via CDC pipelines on Apache Kafka. When a sales rep updates a record in Salesforce, the corresponding Neo4j node updates within seconds.
Unstructured sources (emails, support tickets, and contract PDFs) required LLM-based extraction to identify entity mentions and map them to existing graph nodes. A support ticket mentioning 'the Q4 order for the industrial pumps' needed to resolve to specific Order and Product nodes.
Identity resolution was the hardest problem. The same customer appeared as 'Acme Industries' in SAP, 'Acme Industries Inc.' in Salesforce, 'ACME IND' in shipping, and just an email in Stripe. We built a multi-signal matching system combining tax IDs, email domains, phone numbers, addresses, and fuzzy name matching. After training on 2,400 manually verified matches, the system achieved 98% automatic resolution, climbing to 99.3% over three months as human corrections fed back into the model.
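A minimal sketch of multi-signal matching follows, to show how exact signals and fuzzy name similarity can be combined into one score. Everything specific here is an assumption: the weights, the 0.85 threshold, the suffix list, and the `Record` fields are illustrative, and this hand-weighted scorer stands in for the trained model described above rather than reproducing it. Signals missing from either record are skipped, so a record with only an email domain can still be compared.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical weights and threshold, chosen for illustration only.
WEIGHTS = {"tax_id": 0.5, "email_domain": 0.2, "phone": 0.1, "name": 0.2}
AUTO_MATCH_THRESHOLD = 0.85
LEGAL_SUFFIXES = {"inc", "ltd", "llc", "corp", "co"}


@dataclass
class Record:
    name: str = ""
    tax_id: Optional[str] = None
    email_domain: Optional[str] = None
    phone: Optional[str] = None


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def match_score(a: Record, b: Record) -> float:
    """Weighted agreement over the signals present in both records."""
    score = available = 0.0
    for signal in ("tax_id", "email_domain", "phone"):
        x, y = getattr(a, signal), getattr(b, signal)
        if x and y:
            available += WEIGHTS[signal]
            score += WEIGHTS[signal] * (x == y)
    if a.name and b.name:
        available += WEIGHTS["name"]
        score += WEIGHTS["name"] * name_similarity(a.name, b.name)
    return score / available if available else 0.0


sap = Record("Acme Industries", tax_id="12-345", email_domain="acme.com")
salesforce = Record("Acme Industries Inc.", tax_id="12-345", email_domain="acme.com")
shipping = Record("ACME IND")

for other in (salesforce, shipping):
    score = match_score(sap, other)
    decision = "auto-match" if score >= AUTO_MATCH_THRESHOLD else "human review"
    print(f"{other.name!r}: {score:.2f} -> {decision}")
```

In this sketch the abbreviated shipping name falls below the threshold and would be queued for review, which is the kind of case where human corrections can feed back into the matcher.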
The pipeline converts natural language questions into graph traversals in four stages. First, query understanding: an LLM identifies entity types, relationships, and expected answer format. 'Which suppliers provide components for products Acme orders regularly?' parses into a start node, a traversal path, and an aggregation target.
Second, Cypher generation. We initially had the LLM write Cypher directly from natural language. This produced correct queries about 77% of the time: impressive for demos, unacceptable for production. We switched to a template-based approach: the understanding stage classifies the question into one of roughly 30 query patterns, each with a parameterized Cypher template. The LLM fills in parameters rather than writing Cypher from scratch, bringing accuracy above 96%.
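The template approach can be sketched roughly as follows. The pattern name, the `name` property on Customer nodes, and the `build_query` helper are assumptions made for this example; only the labels and relationship types come from the schema. The key idea is that the LLM's output is limited to a pattern name and a dictionary of parameter values, which are validated and passed to the database separately from the query text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CypherTemplate:
    cypher: str
    required_params: frozenset


# One illustrative entry; the production system has roughly 30 patterns.
TEMPLATES = {
    "suppliers_for_customer_products": CypherTemplate(
        cypher=(
            "MATCH (c:Customer {name: $customer_name})-[:PLACED]->(:Order)"
            "-[:CONTAINS]->(:Product)-[:REQUIRES]->(:Component)"
            "-[:SOURCED_FROM]->(s:Supplier) "
            "RETURN s.name AS supplier, count(*) AS occurrences "
            "ORDER BY occurrences DESC LIMIT $limit"
        ),
        required_params=frozenset({"customer_name", "limit"}),
    ),
}


def build_query(pattern: str, params: dict):
    """Validate LLM-supplied parameters and return (cypher, params)."""
    if pattern not in TEMPLATES:
        raise KeyError(f"unknown query pattern: {pattern}")
    template = TEMPLATES[pattern]
    missing = template.required_params - set(params)
    extra = set(params) - template.required_params
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    # Values stay out of the query text; the driver binds them at run time.
    return template.cypher, dict(params)


cypher, params = build_query(
    "suppliers_for_customer_products",
    {"customer_name": "Acme Industries", "limit": 10},
)
print(cypher)
print(params)
```

With the official Neo4j Python driver, the returned pair could be executed with something like `session.run(cypher, params)`. Since the model never emits query text, a misread question can at worst select the wrong template or parameter, which is easier to detect than malformed Cypher.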
Third, graph traversal executes against Neo4j with guardrails: 3-hop default traversal depth (up to 5 for analytical queries), 500-node result cap, and query timeouts. These prevent runaway traversals where a query touching one well-connected entity would return half the graph.
Fourth, response synthesis. The LLM receives the original question plus structured graph results, not raw Cypher output, and generates a natural language answer. Every response includes the Cypher query and contributing node IDs so users can verify any claim against the underlying data.
Running GraphRAG alongside CAG and standard RAG required a routing layer that automatically directs each query to the optimal retrieval method. We built an AI classifier that evaluates whether a query mentions entities and relationships (favors GraphRAG), asks for document content (favors RAG), or requests factual lookups about current data (favors CAG).
The classifier understands intent, not just keywords. 'Tell me about Acme' routes to CAG for basic company information. 'Tell me about our relationship with Acme' routes to RAG for a document-based narrative. 'Show me every supplier connected to products Acme orders' routes to GraphRAG for relational traversal.
Some queries combine methods. 'Summarize our top 10 customers' recent issues and which suppliers are involved' pulls customer rankings from CAG, support ticket content from RAG, and supplier connections from GraphRAG. A synthesis agent merges the results into a coherent answer.
Routing accuracy started at 84% and reached 96% through a user feedback loop. Low-rated responses are analyzed for routing errors, and each misrouted query becomes a training signal for the classifier. Most early errors were GraphRAG queries incorrectly sent to standard RAG, because the classifier initially underestimated how many business questions require relational reasoning.
Graph construction costs exceeded projections. Building the initial knowledge graph from 14 sources took eight weeks instead of four. The bottleneck wasn't Neo4j ingestion: it was data cleaning, normalization, and entity resolution upstream. The graph database was the easy part; data preparation was the real investment.
Schema evolution required careful migration. Adding 'warehouse location' as a new node type meant re-evaluating traversal patterns, updating Cypher templates, and regression testing existing queries. We now version the graph schema and run automated query suites before any schema change goes live.
Graph query latency spiked on deep traversals. Occasionally a query triggered a traversal touching hundreds of thousands of nodes. We added traversal budgets (maximum nodes per hop and in total) plus query timeouts. When a traversal exceeds its budget, the system returns partial results with an explanation rather than timing out silently.
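The budget idea can be sketched as a breadth-first traversal over an in-memory adjacency map. The 3-hop default and 500-node cap mirror the guardrails mentioned earlier; the per-hop limit of 100, the `TraversalResult` type, and the `bounded_traverse` function are hypothetical, and query timeouts are left out for brevity. The real guardrails run against Neo4j rather than a Python dictionary.

```python
from dataclasses import dataclass


@dataclass
class TraversalResult:
    nodes: list
    complete: bool
    note: str = ""


def bounded_traverse(graph, start, max_depth=3, max_per_hop=100, max_total=500):
    """Breadth-first traversal that stops early once a budget is exhausted."""
    visited = {start}
    order = [start]
    frontier = [start]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor in visited:
                    continue
                if len(next_frontier) >= max_per_hop:
                    note = f"hop {depth} exceeded the per-hop budget of {max_per_hop} nodes"
                    return TraversalResult(order, False, note)
                if len(order) >= max_total:
                    note = f"total budget of {max_total} nodes reached at hop {depth}"
                    return TraversalResult(order, False, note)
                visited.add(neighbor)
                order.append(neighbor)
                next_frontier.append(neighbor)
        if not next_frontier:
            break
        frontier = next_frontier
    return TraversalResult(order, True)


# A single well-connected node is enough to blow the per-hop budget.
hub_graph = {"hub": [f"n{i}" for i in range(1000)]}
result = bounded_traverse(hub_graph, "hub")
print(len(result.nodes), result.complete, result.note)
```

Returning the nodes gathered so far together with a note lets the response layer tell the user that the answer is partial and why.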
Entity freshness is an ongoing concern. Structured sources update in near-real-time via Kafka CDC pipelines. Unstructured sources have inherent lag because emails and contracts go through daily batch extraction. Graph query responses now clearly label the freshness of underlying data so users know when relationship information was last updated.
The GraphRAG system handles the 7% of daily queries that require multi-hop relational reasoning. These are the highest-value queries: supply chain impact analysis, cross-entity aggregation, temporal relationship patterns, and path discovery between entities. Questions that previously required hours of manual cross-referencing across systems now return answers in an average of 1.8 seconds.
Within the first month, sales identified $340,000 in at-risk revenue by running relationship queries that were previously impossible: finding customers whose order frequency dropped while their industry peers increased orders, then tracing those drops back through the supply chain to specific supplier and product changes.
The template-based Cypher generation approach delivers 96% accuracy on production queries, up from 77% with direct LLM-generated Cypher. The query routing classifier reaches 96% accuracy in directing questions to the correct retrieval method. The identity resolution system automatically matches 99.3% of cross-system records.
The platform processes 2,400 natural language queries per day across 180 active users. The three-tier architecture keeps costs manageable: 71% of queries resolve through CAG, 22% through standard RAG, and 7% through GraphRAG. Intelligent routing ensures GraphRAG compute is reserved for queries that genuinely require relational traversal.
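For a sense of scale, the stated shares translate into approximate daily volumes per tier. The figures below are simple arithmetic on the numbers above, and they assume the percentages are rounded, so the real counts will vary from day to day.

```python
DAILY_QUERIES = 2400
SHARE_PERCENT = {"CAG": 71, "RAG": 22, "GraphRAG": 7}

# Integer arithmetic avoids floating-point rounding in the per-tier counts.
daily_volume = {
    tier: DAILY_QUERIES * pct // 100 for tier, pct in SHARE_PERCENT.items()
}
print(daily_volume)  # {'CAG': 1704, 'RAG': 528, 'GraphRAG': 168}
```

So the graph serves on the order of 170 queries a day, while the cheaper tiers absorb the remaining 2,200 or so.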
Healthcare, defense-adjacent, and enterprise clients sign NDAs that prevent naming. Engagement scope, technology stack, and measured outcomes can be shared publicly. Client identity stays protected.