RAND puts AI project failure at 80.3% and MIT Sloan puts GenAI pilot failure at 95%. A March 2026 survey of 650 leaders found 78% running agent pilots but only 14% reaching production. The fix is a four-gate framework (feasibility, then value, then cost-justification, then scale) with a measurable production success metric and an explicit kill-threshold defined on day one. Run paid pilots on your own data with structured human review and opportunity-cost accounting, because the average abandoned initiative burns $7.2M in sunk cost while successful projects return a median 188% ROI.
RAND puts the failure rate of AI projects at 80.3%. MIT Sloan, looking specifically at generative AI, puts the share of pilots that never reach production at 95%. A March 2026 survey of 650 leaders found 78% running agent pilots and only 14% in production. Whichever number you trust, the conclusion is the same: the default outcome of an AI pilot is failure, and the question worth asking is not "how do we run more pilots" but "how do we kill the bad ones faster." That is what an AI pilot ROI framework is for, and it is the single most underbuilt piece of most AI programs.
The reason pilots fail is not usually the model. It is the absence of a measurable business objective tied to the project from day one, which means there is never an honest moment to declare it done, dead, or fundable. Without a stated bar, a pilot that "works" technically drifts into POC purgatory: too promising to cancel, too unproven to ship, quietly consuming budget while everyone waits for someone else to make the call. The failure data backs this up. Of failed projects, 33.8% are abandoned before production, 28.4% complete but deliver no value, and 18.1% can't justify their cost. Each of those buckets is a gate that was never built.
This post lays out the framework I use to decide which POCs deserve production budget and which deserve a clean death: a four-gate sequence (feasibility, value, cost-justification, scale), each with a pass condition and a kill-threshold set before the work starts. It is opinionated on purpose. The board pressure is real: 98% of boards now demand demonstrated AI ROI and 71% of CIOs expect budget cuts on missed mid-2026 targets, so a vague "we're piloting AI" answer no longer survives contact with a finance review.
The Data on Why AI Pilots Die
Start with the numbers, because they reframe the problem. The instinct in most organizations is to treat a failed pilot as a surprise, a bet that didn't pay off. The aggregate data says the opposite: failure is the base rate, and a pilot that succeeds is the exception that needs explaining.
Three figures anchor this. RAND's analysis lands at 80.3% of AI projects failing to deliver business value. MIT Sloan's GenAI-specific work lands at 95% of pilots never scaling. The March 2026 survey of 650 leaders shows the gap concretely: 78% have agent pilots running, 14% have anything in production. The 64-point drop between "we started" and "it shipped" is POC purgatory rendered as a statistic.
The more useful data is the breakdown of how they die, because each failure mode maps to a missing gate:
| Failure mode | Share of failed projects | The gate that was missing |
|---|---|---|
| Abandoned pre-production | 33.8% | Feasibility (never cleared the bar) |
| Complete but no business value | 28.4% | Value (no metric tied to it) |
| Can't justify the cost | 18.1% | Cost-justification (run cost ignored) |
That table is the entire thesis. A third of failures are projects that never worked and were allowed to run anyway. Another quarter worked perfectly and moved no business number, because no business number was ever named. Roughly a fifth produced value but cost more to run than the value was worth, because token spend, human review, and infrastructure were treated as free during the pilot. None of these are model problems. They are decision-process problems, and a decision process is exactly what stage gates impose.
The economics make the case for that discipline unanswerable. Across abandoned initiatives, the average sunk cost is roughly $7.2M. Successful projects, by contrast, return a median 188% ROI. The spread between those two outcomes is not luck. It is whether the project was forced to clear measurable bars at each stage or allowed to coast on demo enthusiasm until the money ran out. For a related pattern, our breakdown of why 95% of so-called AI agents are expensive chatbots shows the same gap between demo and production showing up in agent procurement specifically.
Root Cause: No Measurable Business Objective on Day One
If you fix one thing, fix this. The dominant root cause of pilot failure is that no measurable business objective was attached to the project before work began. Everything downstream, the drift, the purgatory, the inability to justify cost, flows from that single omission.
A measurable objective is not "improve customer support with AI." It is "reduce median time-to-resolution on tier-1 tickets from 14 hours to under 4 hours while holding CSAT flat or better." The second version has a number, a baseline, a direction, and an implicit kill-threshold: if the pilot can't get under, say, 8 hours, it isn't worth scaling. The first version can never fail, which is exactly why it can never succeed. A project that cannot fail has no decision point, and a project with no decision point lives forever.
The test I apply: before a pilot gets budget, someone must be able to complete the sentence "this pilot is a success if _ and a failure if _," using a number that already has a baseline. If nobody can fill both blanks, the pilot is not ready to start. It is ready for a thirty-minute scoping conversation that produces those blanks, and that conversation is the cheapest risk reduction available. This is the same discipline we argue for in building evaluation datasets for business AI: you cannot manage what you have not committed to measuring against a held-out baseline.
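One way to make the two blanks concrete is to write the objective down as data before any budget moves. The sketch below is illustrative only: the `PilotObjective` class and its field names are hypothetical, the figures are the support example from this section, and treating results between the success bar and the kill bar as "review" is an assumption rather than part of the framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the bars cannot be edited once the pilot starts
class PilotObjective:
    metric: str            # named business metric that already has a baseline
    baseline: float        # measured value before the pilot
    success_at: float      # "this pilot is a success if the metric beats ..."
    kill_at: float         # "... and a failure if it cannot beat ..."
    lower_is_better: bool = True

    def verdict(self, observed: float) -> str:
        """Return 'success', 'kill', or 'review' for an observed metric value."""
        # Flip the sign so the same comparisons work for higher-is-better metrics.
        sign = 1.0 if self.lower_is_better else -1.0
        value = sign * observed
        if value < sign * self.success_at:
            return "success"
        if value >= sign * self.kill_at:
            return "kill"
        return "review"  # assumption: the in-between zone goes to a human decision


# The tier-1 support example from the text: 14h baseline, target under 4h,
# not worth scaling if it cannot get under 8h.
objective = PilotObjective(
    metric="median tier-1 time-to-resolution (hours)",
    baseline=14.0,
    success_at=4.0,
    kill_at=8.0,
)
print(objective.verdict(6.5))
```

If nobody can supply values for `success_at` and `kill_at`, the record cannot be constructed, which is the scoping conversation surfacing in code form.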
The Four-Gate Framework
The framework is a sequence, not a checklist. Each gate is cheaper to clear than the next, so the design forces expensive failures to surface at cheap stages. You do not proceed to value testing until feasibility passes, and you do not spend on scale engineering until cost-justification passes. The whole point is to spend the least money required to reach a confident no.
Gate 1: Feasibility on Your Own Data
Feasibility asks one question: on a held-out sample of your actual production data, can the system clear a defined accuracy or task-completion bar? Two words carry the weight: your data and held-out. Vendor demo data is curated to pass; your data has the malformed inputs, the ambiguous cases, and the long tail that decide real outcomes. A held-out sample, scored after the build rather than tuned against, is the only honest feasibility signal. Time-box this gate hard: four to six weeks is plenty for a feasibility read. A feasibility pilot that runs longer is usually avoiding the no. The kill-threshold here is the cheapest one you will ever set: if the system can't reach the bar on representative data within the box, stop. You have spent weeks, not the $7.2M that comes from carrying an infeasible project to month nine.
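A minimal sketch of that decision rule, assuming each held-out case has already been scored as completed or not. The function name `feasibility_gate`, the 90% default bar, and the dates are illustrative assumptions, not prescribed values.

```python
from datetime import date


def feasibility_gate(
    outcomes: list[bool],     # one entry per held-out case: did the task complete?
    today: date,
    deadline: date,           # end of the time-box, agreed before work started
    pass_bar: float = 0.90,   # illustrative bar; set your own in advance
) -> str:
    """Return 'pass', 'continue', or 'kill' for the feasibility gate."""
    if not outcomes:
        raise ValueError("score at least one held-out case before deciding")
    completion_rate = sum(outcomes) / len(outcomes)
    if completion_rate >= pass_bar:
        return "pass"
    if today >= deadline:
        return "kill"      # below the bar and out of time: stop
    return "continue"      # below the bar, but still inside the time-box


# Hypothetical read: 41 of 50 held-out cases completed, scored at the deadline.
scored = [True] * 41 + [False] * 9
print(feasibility_gate(scored, today=date(2026, 5, 8), deadline=date(2026, 5, 8)))
```

The deadline is an argument rather than a judgment call, so "one more sprint" has to be argued as a change to a written parameter.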
Gate 2: Value, Not Accuracy
Gate 2 is where 28.4% of failed projects die in hindsight, because they cleared feasibility and nobody asked whether feasibility mattered. A model can hit 94% accuracy and move zero business metrics if the 6% it misses are the only cases that were ever expensive, or if the task it automates was never the bottleneck. The value gate requires that clearing the feasibility bar produces a pre-stated movement in a named business metric: tickets resolved per agent-hour, days to close, defect escape rate, revenue per rep. If the metric doesn't move by the amount you committed to, the project fails the value gate even though the demo looked great. This is the gate teams most want to skip and least can afford to.
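The value gate reduces to one comparison: did the named metric move by at least the amount committed to in advance? The sketch below assumes a positive baseline and a relative (percentage) commitment. The function names are hypothetical, and the numbers reuse the 14-hour time-to-resolution baseline with an illustrative 20% commitment.

```python
def relative_improvement(
    baseline: float, observed: float, lower_is_better: bool = True
) -> float:
    """Fractional movement of the metric in the desired direction."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    delta = baseline - observed if lower_is_better else observed - baseline
    return delta / baseline


def value_gate(
    baseline: float,
    observed: float,
    committed_improvement: float,   # e.g. 0.20 for a pre-stated 20% improvement
    lower_is_better: bool = True,
) -> bool:
    """Pass only if the metric moved by at least the pre-stated amount."""
    return relative_improvement(baseline, observed, lower_is_better) >= committed_improvement


# 14h -> 12h is a 14% improvement: a good demo, but a failed value gate at 20%.
print(value_gate(baseline=14.0, observed=12.0, committed_improvement=0.20))
# 14h -> 10h is a 29% improvement and clears the same commitment.
print(value_gate(baseline=14.0, observed=10.0, committed_improvement=0.20))
```

Model accuracy never appears as an input. That is deliberate: this gate only sees the business metric.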
Gate 3: Cost-Justification at Loaded Cost
Gate 3 kills the 18.1% that work and add value but cost more than they save. The mistake is computing value against a pilot's "free" run cost. At production volume, the fully loaded cost includes token spend, the human review that most production AI still requires, retries, infrastructure, and the engineering time to keep it running. The gate is simple: cost per successful outcome must beat the baseline the system replaces. If a support deflection costs $1.40 in loaded AI spend and the human path costs $0.90, you have an expensive automation, not a win. Build this math before scale, not after the invoice. Our build vs buy AI decision guide covers the loaded-cost modeling that separates a real saving from a vanity one.
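The loaded-cost arithmetic can be sketched in a few lines. The cost categories mirror the ones listed above. The `LoadedCost` class and the monthly dollar figures are hypothetical, chosen only so that the result reproduces the $1.40 versus $0.90 example.

```python
from dataclasses import dataclass


@dataclass
class LoadedCost:
    """Fully loaded run cost for one period at production volume."""
    token_spend: float
    human_review: float
    retries: float
    infrastructure: float
    engineering: float

    def total(self) -> float:
        return (
            self.token_spend
            + self.human_review
            + self.retries
            + self.infrastructure
            + self.engineering
        )


def cost_per_successful_outcome(costs: LoadedCost, successful_outcomes: int) -> float:
    """Divide by successes, not attempts: failed runs still cost money."""
    if successful_outcomes <= 0:
        raise ValueError("no successful outcomes to spread the cost over")
    return costs.total() / successful_outcomes


def cost_gate(costs: LoadedCost, successful_outcomes: int, baseline_cost: float) -> bool:
    """Pass only if the loaded cost per outcome beats the path it replaces."""
    return cost_per_successful_outcome(costs, successful_outcomes) < baseline_cost


# Hypothetical month: $14,000 loaded spend over 10,000 successful deflections.
month = LoadedCost(
    token_spend=4_000, human_review=6_000, retries=500,
    infrastructure=1_500, engineering=2_000,
)
print(round(cost_per_successful_outcome(month, 10_000), 2))
print(cost_gate(month, 10_000, baseline_cost=0.90))
```

In this made-up breakdown, human review is the largest line item, which is the component most often left out of pilot-stage math.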
Gate 4: Scale and Governance
Gate 4 is the production-readiness gate, and it is where reliability, latency, and governance get tested at full load. A pilot that passes the first three gates at 50 requests a day can still fail at 50,000: latency degrades, reliability drops below the threshold the business actually needs, or an audit requirement that was waved through in the pilot becomes a hard blocker. Reliability is the usual culprit, and it is worth measuring separately from accuracy; the gap between "passes evals" and "behaves in production" is the subject of our piece on why reliability lags accuracy in production agents. If there is a reliability or compliance gap with no clear path to close it, gate 4 is a kill, even this late. A late kill is painful, but shipping an unreliable system is worse.
| Gate | Question it answers | Pass condition (example) | Kill-threshold (example) |
|---|---|---|---|
| 1. Feasibility | Can it hit the accuracy bar on our real data? | >=90% task completion on held-out sample | <80% after the time-box, or unfixable edge cases |
| 2. Value | Does hitting that bar move a business metric? | Target metric moves by the pre-stated amount | No measurable movement, or movement < threshold |
| 3. Cost-justification | Does the value exceed the fully loaded run cost? | Cost per successful outcome beats the baseline | Loaded cost per outcome >= manual baseline |
| 4. Scale | Does it hold at production volume and governance? | Latency, reliability, audit all pass at full load | Reliability or compliance gap with no clear fix |
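The "sequence, not a checklist" property can be shown with a short runner that evaluates gates in order and stops at the first failure, so later and more expensive gates are never paid for. This is a sketch under assumptions: `run_gates` and `GateResult` are hypothetical names, and each gate is reduced to a function returning pass or fail.

```python
from typing import Callable, NamedTuple


class GateResult(NamedTuple):
    gate: str
    passed: bool


def run_gates(gates: list[tuple[str, Callable[[], bool]]]) -> list[GateResult]:
    """Evaluate gates in order; stop at the first failure."""
    results: list[GateResult] = []
    for name, check in gates:
        passed = bool(check())
        results.append(GateResult(name, passed))
        if not passed:
            break  # later gates are never evaluated, so never paid for
    return results


def decision(results: list[GateResult]) -> str:
    """Turn gate results into the go/no-go artifact."""
    if results and not results[-1].passed:
        return f"kill at {results[-1].gate}"
    return "go"


# Hypothetical pilot: clears feasibility, then fails to move the business metric.
pilot = [
    ("feasibility", lambda: True),
    ("value", lambda: False),
    ("cost-justification", lambda: True),   # never reached
    ("scale", lambda: True),                # never reached
]
print(decision(run_gates(pilot)))
```

In practice each check would wrap the measurement for that gate. The ordering is what matters: the cheapest evidence is collected first.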
Set the Kill-Threshold Before You Start
The hardest part of this framework is not technical. It is committing to the kill-threshold while you are still optimistic, before sunk cost and team attachment make every project worth "just one more sprint."
A kill-threshold is a number plus a deadline, agreed in writing before the work starts. "If feasibility is below 80% on held-out data by week five, we stop." "If the value gate shows under a 20% improvement in time-to-resolution, we stop." The reason to write it down in advance is psychological, not procedural. In the moment, every struggling project has a sympathetic story: the model is almost there, the next data batch will fix it, the edge cases are rare. Those stories are how a project that should have died at gate one reaches month nine and the $7.2M average. The pre-committed threshold is the only thing that reliably beats the sunk-cost reflex, because it moves the decision out of the emotional moment and into a contract with your past, more objective self.
This is also where opportunity-cost accounting belongs. A pilot does not just cost its own budget; it costs the next pilot that didn't get those engineers, that data access, that executive sponsorship. When you carry a dead project, you are not being patient; you are spending the budget of a viable one. The kill-threshold protects the portfolio, not just the line item.
Make It a Paid Pilot With Structured Review
Two practices sharpen every gate above. First, run paid pilots, not free ones. A paid pilot, even a small one, forces both sides to treat it as real: the vendor engages with your actual constraints, and you gain the contractual leverage to demand the held-out evaluation gate 1 requires. Free pilots optimize for impressiveness; paid pilots optimize for the production decision, because money is on the line for everyone.
Second, build structured human review into the pilot from the start. Across systems we have reviewed, the projects that clear the value and cost gates honestly are the ones where humans systematically scored a sample of outputs against a rubric, rather than eyeballing a demo and declaring it good. Structured review is what turns "it feels accurate" into the held-out number gate 1 needs, and it is what surfaces the expensive-edge-case problem that quietly kills the value gate. The same human-in-the-loop discipline carries into production as a fallback layer; our framework for choosing between AI, rules, and humans as the decision layer maps where that review should sit permanently.
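A sketch of what "scored against a rubric" can look like in practice. The three criteria in `RUBRIC`, the `Review` record, and the sample data are all hypothetical. The point is that every sampled output gets an explicit pass or fail per criterion, and the pass rate is computed rather than felt.

```python
from dataclasses import dataclass

# Hypothetical rubric; replace with criteria that match your task.
RUBRIC = ("correct", "complete", "safe_to_send")


@dataclass
class Review:
    output_id: str
    scores: dict[str, bool]   # criterion -> did the reviewer mark it as passing?


def output_passes(review: Review) -> bool:
    """An output passes only if every rubric criterion passes."""
    # A criterion the reviewer skipped counts as a failure, not a pass.
    return all(review.scores.get(criterion, False) for criterion in RUBRIC)


def pass_rate(reviews: list[Review]) -> float:
    """Share of sampled outputs that pass the whole rubric."""
    if not reviews:
        raise ValueError("no reviews to aggregate")
    return sum(output_passes(r) for r in reviews) / len(reviews)


def failures_by_criterion(reviews: list[Review]) -> dict[str, int]:
    """Count failures per criterion to show where the system actually breaks."""
    return {c: sum(not r.scores.get(c, False) for r in reviews) for c in RUBRIC}


sample = [
    Review("t-001", {"correct": True, "complete": True, "safe_to_send": True}),
    Review("t-002", {"correct": True, "complete": False, "safe_to_send": True}),
    Review("t-003", {"correct": True, "complete": True, "safe_to_send": True}),
    Review("t-004", {"correct": True, "complete": True, "safe_to_send": True}),
]
print(pass_rate(sample))
print(failures_by_criterion(sample))
```

The per-criterion breakdown is the part a demo review never produces: it shows whether failures cluster in one place, which is where expensive edge cases tend to hide.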
This is the work Particula Tech runs as a fixed-scope diagnostic: a paid, time-boxed pilot on your own data with a structured evaluation rubric, ending in a documented go/no-go against the four gates rather than another optimistic deck. The deliverable is a decision, which is the one artifact POC purgatory never produces on its own.
Run It Against the Board Pressure, Not Around It
The framework also happens to be the cleanest answer to the board question every AI program now faces. With 98% of boards demanding demonstrated ROI and 71% of CIOs expecting budget cuts on missed mid-2026 targets, "we have twelve pilots running" is no longer a reassuring sentence. It reads as twelve unbounded liabilities.
A gated portfolio reads completely differently. "We have twelve pilots; four cleared the value gate and are in cost-justification, three were killed at feasibility for a combined cost of six weeks, and the rest are pre-gate" is a sentence that survives a finance review, because it shows a process that produces decisions and contains losses. The median 188% ROI on successful projects is achievable precisely because the gates concentrate spend on the projects that earn it and starve the ones that don't. The deeper strategic context, how AI initiatives map to real ROI rather than perpetual cost, sits in our AI for Business pillar.
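Producing that board-level sentence is mostly bookkeeping once every pilot carries an explicit gate status. The sketch below reproduces the twelve-pilot example. The `Status` values and the pilot names are hypothetical.

```python
from collections import Counter
from enum import Enum


class Status(Enum):
    PRE_GATE = "pre-gate"
    IN_COST_JUSTIFICATION = "cleared value, in cost-justification"
    KILLED_AT_FEASIBILITY = "killed at feasibility"


def summarize(portfolio: dict[str, Status]) -> dict[str, int]:
    """Count pilots by gate status for a one-line portfolio report."""
    return dict(Counter(status.value for status in portfolio.values()))


# The twelve-pilot example: four in cost-justification, three killed, five pre-gate.
statuses = (
    [Status.IN_COST_JUSTIFICATION] * 4
    + [Status.KILLED_AT_FEASIBILITY] * 3
    + [Status.PRE_GATE] * 5
)
portfolio = {f"pilot-{i:02d}": s for i, s in enumerate(statuses, start=1)}

for label, count in summarize(portfolio).items():
    print(f"{count} {label}")
```

A pilot with no status cannot be added to the dictionary at all, which is a small structural guard against the unbounded "we have twelve pilots running" answer.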
The uncomfortable truth in the failure data is that most organizations already have the pilots they need to win; they are just unwilling to kill the ones they need to lose. The framework does not make AI projects succeed. It makes them fail cheaply and early, which, given an 80.3% base rate, is the same thing as winning.
Frequently Asked Questions
Quick answers to common questions about this topic
Why do most AI pilots fail to scale?
Most AI pilots fail to scale because no measurable business objective was defined on day one. RAND finds 80.3% of AI projects never deliver business value and MIT Sloan finds 95% of GenAI pilots never reach production. The failure breakdown is specific: 33.8% are abandoned before production, 28.4% complete technically but produce no value, and 18.1% can't justify their cost. The common thread is a demo-grade pilot built to impress rather than a production-grade pilot built to clear a stated revenue, cost, or cycle-time bar. When success is never quantified, there is no honest moment to declare the project done, so it drifts in POC purgatory instead of shipping or dying.



