Why do most AI pilots fail to scale to production?

Most AI pilots fail to scale because no measurable business objective was defined on day one. RAND finds 80.3% of AI projects never deliver business value and MIT Sloan finds 95% of GenAI pilots never reach production. The failure breakdown is specific: 33.8% are abandoned before production, 28.4% complete technically but produce no value, and 18.1% can't justify their cost. The common thread is a demo-grade pilot built to impress rather than a production-grade pilot built to clear a stated revenue, cost, or cycle-time bar. When success is never quantified, there is no honest moment to declare the project done, so it drifts in POC purgatory instead of shipping or dying.

What is an AI ROI gate framework?

An AI ROI gate framework is a sequence of explicit decision points a project must clear before it advances and before it consumes more budget. The four gates are feasibility (can the model hit the accuracy or task-completion bar on your real data), value (does clearing that bar move a measurable business metric), cost-justification (does the value exceed the fully loaded run cost including human review and tokens), and scale (will it hold up at production volume and governance load). Each gate has a pass condition and a kill-threshold defined before work starts. The point is to fail cheap projects early, at gate one or two, instead of discovering at month nine that a technically working system was never worth building.

When should you kill an AI project?

Kill an AI project the moment it crosses a kill-threshold you defined before the work began. Concretely: if a feasibility pilot can't reach the accuracy bar on representative data within the time-box, kill it. If it hits the accuracy bar but the value gate shows no measurable movement in the target metric, kill it. If the fully loaded cost per successful outcome exceeds the manual baseline it was meant to beat, kill it. The discipline is deciding the threshold in advance, because in the moment every project has a sympathetic reason to continue. Across initiatives, the average abandoned project burns $7.2M in sunk cost, most of it spent after the point a pre-committed kill-threshold would have stopped it.

What is POC purgatory and how do you escape it?

POC purgatory is the state where an AI proof of concept neither ships to production nor gets cancelled, so it consumes budget and attention indefinitely. A March 2026 survey of 650 leaders captures it precisely: 78% are running agent pilots but only 14% have reached production. You escape it two ways. First, define the production success metric and the kill-threshold on day one, so there is an objective trigger to ship or stop. Second, time-box each gate; a feasibility pilot that runs longer than four to six weeks is usually avoiding a decision, not gathering data. Purgatory is rarely a technical problem. It is the absence of a pre-committed decision rule.

What stage-gate criteria should an AI POC meet before production?

Before production, an AI POC should clear four stage-gate criteria with numbers attached. Gate one, feasibility: the system meets a defined accuracy or task-completion bar on a held-out sample of your real data, not vendor demo data. Gate two, value: clearing that bar moves a named business metric (tickets resolved per hour, days to close, error rate) by a pre-stated amount. Gate three, cost-justification: the fully loaded cost per successful outcome, including tokens, human review, and infrastructure, beats the baseline it replaces. Gate four, scale: latency, reliability, and governance hold at production volume. A project that passes all four with documented numbers is a fundable production system. One that skips a gate is a future write-off.

How much does a failed AI pilot actually cost?

A failed AI pilot costs far more than the engineering hours. Across abandoned initiatives the average sunk cost is roughly $7.2M, and that figure understates the real damage because it ignores opportunity cost: the senior engineers, the data access, and the executive attention that a dead project consumed instead of a viable one. There is also a credibility cost. With 98% of boards now demanding demonstrated AI ROI and 71% of CIOs expecting budget cuts if mid-2026 targets are missed, every failed pilot makes the next funding ask harder. The cheapest failed pilot is one killed at gate one in week three. The most expensive is the one that limps to month nine before someone admits it was never going to clear the value gate.

Should AI pilots run on vendor demo data or your own data?

AI pilots should run on your own data, and ideally as paid pilots, because vendor demo data systematically overstates feasibility. Demo datasets are curated to make the model look good: clean inputs, unambiguous labels, the happy path. Your production data has the messy edge cases, the inconsistent formats, and the long tail that actually determine whether a system clears the value gate. A paid pilot on your own data also forces the vendor to engage with real constraints and gives you contractual leverage to demand the held-out evaluation that gate one requires. The extra cost of a paid, on-your-data pilot is trivial against the $7.2M average sunk cost of a project that passed a demo and then failed in production.

Killing the Pilot: An ROI Gate Framework for AI POCs

80.3% of AI projects miss business value and 95% of GenAI pilots never scale. The stage-gate framework with kill-thresholds for which POCs reach production.

Sebastian Mondragon · JUNE 02, 2026 · 10 MIN READ

RAND puts the failure rate of AI projects at 80.3%. MIT Sloan, looking specifically at generative AI, puts the share of pilots that never reach production at 95%. A March 2026 survey of 650 leaders found 78% running agent pilots and only 14% in production. Whichever number you trust, the conclusion is the same: the default outcome of an AI pilot is failure, and the question worth asking is not "how do we run more pilots" but "how do we kill the bad ones faster." That is what an AI pilot ROI framework is for, and it is the single most underbuilt piece of most AI programs.

The reason pilots fail is not usually the model. It is the absence of a measurable business objective tied to the project from day one, which means there is never an honest moment to declare it done, dead, or fundable. Without a stated bar, a pilot that "works" technically drifts into POC purgatory: too promising to cancel, too unproven to ship, quietly consuming budget while everyone waits for someone else to make the call. The failure data backs this up. Of failed projects, 33.8% are abandoned before production, 28.4% complete but deliver no value, and 18.1% can't justify their cost. Each of those buckets is a gate that was never built.

This post lays out the framework I use to decide which POCs deserve production budget and which deserve a clean death: a four-gate sequence (feasibility, value, cost-justification, scale), each with a pass condition and a kill-threshold set before the work starts. It is opinionated on purpose. The board pressure is real: 98% of boards now demand demonstrated AI ROI and 71% of CIOs expect budget cuts on missed mid-2026 targets, so a vague "we're piloting AI" answer no longer survives contact with a finance review.

01 · The Data on Why AI Pilots Die

Start with the numbers, because they reframe the problem. The instinct in most organizations is to treat a failed pilot as a surprise, a bet that didn't pay off. The aggregate data says the opposite: failure is the base rate, and a pilot that succeeds is the exception that needs explaining.

Three figures anchor this. RAND's analysis lands at 80.3% of AI projects failing to deliver business value. MIT Sloan's GenAI-specific work lands at 95% of pilots never scaling. The March 2026 survey of 650 leaders shows the gap concretely: 78% have agent pilots running, 14% have anything in production. The 64-point drop between "we started" and "it shipped" is POC purgatory rendered as a statistic.

The more useful data is the breakdown of how they die, because each failure mode maps to a missing gate:

Failure mode	Share of failed projects	The gate that was missing
Abandoned pre-production	33.8%	Feasibility (never cleared the bar)
Complete but no business value	28.4%	Value (no metric tied to it)
Can't justify the cost	18.1%	Cost-justification (run cost ignored)

That table is the entire thesis. A third of failures are projects that never worked and were allowed to run anyway. Another quarter worked perfectly and moved no business number, because no business number was ever named. Roughly a fifth produced value but cost more to run than the value was worth, because token spend, human review, and infrastructure were treated as free during the pilot. None of these are model problems. They are decision-process problems, and a decision process is exactly what stage gates impose.

The economics make the case for that discipline unanswerable. Across abandoned initiatives, the average sunk cost is roughly $7.2M. Successful projects, by contrast, return a median 188% ROI. The spread between those two outcomes is not luck. It is whether the project was forced to clear measurable bars at each stage or allowed to coast on demo enthusiasm until the money ran out. For a related pattern, our breakdown of why 95% of so-called AI agents are expensive chatbots shows the same gap between demo and production showing up in agent procurement specifically.

02 · Root Cause: No Measurable Business Objective on Day One

If you fix one thing, fix this. The dominant root cause of pilot failure is that no measurable business objective was attached to the project before work began. Everything downstream, the drift, the purgatory, the inability to justify cost, flows from that single omission.

A measurable objective is not "improve customer support with AI." It is "reduce median time-to-resolution on tier-1 tickets from 14 hours to under 4 hours while holding CSAT flat or better." The second version has a number, a baseline, a direction, and an implicit kill-threshold: if the pilot can't get under, say, 8 hours, it isn't worth scaling. The first version can never fail, which is exactly why it can never succeed. A project that cannot fail has no decision point, and a project with no decision point lives forever.

The test I apply: before a pilot gets budget, someone must be able to complete the sentence "this pilot is a success if _ and a failure if _," using a number that already has a baseline. If nobody can fill both blanks, the pilot is not ready to start. It is ready for a thirty-minute scoping conversation that produces those blanks, and that conversation is the cheapest risk reduction available. This is the same discipline we argue for in building evaluation datasets for business AI: you cannot manage what you have not committed to measuring against a held-out baseline.
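One way to make the two blanks concrete is to write the objective down as data before any budget moves. The sketch below is illustrative only: the `PilotObjective` class and its field names are hypothetical, and the numbers are the tier-1 support example from this section (14-hour baseline, success under 4 hours, not worth scaling at 8 hours or worse).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotObjective:
    """One metric, its baseline, and both blanks of the success/failure sentence."""
    metric: str
    baseline: float
    success_at: float           # success once the metric is strictly better than this
    kill_at: float              # failure if the metric is no better than this
    lower_is_better: bool = True

    def verdict(self, observed: float) -> str:
        # Flip the sign for lower-is-better metrics so "better" always means "larger".
        sign = -1.0 if self.lower_is_better else 1.0
        if sign * observed > sign * self.success_at:
            return "success"
        if sign * observed <= sign * self.kill_at:
            return "failure"
        return "undecided"  # between the two bars: a judgment call, not an automatic pass


# Hypothetical encoding of the tier-1 support example (hours to resolution).
objective = PilotObjective(
    metric="median time-to-resolution, tier-1 tickets (hours)",
    baseline=14.0,
    success_at=4.0,
    kill_at=8.0,
)
print(objective.verdict(3.5))  # success
print(objective.verdict(9.0))  # failure
```

If nobody can supply `baseline`, `success_at`, and `kill_at` for a proposed pilot, that is the signal to hold the scoping conversation first.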

03 · The Four-Gate Framework

The framework is a sequence, not a checklist. Each gate is cheaper to clear than the next, so the design forces expensive failures to surface at cheap stages. You do not proceed to value testing until feasibility passes, and you do not spend on scale engineering until cost-justification passes. The whole point is to spend the least money required to reach a confident no.

Gate 1: Feasibility on Your Own Data

Feasibility asks one question: on a held-out sample of your actual production data, can the system clear a defined accuracy or task-completion bar? Two words carry the weight: your data and held-out. Vendor demo data is curated to pass; your data has the malformed inputs, the ambiguous cases, and the long tail that decide real outcomes. A held-out sample, scored after the build rather than tuned against, is the only honest feasibility signal. Time-box this gate hard; four to six weeks is plenty for a feasibility read. A feasibility pilot that runs longer is usually avoiding the no. The kill-threshold here is the cheapest one you will ever set: if the system can't reach the bar on representative data within the box, stop. You have spent weeks, not the $7.2M that comes from carrying an infeasible project to month nine.
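As a rough sketch of how the feasibility read might be scored at the end of the time-box: the function name is hypothetical, and the 90% pass bar and 80% kill bar are the example thresholds from the gate table in this post, not universal values. It assumes each held-out case has already been marked as completed correctly or not.

```python
def feasibility_gate(held_out_outcomes, pass_bar=0.90, kill_bar=0.80):
    """Score a held-out sample; each outcome is True if the task was completed correctly."""
    if not held_out_outcomes:
        raise ValueError("feasibility needs a non-empty held-out sample")
    rate = sum(1 for ok in held_out_outcomes if ok) / len(held_out_outcomes)
    if rate >= pass_bar:
        verdict = "pass"
    elif rate < kill_bar:
        verdict = "kill"
    else:
        # Between the bars: neither pre-committed outcome fired, so a human decides.
        verdict = "inconclusive"
    return rate, verdict


# Hypothetical held-out sample: 172 of 200 cases completed correctly.
rate, verdict = feasibility_gate([True] * 172 + [False] * 28)
print(f"{rate:.0%} task completion -> {verdict}")
```

The important property is that the sample is scored once, after the build, so the number cannot be tuned toward the bar.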

Gate 2: Value, Not Accuracy

Gate 2 is where 28.4% of failed projects die in hindsight, because they cleared feasibility and nobody asked whether feasibility mattered. A model can hit 94% accuracy and move zero business metrics if the 6% it misses are the only cases that were ever expensive, or if the task it automates was never the bottleneck. The value gate requires that clearing the feasibility bar produces a pre-stated movement in a named business metric: tickets resolved per agent-hour, days to close, defect escape rate, revenue per rep. If the metric doesn't move by the amount you committed to, the project fails the value gate even though the demo looked great. This is the gate teams most want to skip and least can afford to.
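The accuracy-versus-value gap is easy to see with cost-weighted scoring. The sketch below uses made-up numbers: 94 cheap cases the model handles and 6 expensive ones it misses. The function name and the dollar figures are assumptions for illustration, not data from any real pilot.

```python
def accuracy_and_value_share(cases):
    """cases: iterable of (handled_correctly, manual_cost) pairs from the pilot sample."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one case")
    total_cost = sum(cost for _, cost in cases)
    if total_cost <= 0:
        raise ValueError("total manual cost must be positive")
    accuracy = sum(1 for ok, _ in cases if ok) / len(cases)
    # Share of the manual cost that the correctly handled cases actually represent.
    value_share = sum(cost for ok, cost in cases if ok) / total_cost
    return accuracy, value_share


# Hypothetical sample: 94 routine cases at $2 each handled, 6 hard cases at $400 each missed.
sample = [(True, 2.0)] * 94 + [(False, 400.0)] * 6
accuracy, value_share = accuracy_and_value_share(sample)
print(f"accuracy={accuracy:.0%}, share of manual cost removed={value_share:.1%}")
```

Under these assumed costs the model is 94% accurate yet removes only about 7% of the manual cost, which is the pattern the value gate exists to catch.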

Gate 3: Cost-Justification at Loaded Cost

Gate 3 kills the 18.1% that work and add value but cost more than they save. The mistake is computing value against a pilot's "free" run cost. At production volume, the fully loaded cost includes token spend, the human review that most production AI still requires, retries, infrastructure, and the engineering time to keep it running. The gate is simple: cost per successful outcome must beat the baseline the system replaces. If a support deflection costs $1.40 in loaded AI spend and the human path costs $0.90, you have an expensive automation, not a win. Build this math before scale, not after the invoice. Our build vs buy AI decision guide covers the loaded-cost modeling that separates a real saving from a vanity one.
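The loaded-cost arithmetic can be sketched in a few lines. The cost lines and monthly figures below are hypothetical, chosen only so the result lands on the $1.40 versus $0.90 example above; the function names are likewise illustrative.

```python
def loaded_cost_per_outcome(successful_outcomes, **cost_lines):
    """Total of every cost line divided by the number of successful outcomes."""
    if successful_outcomes <= 0:
        raise ValueError("need at least one successful outcome")
    return sum(cost_lines.values()) / successful_outcomes


def cost_gate(loaded_cost, baseline_cost):
    """Pass only if the loaded cost is strictly below the baseline it replaces."""
    return loaded_cost < baseline_cost


# Hypothetical month of support deflection: 10,000 successful deflections.
loaded = loaded_cost_per_outcome(
    10_000,
    tokens=4_000.0,
    human_review=6_000.0,
    retries=500.0,
    infrastructure=1_500.0,
    engineering=2_000.0,
)
print(f"loaded cost per deflection: ${loaded:.2f}")
print("passes cost gate:", cost_gate(loaded, baseline_cost=0.90))
```

Dividing by successful outcomes rather than total requests matters: failed attempts still cost tokens and review time, so they raise the cost of each success.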

Gate 4: Scale and Governance

Gate 4 is the production-readiness gate, and it is where reliability, latency, and governance get tested at full load. A pilot that passes the first three gates at 50 requests a day can still fail at 50,000: latency degrades, reliability drops below the threshold the business actually needs, or an audit requirement that was waved through in the pilot becomes a hard blocker. Reliability is the usual culprit, and it is worth measuring separately from accuracy; the gap between "passes evals" and "behaves in production" is the subject of our piece on why reliability lags accuracy in production agents. If there is a reliability or compliance gap with no clear path to close it, gate 4 is a kill, even this late. A late kill is painful, but shipping an unreliable system is worse.
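A minimal sketch of a gate 4 check against load-test results follows. It assumes you have per-request latencies and a success count from a run at production volume; the choice of p95 as the latency measure, the thresholds, and the function names are all assumptions to replace with whatever the business actually requires.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("need at least one value")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct * n / 100) in integer math
    return ordered[rank - 1]


def scale_gate(latencies_ms, successes, total, audit_passed, max_p95_ms, min_reliability):
    """Return the list of failed checks; an empty list means the gate is cleared."""
    failures = []
    if percentile_nearest_rank(latencies_ms, 95) > max_p95_ms:
        failures.append("latency")
    if total <= 0 or successes / total < min_reliability:
        failures.append("reliability")
    if not audit_passed:
        failures.append("governance")
    return failures


# Hypothetical load-test summary with made-up thresholds.
failures = scale_gate(
    latencies_ms=list(range(1, 101)),
    successes=950,
    total=1000,
    audit_passed=True,
    max_p95_ms=100,
    min_reliability=0.99,
)
print(failures or "gate 4 cleared")
```

Reliability is reported as its own failure rather than folded into an accuracy score, which keeps the two measurements separate as argued above.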

Gate	Question it answers	Pass condition (example)	Kill-threshold (example)
1. Feasibility	Can it hit the accuracy bar on our real data?	>=90% task completion on held-out sample	<80% after the time-box, or unfixable edge cases
2. Value	Does hitting that bar move a business metric?	Target metric moves by the pre-stated amount	No measurable movement, or movement < threshold
3. Cost-justification	Does the value exceed the fully loaded run cost?	Cost per successful outcome beats the baseline	Loaded cost per outcome >= manual baseline
4. Scale	Does it hold at production volume and governance?	Latency, reliability, audit all pass at full load	Reliability or compliance gap with no clear fix

04 · Set the Kill-Threshold Before You Start

The hardest part of this framework is not technical. It is committing to the kill-threshold while you are still optimistic, before sunk cost and team attachment make every project worth "just one more sprint."

A kill-threshold is a number plus a deadline, agreed in writing before the work starts. "If feasibility is below 80% on held-out data by week five, we stop." "If the value gate shows under a 20% improvement in time-to-resolution, we stop." The reason to write it down in advance is psychological, not procedural. In the moment, every struggling project has a sympathetic story: the model is almost there, the next data batch will fix it, the edge cases are rare. Those stories are how a project that should have died at gate one reaches month nine and the $7.2M average. The pre-committed threshold is the only thing that reliably beats the sunk-cost reflex, because it moves the decision out of the emotional moment and into a contract with your past, more objective self.
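Since a kill-threshold is a number plus a deadline, it fits naturally into an immutable record written before the work starts. The sketch below is one possible shape, not a prescribed one: the class name, field names, and the specific date are hypothetical, and it assumes a higher-is-better metric such as held-out task completion.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: the threshold cannot be quietly edited mid-pilot
class KillThreshold:
    metric: str
    minimum: float   # stop if the observed value is below this number...
    deadline: date   # ...on or after this date

    def should_stop(self, observed: float, today: date) -> bool:
        return today >= self.deadline and observed < self.minimum


# "If feasibility is below 80% on held-out data by week five, we stop."
# The calendar date standing in for "week five" is an assumption.
threshold = KillThreshold(
    metric="held-out task completion",
    minimum=0.80,
    deadline=date(2026, 7, 7),
)
print(threshold.should_stop(observed=0.78, today=date(2026, 7, 7)))  # True
print(threshold.should_stop(observed=0.78, today=date(2026, 6, 20)))  # False
```

Freezing the record is a small nod to the psychology described above: moving the bar requires creating a new, visible threshold rather than adjusting the old one in place.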

This is also where opportunity-cost accounting belongs. A pilot does not just cost its own budget; it costs the next pilot that didn't get those engineers, that data access, that executive sponsorship. When you carry a dead project, you are not being patient, you are spending the budget of a viable one. The kill-threshold protects the portfolio, not just the line item.

05 · Make It a Paid Pilot With Structured Review

Two practices sharpen every gate above. First, run paid pilots, not free ones. A paid pilot, even a small one, forces both sides to treat it as real: the vendor engages with your actual constraints, and you gain the contractual leverage to demand the held-out evaluation gate 1 requires. Free pilots optimize for impressiveness; paid pilots optimize for the production decision, because money is on the line for everyone.

Second, build structured human review into the pilot from the start. Across systems we have reviewed, the projects that clear the value and cost gates honestly are the ones where humans systematically scored a sample of outputs against a rubric, rather than eyeballing a demo and declaring it good. Structured review is what turns "it feels accurate" into the held-out number gate 1 needs, and it is what surfaces the expensive-edge-case problem that quietly kills the value gate. The same human-in-the-loop discipline carries into production as a fallback layer; our framework for choosing between AI, rules, and humans as the decision layer maps where that review should sit permanently.

This is the work Particula Tech runs as a fixed-scope diagnostic: a paid, time-boxed pilot on your own data with a structured evaluation rubric, ending in a documented go/no-go against the four gates rather than another optimistic deck. The deliverable is a decision, which is the one artifact POC purgatory never produces on its own.

06 · Run It Against the Board Pressure, Not Around It

The framework also happens to be the cleanest answer to the board question every AI program now faces. With 98% of boards demanding demonstrated ROI and 71% of CIOs expecting budget cuts on missed mid-2026 targets, "we have twelve pilots running" is no longer a reassuring sentence. It reads as twelve unbounded liabilities.

A gated portfolio reads completely differently. "We have twelve pilots; four cleared the value gate and are in cost-justification, three were killed at feasibility for a combined cost of six weeks, and the rest are pre-gate" is a sentence that survives a finance review, because it shows a process that produces decisions and contains losses. The median 188% ROI on successful projects is achievable precisely because the gates concentrate spend on the projects that earn it and starve the ones that don't. The deeper strategic context, how AI initiatives map to real ROI rather than perpetual cost, sits in our AI for Business pillar.

The uncomfortable truth in the failure data is that most organizations already have the pilots they need to win; they are just unwilling to kill the ones they need to lose. The framework does not make AI projects succeed. It makes them fail cheaply and early, which, given an 80.3% base rate, is the same thing as winning.

07 · FAQ

Quick answers to the questions this post tends to raise.
