Princeton's Narayanan-Kapoor team (Feb 2026) showed AI agent accuracy has risen roughly 2x faster than reliability since 2023, and on customer service benchmarks the accuracy-to-reliability gap is now 7x. Compounding math makes it worse: 90% step accuracy across a 10-step workflow collapses to 35% task success. Measure four dimensions — consistency, robustness, predictability, safety — and instrument consistency@10 and p95 task success, not pass@1. For customer-facing workflows, a 70% consistent agent beats an 85% variable one every time.
A fintech client of ours ran an agent through their evaluation suite last quarter. 89% task accuracy on 500 test cases. Board presentation the following week. Green light for production rollout.
Six weeks into deployment, the support queue was on fire. The agent that passed 89% of eval tasks was completing only 43% of real customer conversations end-to-end. Same tool calls. Same prompts. Same model. The 89% number didn't lie — it just measured the wrong thing.
That gap has a name now. In February 2026, Princeton's Arvind Narayanan and Sayash Kapoor (with Rabanser, Kirgis, Liu, and Utpala) published "Towards a Science of AI Agent Reliability" — 66 pages of data showing that since 2023, agent accuracy benchmarks have climbed roughly 2x faster than reliability. On customer service tasks specifically, the accuracy-to-reliability gap is now 7x. This is the single most important chart in agent evaluation right now, and almost nobody in procurement or engineering is looking at it.
This post breaks down why that gap exists, why compounding math makes it catastrophic in production, and exactly what to measure if you want your agent to behave in the wild the way it behaved in your eval harness.
The Chart That Broke Conventional Agent Evaluation
Narayanan and Kapoor's team at Princeton's Center for Information Technology Policy ran the same agents on the same tasks multiple times — something almost no published benchmark does — and plotted accuracy (average single-run success) against reliability (consistency of success across repeated runs).
The headline result: on general-purpose agents, accuracy has risen roughly 2x faster than reliability since 2023. On customer service benchmarks, the ratio jumps to 7:1. Agents are getting rapidly better at their peak while barely getting more consistent at their floor.
Most benchmarks report a single number: what percentage of tasks the agent completed at least once. That number has been improving steadily. It's what vendors put in decks and what appears in blog posts. But when the same agent runs the same task ten times, success rates fluctuate — often by 30-50 percentage points on identical inputs. Accuracy is the peak. Reliability is the floor. Your users live on the floor.
The reason reliability has stagnated while accuracy has surged is architectural. Larger models and better scaffolding push the peak up by expanding what the agent can do. But variance is determined by sampling, tool-call non-determinism, retrieval order, and context composition — none of which have gotten meaningfully more deterministic. As our analysis of scaffolding beating model upgrades on SWE-Bench showed, the winners in 2026 aren't picking better models, they're picking harnesses that reduce variance.
The Compounding Math Nobody Budgets For
Take an agent with 90% step accuracy. Sounds great. Now chain it across a 10-step customer service workflow:
0.90 ^ 10 = 0.3487
34.87% end-to-end task success if steps are independent. And that's the optimistic case — real workflows have error propagation, where a wrong classification at step 3 guarantees an irrelevant tool call at step 7.
Every team building multi-step agents should have this table taped to their monitor:
A "fantastic" 85% step accuracy on an eval set collapses to 20% task success on a 10-step workflow. The fintech agent I opened with looked fine in isolation and disastrous in production because the eval measured per-step accuracy while users experienced cumulative reliability across an 8- to 12-step resolution flow.
This is why agent loops and reasoning steps need aggressive optimization — every extra step multiplies failure probability. Reducing an 8-step flow to 5 steps buys you more end-to-end reliability than a 5-point bump in step accuracy.
| Step accuracy | 5 steps | 10 steps | 20 steps | 30 steps |
|---|---|---|---|---|
| 95% | 77.4% | 59.9% | 35.8% | 21.5% |
| 90% | 59.0% | 34.9% | 12.2% | 4.2% |
| 85% | 44.4% | 19.7% | 3.9% | 0.8% |
| 80% | 32.8% | 10.7% | 1.2% | 0.1% |
| 75% | 23.7% | 5.6% | 0.3% | 0.02% |
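The table is just exponentiation. A quick sketch to reproduce it (the `end_to_end_success` helper is illustrative, not from the paper):

```python
# End-to-end success if each of n independent steps succeeds
# with probability step_acc.
def end_to_end_success(step_acc: float, n_steps: int) -> float:
    return step_acc ** n_steps

# Print the compounding table row by row.
for acc in (0.95, 0.90, 0.85, 0.80, 0.75):
    row = [end_to_end_success(acc, n) for n in (5, 10, 20, 30)]
    print(f"{acc:.0%}: " + "  ".join(f"{p:6.1%}" for p in row))
```

The independence assumption makes this the optimistic bound; correlated errors only make the real numbers worse.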
The Four Dimensions of Reliability
The Princeton paper defines 12 metrics across four dimensions. I've found the four-dimension framing useful for deciding what to instrument. Most production agent dashboards cover a slice of the first dimension and none of the others.
Consistency
Does the same input produce the same (or equivalent) output across runs? Measure by running each canonical task N=10 times. Report pass@1, pass@k, and — more importantly — consistency@k: the fraction of runs that succeed. A task at 9/10 consistency is production-healthy. A task at 5/10 is a coin flip even if the average looks identical.
Robustness
Does the agent handle perturbations that shouldn't matter? Paraphrase the prompt. Swap synonyms. Reorder tools in the schema. Change the system message's whitespace. Small changes often flip outcomes. Robustness failure is where vendor demos collapse when you deviate one inch from the scripted input.
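A minimal way to probe this is a flip-rate check: apply perturbations that should not change the answer and count how often the outcome flips. This sketch assumes the same hypothetical `agent.run`/`task.grade` interface as the wrapper later in this post, and the three perturbations are stand-ins, not an exhaustive set:

```python
import random

def perturb(prompt: str, rng: random.Random) -> str:
    """Apply one perturbation that should not change the answer."""
    choice = rng.randrange(3)
    if choice == 0:
        return prompt + "  "                       # trailing whitespace
    if choice == 1:
        return prompt.replace("please", "kindly")  # synonym swap
    return "Hi! " + prompt                         # harmless preamble

def flip_rate(agent, task, n=10, seed=0):
    """Fraction of perturbed runs whose grade differs from baseline."""
    rng = random.Random(seed)
    baseline = task.grade(agent.run(task.input))
    flips = sum(
        task.grade(agent.run(perturb(task.input, rng))) != baseline
        for _ in range(n)
    )
    return flips / n
```

A robust agent scores near zero; vendor-demo agents often do not.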
Predictability
Can you forecast failure modes? Does the agent fail gracefully with an explicit "I don't know" or silently with a confident wrong answer? Predictability is about the shape of failure, not its frequency. A 30% failure rate where all failures abstain with "I cannot complete this task" is operationally manageable. A 10% failure rate where every failure produces a confidently wrong refund amount is a crisis.
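One way to instrument failure shape is to classify every failed run as an explicit abstention or a silent wrong answer. The marker strings below are illustrative placeholders; real abstention detection needs tuning to your agent's actual phrasing:

```python
from collections import Counter

# Placeholder abstention markers; tune to your agent's refusal style.
ABSTAIN_MARKERS = (
    "i don't know",
    "i cannot complete this task",
    "unable to help with",
)

def failure_shape(output: str, correct: bool) -> str:
    """Classify a run: success, explicit abstain, or silent wrong answer."""
    if correct:
        return "success"
    text = output.lower()
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return "abstained"        # graceful, operationally manageable
    return "confident_wrong"      # the dangerous shape

def shape_report(runs):
    """runs: iterable of (output, correct) pairs -> counts per shape."""
    return Counter(failure_shape(out, ok) for out, ok in runs)
```

Track the `confident_wrong` count over time; it is the number that turns a failure rate into a crisis.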
Safety
Do failures remain bounded? A bad answer is recoverable. A destructive tool call — DROP TABLE, an email to the wrong customer, a refund issued in the wrong currency — often is not. Safety is the tail-risk dimension, and it's why AI agent production safety deserves its own engineering discipline. Consistency metrics won't catch it; you need explicit bounded-damage guarantees on every tool. For deeper coverage of the broader landscape this framework sits in, see our AI agents pillar page.
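As a sketch of a bounded-damage guarantee, a tool wrapper can make destructive calls structurally impossible rather than merely unlikely. The regex and the commented `execute_query` name are assumptions for illustration, not a complete SQL safety policy:

```python
import re

# Illustrative guard only; a real policy needs a parser, not a regex.
DESTRUCTIVE_SQL = re.compile(r"\b(drop|truncate|delete)\b", re.IGNORECASE)

class BoundedTool:
    """Wrap a tool so validators run before every call."""
    def __init__(self, fn, validators):
        self.fn = fn
        self.validators = validators  # each raises on unsafe args

    def __call__(self, *args, **kwargs):
        for validate in self.validators:
            validate(*args, **kwargs)
        return self.fn(*args, **kwargs)

def forbid_destructive_sql(query: str):
    if DESTRUCTIVE_SQL.search(query):
        raise PermissionError(f"blocked destructive statement: {query!r}")

# Usage sketch (execute_query is a hypothetical tool function):
# run_sql = BoundedTool(execute_query, [forbid_destructive_sql])
```

The point is architectural: the model never gets the chance to be wrong about a destructive action, because the boundary sits outside the model.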
A 30-Line Wrapper for Consistency Measurement
Here's the minimum-viable pattern for pulling real reliability numbers out of any agent. Wrap the agent with a runner that repeats each task N times and computes the distribution:
```python
import statistics
from collections import defaultdict

def run_reliability_suite(agent, tasks, n_runs=10):
    results = defaultdict(list)
    for task in tasks:
        for _ in range(n_runs):
            output = agent.run(task.input)
            results[task.id].append(task.grade(output))  # bool or 0..1

    summary = {}
    for task_id, runs in results.items():
        passes = sum(1 for r in runs if r)
        summary[task_id] = {
            "pass@1": bool(runs[0]),
            "pass@k": passes > 0,
            "consistency": passes / len(runs),
        }

    consistencies = [v["consistency"] for v in summary.values()]
    return {
        "mean_task_success": statistics.mean(consistencies),
        "p50_task_success": statistics.median(consistencies),
        # p95 task success: the level that 95% of tasks meet or exceed,
        # i.e. the 5th-percentile cut point (the floor, not the peak).
        "p95_task_success": statistics.quantiles(consistencies, n=20)[0],
        "per_task": summary,
    }
```

What this gives you that pass@1 never does: a per-task consistency score that exposes the coin-flip tasks, and distribution statistics (mean, median, and the p95 floor) instead of a single flattering number.
Run this on every prompt change. Run it nightly against a fixed task suite. Put the p95 line on the same dashboard as your latency and cost metrics — our AI production monitoring guide covers where this fits in the broader observability stack.
The grading function task.grade(output) is where the hard work lives. For factual tasks it's exact match. For subjective tasks it's an LLM-as-judge score — see our practical patterns in regression testing non-deterministic AI with LLM-as-judge and the broader framing in testing AI systems when there's no right answer. This wrapper is useless without a trustworthy grader.
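Two minimal grader shapes, as a sketch: exact match for factual tasks and a thresholded judge for subjective ones. The `judge` callable stands in for whatever LLM-as-judge call you trust; its signature and the 0.8 threshold are assumptions for illustration:

```python
def exact_match_grader(expected: str):
    """Grader for factual tasks: normalized exact match."""
    def grade(output: str) -> bool:
        return output.strip().lower() == expected.strip().lower()
    return grade

def judge_grader(judge, rubric: str, threshold: float = 0.8):
    """Grader for subjective tasks. `judge` is any callable scoring
    (rubric, output) in [0, 1], e.g. an LLM-as-judge call (assumed)."""
    def grade(output: str) -> bool:
        return judge(rubric, output) >= threshold
    return grade
```

Either factory returns a function matching the `task.grade(output)` shape the wrapper expects.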
The Procurement Checklist: Six Questions Before You Sign
If you're evaluating a vendor-built agent, these six questions separate serious vendors from those who only optimized for benchmark numbers. Ask them in order. The answer quality drops off a cliff around question three for vendors who've never measured reliability.

1. What is your benchmark accuracy, and how many runs per task does it average over? A single-run pass@1 number is a red flag on its own.
2. What is consistency@10: the fraction of repeated runs that succeed on each task, not the fraction of tasks that succeed at least once?
3. What is your p95 task success, the floor that 95% of tasks clear?
4. How do results change under paraphrase, tool reordering, and formatting perturbations of the same input?
5. When the agent fails, does it abstain explicitly or return a confident wrong answer, and what fraction of failures are silent?
6. What structural guarantees bound destructive tool calls? DROP TABLE must be impossible, not just unlikely. "We trust the model" is the wrong answer.

A vendor that answers "85% accuracy on our benchmark" and nothing else is selling a peak and leaving you to discover the floor.
When 70% Consistent Beats 85% Variable
Here's the operational decision I've faced with clients more than once: given two agents with equivalent cost and latency, which do you ship?
| Dimension | Agent A (accurate but variable) | Agent B (consistent but modest) |
|---|---|---|
| Mean accuracy | 85% | 70% |
| Per-task consistency@10 | 20% - 100% (wide) | 60% - 80% (tight) |
| P95 task success | 42% | 65% |
| Shape of failure | Confident wrong | Explicit abstain |
| Variance across paraphrase | High | Low |

For internal tools where humans review every output, Agent A wins — the successful 85% is high quality and humans catch the variance. For customer-facing workflows where the output reaches the user directly, Agent B wins every time. Predictable mediocrity is manageable. Unpredictable excellence is not.

The decision framework: ask what operational cost a variable outcome imposes. If the cost of failure is "the user retries and it works next time," optimize for mean accuracy. If the cost is "we send a refund to the wrong customer and field a regulatory complaint," optimize for consistency — even at a 15-point accuracy penalty. The Princeton paper's quiet corollary is that "better" depends on where on the peak-vs-floor curve your users live.

This is also the trade-off that makes evals-driven development so valuable in practice. Our walkthrough of evals-driven development in practice covers how to gate every change on eval results — the discipline is what keeps your p95 from silently eroding release over release.
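A small release gate in that spirit, assuming the summary dict returned by the `run_reliability_suite` wrapper earlier in this post; the 2-point tolerance is an arbitrary example, not a recommendation:

```python
def gate_release(baseline: dict, candidate: dict,
                 max_p95_drop: float = 0.02) -> bool:
    """Block a prompt/model change if the reliability floor eroded,
    even when mean accuracy improved. Inputs are the summary dicts
    produced by run_reliability_suite (assumed available)."""
    drop = baseline["p95_task_success"] - candidate["p95_task_success"]
    return drop <= max_p95_drop
```

Wire this into CI so a change that lifts the mean while sinking the floor fails the build instead of shipping.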
What to Do Monday Morning
If you have an agent in production today, three things are worth doing this week:

1. Re-run a fixed suite of representative tasks 10 times each and compute per-task consistency; the 30-line wrapper above is enough.
2. Put consistency@10 and p95 task success on the same dashboard as your latency and cost metrics.
3. Classify recent failures by shape (explicit abstentions versus confident wrong answers) and confirm every destructive tool call has an explicit bounded-damage guard.
None of this requires new tooling. Most teams already have the logs. They just never aggregated them with reliability as the target metric.
The Real Insight
The Princeton paper's core contribution isn't that models are bad — they're clearly better every quarter. It's that we've been benchmarking the wrong number. Accuracy measures peak capability in a controlled setting. Reliability measures what your users actually experience at 3am on a Tuesday when the input is a typo-ridden paraphrase of a question the eval suite never saw.
The 7x gap on customer service benchmarks means your agent is probably far better in your eval suite than in your customer's hands — and the gap isn't closing with scale. It's widening, because accuracy is what the industry optimizes and reliability is what the industry ignores.
The teams shipping useful agents in 2026 aren't the ones with the highest eval scores. They're the ones who measured the floor, not just the peak, and built systems where the floor is high enough to trust. That starts with consistency@10 and p95 task success on your dashboard this week.
Frequently Asked Questions
What is the difference between AI agent accuracy and reliability?

Accuracy measures average success rate on a single run across a set of tasks. Reliability measures whether the agent produces the same result when you run the same task multiple times. Princeton's February 2026 study found accuracy has climbed roughly 2x faster than reliability since 2023, with the accuracy-to-reliability gap on customer service benchmarks now at 7x. An 89% accuracy agent can complete only 43% of real conversations end-to-end because repeated runs on identical inputs produce inconsistent outputs.