Princeton's Narayanan-Kapoor team (Feb 2026) showed AI agent accuracy has risen roughly 2x faster than reliability since 2023, and on customer service benchmarks the accuracy-to-reliability gap is now 7x. Compounding math makes it worse: 90% step accuracy across a 10-step workflow collapses to 35% task success. Measure four dimensions — consistency, robustness, predictability, safety — and instrument consistency@10 and p95 task success, not pass@1. For customer-facing workflows, a 70% consistent agent beats an 85% variable one every time.
A fintech client of ours ran an agent through their evaluation suite last quarter. 89% task accuracy on 500 test cases. Board presentation the following week. Green light for production rollout.
Six weeks into deployment, the support queue was on fire. The agent that passed 89% of eval tasks was completing only 43% of real customer conversations end-to-end. Same tool calls. Same prompts. Same model. The 89% number didn't lie — it just measured the wrong thing.
That gap has a name now. In February 2026, Princeton's Arvind Narayanan and Sayash Kapoor (with Rabanser, Kirgis, Liu, and Utpala) published "Towards a Science of AI Agent Reliability" — 66 pages of data showing that since 2023, agent accuracy benchmarks have climbed roughly 2x faster than reliability. On customer service tasks specifically, the accuracy-to-reliability gap is now 7x. This is the single most important chart in agent evaluation right now, and almost nobody in procurement or engineering is looking at it.
This post breaks down why that gap exists, why compounding math makes it catastrophic in production, and exactly what to measure if you want your agent to behave in the wild the way it behaved in your eval harness.
The Chart That Broke Conventional Agent Evaluation
Narayanan and Kapoor's team at Princeton's Center for Information Technology Policy ran the same agents on the same tasks multiple times — something almost no published benchmark does — and plotted accuracy (average single-run success) against reliability (consistency of success across repeated runs).
The headline result: on general-purpose agents, accuracy has risen roughly 2x faster than reliability since 2023. On customer service benchmarks, the ratio jumps to 7:1. Agents are getting rapidly better at their peak while barely getting more consistent at their floor.
Most benchmarks report a single number: what percentage of tasks the agent completed at least once. That number has been improving steadily. It's what vendors put in decks and what appears in blog posts. But when the same agent runs the same task ten times, success rates fluctuate — often by 30-50 percentage points on identical inputs. Accuracy is the peak. Reliability is the floor. Your users live on the floor.
The reason reliability has stagnated while accuracy has surged is architectural. Larger models and better scaffolding push the peak up by expanding what the agent can do. But variance is determined by sampling, tool-call non-determinism, retrieval order, and context composition — none of which have gotten meaningfully more deterministic. As our analysis of scaffolding beating model upgrades on SWE-Bench showed, the winners in 2026 aren't picking better models, they're picking harnesses that reduce variance.
The Compounding Math Nobody Budgets For
Take an agent with 90% step accuracy. Sounds great. Now chain it across a 10-step customer service workflow:
0.90 ^ 10 = 0.3487
34.87% end-to-end task success if steps are independent. And that's the optimistic case — real workflows have error propagation, where a wrong classification at step 3 guarantees an irrelevant tool call at step 7.
Every team building multi-step agents should have this table taped to their monitor:
A "fantastic" 85% step accuracy on an eval set collapses to 20% task success on a 10-step workflow. The fintech agent I opened with looked fine in isolation and disastrous in production because the eval measured per-step accuracy while users experienced cumulative reliability across an 8- to 12-step resolution flow.
This is why agent loops and reasoning steps need aggressive optimization — every extra step multiplies failure probability. Reducing an 8-step flow to 5 steps buys you more end-to-end reliability than a 5-point bump in step accuracy.
| Step accuracy | 5 steps | 10 steps | 20 steps | 30 steps |
|---|---|---|---|---|
| 95% | 77.4% | 59.9% | 35.8% | 21.5% |
| 90% | 59.0% | 34.9% | 12.2% | 4.2% |
| 85% | 44.4% | 19.7% | 3.9% | 0.8% |
| 80% | 32.8% | 10.7% | 1.2% | 0.1% |
| 75% | 23.7% | 5.6% | 0.3% | 0.02% |
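The table is just exponentiation. A quick sketch to reproduce it (the `end_to_end_success` helper is illustrative, not from the paper):

```python
# End-to-end success if each of n independent steps succeeds
# with probability step_acc.
def end_to_end_success(step_acc: float, n_steps: int) -> float:
    return step_acc ** n_steps

# Print the compounding table row by row.
for acc in (0.95, 0.90, 0.85, 0.80, 0.75):
    row = [end_to_end_success(acc, n) for n in (5, 10, 20, 30)]
    print(f"{acc:.0%}: " + "  ".join(f"{p:6.1%}" for p in row))
```

The independence assumption makes this the optimistic bound; correlated errors only make the real numbers worse.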
The Four Dimensions of Reliability
The Princeton paper defines 12 metrics across four dimensions. I've found the four-dimension framing useful for deciding what to instrument. Most production agent dashboards cover a slice of the first dimension and none of the others.
Consistency
Does the same input produce the same (or equivalent) output across runs? Measure by running each canonical task N=10 times. Report pass@1, pass@k, and — more importantly — consistency@k: the fraction of runs that succeed. A task at 9/10 consistency is production-healthy. A task at 5/10 is a coin flip even if the average looks identical.
Robustness
Does the agent handle perturbations that shouldn't matter? Paraphrase the prompt. Swap synonyms. Reorder tools in the schema. Change the system message's whitespace. Small changes often flip outcomes. Robustness failure is where vendor demos collapse when you deviate one inch from the scripted input.
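A minimal way to probe this is a flip-rate check: apply perturbations that should not change the answer and count how often the outcome flips. This sketch assumes the same hypothetical `agent.run`/`task.grade` interface as the wrapper later in this post, and the three perturbations are stand-ins, not an exhaustive set:

```python
import random

def perturb(prompt: str, rng: random.Random) -> str:
    """Apply one perturbation that should not change the answer."""
    choice = rng.randrange(3)
    if choice == 0:
        return prompt + "  "                       # trailing whitespace
    if choice == 1:
        return prompt.replace("please", "kindly")  # synonym swap
    return "Hi! " + prompt                         # harmless preamble

def flip_rate(agent, task, n=10, seed=0):
    """Fraction of perturbed runs whose grade differs from baseline."""
    rng = random.Random(seed)
    baseline = task.grade(agent.run(task.input))
    flips = sum(
        task.grade(agent.run(perturb(task.input, rng))) != baseline
        for _ in range(n)
    )
    return flips / n
```

A robust agent scores near zero; vendor-demo agents often do not.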
Predictability
Can you forecast failure modes? Does the agent fail gracefully with an explicit "I don't know" or silently with a confident wrong answer? Predictability is about the shape of failure, not its frequency. A 30% failure rate where all failures abstain with "I cannot complete this task" is operationally manageable. A 10% failure rate where every failure produces a confidently wrong refund amount is a crisis.
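One way to instrument failure shape is to classify every failed run as an explicit abstention or a silent wrong answer. The marker strings below are illustrative placeholders; real abstention detection needs tuning to your agent's actual phrasing:

```python
from collections import Counter

# Placeholder abstention markers; tune to your agent's refusal style.
ABSTAIN_MARKERS = (
    "i don't know",
    "i cannot complete this task",
    "unable to help with",
)

def failure_shape(output: str, correct: bool) -> str:
    """Classify a run: success, explicit abstain, or silent wrong answer."""
    if correct:
        return "success"
    text = output.lower()
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return "abstained"        # graceful, operationally manageable
    return "confident_wrong"      # the dangerous shape

def shape_report(runs):
    """runs: iterable of (output, correct) pairs -> counts per shape."""
    return Counter(failure_shape(out, ok) for out, ok in runs)
```

Track the `confident_wrong` count over time; it is the number that turns a failure rate into a crisis.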
Safety
Do failures remain bounded? A bad answer is recoverable. A destructive tool call — DROP TABLE, an email to the wrong customer, a refund issued in the wrong currency — often is not. Safety is the tail-risk dimension, and it's why AI agent production safety deserves its own engineering discipline. Consistency metrics won't catch it; you need explicit bounded-damage guarantees on every tool. For deeper coverage of the broader landscape this framework sits in, see our AI agents pillar page.
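As a sketch of a bounded-damage guarantee, a tool wrapper can make destructive calls structurally impossible rather than merely unlikely. The regex and the commented `execute_query` name are assumptions for illustration, not a complete SQL safety policy:

```python
import re

# Illustrative guard only; a real policy needs a parser, not a regex.
DESTRUCTIVE_SQL = re.compile(r"\b(drop|truncate|delete)\b", re.IGNORECASE)

class BoundedTool:
    """Wrap a tool so validators run before every call."""
    def __init__(self, fn, validators):
        self.fn = fn
        self.validators = validators  # each raises on unsafe args

    def __call__(self, *args, **kwargs):
        for validate in self.validators:
            validate(*args, **kwargs)
        return self.fn(*args, **kwargs)

def forbid_destructive_sql(query: str):
    if DESTRUCTIVE_SQL.search(query):
        raise PermissionError(f"blocked destructive statement: {query!r}")

# Usage sketch (execute_query is a hypothetical tool function):
# run_sql = BoundedTool(execute_query, [forbid_destructive_sql])
```

The point is architectural: the model never gets the chance to be wrong about a destructive action, because the boundary sits outside the model.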
A 30-Line Wrapper for Consistency Measurement
Here's the minimum-viable pattern for pulling real reliability numbers out of any agent. Wrap the agent with a runner that repeats each task N times and computes the distribution:
```python
import statistics
from collections import defaultdict

def run_reliability_suite(agent, tasks, n_runs=10):
    results = defaultdict(list)
    for task in tasks:
        for _ in range(n_runs):
            output = agent.run(task.input)
            results[task.id].append(task.grade(output))  # bool or 0..1

    summary = {}
    for task_id, runs in results.items():
        passes = sum(1 for r in runs if r)
        summary[task_id] = {
            "pass@1": bool(runs[0]),
            "pass@k": passes > 0,
            "consistency": passes / len(runs),
        }

    consistencies = [v["consistency"] for v in summary.values()]
    return {
        "mean_task_success": statistics.mean(consistencies),
        "p50_task_success": statistics.median(consistencies),
        # p95 task success: the level that 95% of tasks meet or exceed,
        # i.e. the 5th-percentile cut point (the floor, not the peak).
        "p95_task_success": statistics.quantiles(consistencies, n=20)[0],
        "per_task": summary,
    }
```

What this gives you that pass@1 never does: a per-task consistency score that exposes the coin-flip tasks, and distribution statistics (mean, median, and the p95 floor) instead of a single flattering number.
Run this on every prompt change. Run it nightly against a fixed task suite. Put the p95 line on the same dashboard as your latency and cost metrics — our AI production monitoring guide covers where this fits in the broader observability stack.
The grading function task.grade(output) is where the hard work lives. For factual tasks it's exact match. For subjective tasks it's an LLM-as-judge score — see our practical patterns in regression testing non-deterministic AI with LLM-as-judge and the broader framing in testing AI systems when there's no right answer. This wrapper is useless without a trustworthy grader.
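Two minimal grader shapes, as a sketch: exact match for factual tasks and a thresholded judge for subjective ones. The `judge` callable stands in for whatever LLM-as-judge call you trust; its signature and the 0.8 threshold are assumptions for illustration:

```python
def exact_match_grader(expected: str):
    """Grader for factual tasks: normalized exact match."""
    def grade(output: str) -> bool:
        return output.strip().lower() == expected.strip().lower()
    return grade

def judge_grader(judge, rubric: str, threshold: float = 0.8):
    """Grader for subjective tasks. `judge` is any callable scoring
    (rubric, output) in [0, 1], e.g. an LLM-as-judge call (assumed)."""
    def grade(output: str) -> bool:
        return judge(rubric, output) >= threshold
    return grade
```

Either factory returns a function matching the `task.grade(output)` shape the wrapper expects.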
The Procurement Checklist: Six Questions Before You Sign
If you're evaluating a vendor-built agent, these six questions separate serious vendors from those who only optimized for benchmark numbers. Ask them in order. The answer quality drops off a cliff around question three for vendors who've never measured reliability.

1. What is your benchmark accuracy, and how many runs per task does it average over? A single-run pass@1 number is a red flag on its own.
2. What is consistency@10: the fraction of repeated runs that succeed on each task, not the fraction of tasks that succeed at least once?
3. What is your p95 task success, the floor that 95% of tasks clear?
4. How do results change under paraphrase, tool reordering, and formatting perturbations of the same input?
5. When the agent fails, does it abstain explicitly or return a confident wrong answer, and what fraction of failures are silent?
6. What structural guarantees bound destructive tool calls? DROP TABLE must be impossible, not just unlikely. "We trust the model" is the wrong answer.

A vendor that answers "85% accuracy on our benchmark" and nothing else is selling a peak and leaving you to discover the floor.
When 70% Consistent Beats 85% Variable
Here's the operational decision I've faced with clients more than once: given two agents with equivalent cost and latency, which do you ship?
| Dimension | Agent A (accurate but variable) | Agent B (consistent but modest) |
|---|---|---|
| Mean accuracy | 85% | 70% |
| Per-task consistency@10 | 20% - 100% (wide) | 60% - 80% (tight) |
| P95 task success | 42% | 65% |
| Shape of failure | Confident wrong | Explicit abstain |
| Variance across paraphrase | High | Low |

For internal tools where humans review every output, Agent A wins — the successful 85% is high quality and humans catch the variance. For customer-facing workflows where the output reaches the user directly, Agent B wins every time. Predictable mediocrity is manageable. Unpredictable excellence is not.

The decision framework: ask what operational cost a variable outcome imposes. If the cost of failure is "the user retries and it works next time," optimize for mean accuracy. If the cost is "we send a refund to the wrong customer and field a regulatory complaint," optimize for consistency — even at a 15-point accuracy penalty. The Princeton paper's quiet corollary is that "better" depends on where on the peak-vs-floor curve your users live.

This is also the trade-off that makes evals-driven development so valuable in practice. Our walkthrough of evals-driven development in practice covers how to gate every change on eval results — the discipline is what keeps your p95 from silently eroding release over release.
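A small release gate in that spirit, assuming the summary dict returned by the `run_reliability_suite` wrapper earlier in this post; the 2-point tolerance is an arbitrary example, not a recommendation:

```python
def gate_release(baseline: dict, candidate: dict,
                 max_p95_drop: float = 0.02) -> bool:
    """Block a prompt/model change if the reliability floor eroded,
    even when mean accuracy improved. Inputs are the summary dicts
    produced by run_reliability_suite (assumed available)."""
    drop = baseline["p95_task_success"] - candidate["p95_task_success"]
    return drop <= max_p95_drop
```

Wire this into CI so a change that lifts the mean while sinking the floor fails the build instead of shipping.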
What to Do Monday Morning
If you have an agent in production today, three things are worth doing this week:

1. Re-run a fixed suite of representative tasks 10 times each and compute per-task consistency; the 30-line wrapper above is enough.
2. Put consistency@10 and p95 task success on the same dashboard as your latency and cost metrics.
3. Classify recent failures by shape (explicit abstentions versus confident wrong answers) and confirm every destructive tool call has an explicit bounded-damage guard.
None of this requires new tooling. Most teams already have the logs. They just never aggregated them with reliability as the target metric.
The Real Insight
The Princeton paper's core contribution isn't that models are bad — they're clearly better every quarter. It's that we've been benchmarking the wrong number. Accuracy measures peak capability in a controlled setting. Reliability measures what your users actually experience at 3am on a Tuesday when the input is a typo-ridden paraphrase of a question the eval suite never saw.
The 7x gap on customer service benchmarks means your agent is probably far better in your eval suite than in your customer's hands — and the gap isn't closing with scale. It's widening, because accuracy is what the industry optimizes and reliability is what the industry ignores.
The teams shipping useful agents in 2026 aren't the ones with the highest eval scores. They're the ones who measured the floor, not just the peak, and built systems where the floor is high enough to trust. That starts with consistency@10 and p95 task success on your dashboard this week.
Frequently Asked Questions
What is the difference between AI agent accuracy and reliability?

Accuracy measures average success rate on a single run across a set of tasks. Reliability measures whether the agent produces the same result when you run the same task multiple times. Princeton's February 2026 study found accuracy has climbed roughly 2x faster than reliability since 2023, with the accuracy-to-reliability gap on customer service benchmarks now at 7x. An 89% accuracy agent can complete only 43% of real conversations end-to-end because repeated runs on identical inputs produce inconsistent outputs.