What did the DORA 2025 report find about AI's impact on software delivery?

The 2025 DORA report found AI raised raw output sharply but degraded downstream quality. Across thousands of developers, PRs per developer rose 98% while bugs per developer rose 54% (up from a 9% rise the prior year), incidents per PR rose 243%, and median PR review time rose 441%, with 31% more PRs merging with no human review at all. The headline is not that AI is bad for delivery; it is that AI accelerated code creation faster than any downstream quality gate could keep up. DORA also added a new Rework Rate metric and restructured the framework around team archetypes, replacing the old low, medium, high, and elite performer clusters that the classic four metrics produced.

Do AI coding tools increase bugs and production incidents?

Yes, on average, but the effect depends almost entirely on the delivery system around the tool. The 2025 DORA data shows a 54% rise in bugs per developer and a 243% rise in incidents per PR after AI adoption, which is a real and large regression in production stability. The Faros dataset of 67,000 developers explains the variance: teams with strong existing review and automated quality gates saw 50% fewer incidents, while teams without that rigor saw roughly 2x more customer-facing incidents. The tool does not cause incidents on its own. It multiplies whatever the delivery pipeline already does, so a pipeline that already leaks defects leaks far more under machine-paced volume.

What is acceleration whiplash in software delivery?

Acceleration whiplash, a term coined by Faros Research for the 2025 DORA findings, describes what happens when code creation speeds up but the downstream gates that catch defects do not. AI roughly doubled PRs per developer, but human review, testing, and release validation were all built for human-paced output. Force-feeding machine-paced volume into those gates pushes the bottleneck downstream: review time stretched 441%, a third more PRs slipped through unreviewed, and the debt that review used to catch surfaced as production incidents instead, up 243% per PR. Whiplash is the lag between a sudden increase in input velocity and the unchanged capacity of everything that has to absorb it.

Why did PR review time increase 441% with AI coding tools?

PR review time rose 441% in the 2025 DORA data because AI roughly doubled the number of PRs without adding any reviewer capacity, and AI-authored PRs are often larger and harder to reason about. Review is a human serial process: a fixed pool of reviewers now faces nearly twice the queue, so each PR waits far longer for attention. Larger diffs and unfamiliar machine-generated patterns also make each individual review slower. The visible symptom is the 31% rise in PRs merged with zero review, which is the queue overflowing. The fix is not more reviewers. It is right-sizing review depth to PR risk and pushing routine checks onto automated gates that scale with volume.

Is AI a net positive or negative for DORA metrics in 2025?

It is both, and the same dataset contains both outcomes, which is the central finding. AI is an amplifier of existing organizational behavior rather than a uniform improvement or regression. In the Faros data of 67,000 developers, teams that already had rigorous review and automated quality gates saw 50% fewer incidents after adopting AI, because they could safely convert higher throughput into shipped value. Teams flooding a pipeline that already leaked defects saw roughly 2x more customer-facing incidents. The deciding variable is the maturity of the delivery system, not the model or the IDE. AI is net positive for organizations with rigor and net negative for organizations without it.

What is the new DORA Rework Rate metric and why does it matter?

Rework Rate is a metric DORA added in 2025 to capture quality debt that the classic four metrics miss. The original four (deployment frequency, lead time, change failure rate, and time to restore) were designed before AI doubled throughput, and they can all look healthy while a team quietly rewrites a growing share of recently shipped code. Rework Rate measures how much new code modifies or replaces code committed in the recent past, which surfaces the churn that AI-assisted development tends to generate. It matters because deployment frequency rewards volume, and AI makes volume cheap. Without a quality-debt metric, a team can post record throughput while its real productivity stalls under rework.

How should teams change their workflow to ship AI-authored code safely?

Make three changes the 2025 DORA data points to directly. First, right-size review: route low-risk AI PRs to lightweight or automated checks and reserve deep human review for high-risk changes, so the 441% review-time blowup does not force the 31% no-review shortcut. Second, scale automated quality gates with volume: tests, static analysis, and policy checks must absorb doubled PR counts that humans cannot. Third, track incidents per user, not per PR, because PR count inflates under AI, so per-PR rates shift with PR volume and hide the real customer impact. Underneath all three, treat AI adoption as a delivery-system redesign, since rigor is what separated the teams with 50% fewer incidents from those with 2x more.

DORA 2025: AI Raised Throughput 98%, Tripled Incidents

DORA 2025 found AI raised PR throughput 98% but pushed incidents per PR up 243% and review time up 441%. The acceleration whiplash data and the fix.

Sebastian Mondragon · JUNE 15, 2026 · 10 MIN READ

The 2025 DORA report is the largest look yet at what AI actually does to software delivery, and the numbers are not the clean productivity win the tooling vendors promised. Surveying thousands of developers, the report (published by Google Cloud with delivery data from Faros) found that AI raised pull requests per developer by 98%. It also found that bugs per developer rose 54%, incidents per PR rose 243%, and median PR review time rose 441%. Nearly a third more PRs (31%) now merge with no human review at all. The same adoption that nearly doubled output more than tripled how often each unit of that output breaks in production.

This is the most important DORA finding since the metrics were first published, because it breaks the assumption underneath most AI coding rollouts: that faster code creation is the goal and everything else follows. It does not follow. The 2025 data shows code velocity racing ahead while every downstream quality gate (review, testing, release validation) stayed exactly as fast as it was when humans wrote everything by hand. Faros Research named the result acceleration whiplash, and it is the defining failure mode of AI-assisted development in its current form.

This post walks through the headline numbers, explains why they coexist with genuine wins in the same dataset, and lays out what the data actually tells you to change. The short version: AI is an amplifier, not a transformer. It multiplies whatever your delivery system already does. If your pipeline already leaks defects, AI makes it leak at machine speed. If your pipeline is rigorous, AI lets you safely convert throughput into shipped value. Both outcomes are in the data, and which one you get is a choice about delivery-system design, not about which model you license.

The Headline Numbers From DORA 2025

Start with the figures, because they reframe what AI adoption is doing. Most teams measure AI's value by throughput: more PRs, more commits, more lines shipped per engineer. By that measure, AI is an unqualified success. The problem is that throughput is an input metric, and the 2025 DORA data shows what happens to the outputs.

Read top to bottom, the table tells a single story. Throughput nearly doubled. The bug rate, which had been creeping up 9% year over year, jumped to a 54% rise, a six-fold acceleration in how fast defects accumulate. Incidents per PR more than tripled, meaning each unit of shipped work is now far likelier to cause a production problem. Review time, the human gate meant to catch those defects before merge, stretched to more than five times its previous length. And when review could not keep up, the system did what overloaded systems always do: it shed the load, merging 31% more PRs with no review at all.

None of these are model-quality problems. The model is producing more code, and some of that code is worse, but the dominant effect is structural. A gate built to process human-paced output got force-fed machine-paced volume, and it failed in the predictable way: queue blowup, then bypass, then the uncaught debt resurfacing downstream as production incidents.

Metric	Change after AI adoption	What it measures
PRs per developer	+98%	Raw code-creation throughput
Bugs per developer	+54% (vs +9% prior year)	Defect generation rate
Incidents per PR	+243%	Production stability per unit shipped
Median PR review time	+441%	Downstream review-gate saturation
PRs merged with no review	+31%	Review-gate overflow / bypass rate
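
To make the table easier to reason about, the sketch below converts each percentage change into a multiplier and then composes two of them. The compounding step is not a figure from the report: it assumes the PRs-per-developer and incidents-per-PR changes describe the same population and time window, in which case incidents per developer would scale by their product.

```python
# Percent changes from the table above, expressed as fractions.
changes = {
    "prs_per_developer": 0.98,
    "bugs_per_developer": 0.54,
    "incidents_per_pr": 2.43,
    "median_review_time": 4.41,
    "prs_merged_without_review": 0.31,
}


def multiplier(pct_change: float) -> float:
    """A +243% change means the new value is 3.43x the old one."""
    return 1.0 + pct_change


multipliers = {name: multiplier(change) for name, change in changes.items()}

# ASSUMPTION (not stated in the report): the two rates compose, so
# incidents per developer = (PRs per developer) x (incidents per PR).
incidents_per_developer = (
    multipliers["prs_per_developer"] * multipliers["incidents_per_pr"]
)

for name, value in multipliers.items():
    print(f"{name}: {value:.2f}x")
print(f"implied incidents per developer: {incidents_per_developer:.2f}x")
```

Under that assumption, the implied incident load per developer is close to seven times the earlier level, which is why the per-PR figure alone understates how much extra incident response each engineer may be absorbing.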

What Acceleration Whiplash Actually Means

Acceleration whiplash is the gap between how fast you can now create code and how fast you can still safely ship it. AI compressed the create step dramatically while leaving every verify step at its old pace. The result is not faster delivery end to end. It is a relocated bottleneck and a relocated cost.

The relocation is the key insight. Before AI, defects were caught and paid for pre-merge: a reviewer flagged a problem, the author fixed it, the cost stayed inside the development loop where it is cheap. With review saturated and a third more PRs slipping through unreviewed, that same defect debt does not disappear. It moves downstream and gets paid for in production, where it is expensive, where it is a 243% rise in incidents per PR, and where the people paying are customers instead of reviewers.

This is why the throughput number is misleading on its own. A 98% increase in PRs looks like doubled productivity only if you ignore that a growing share of those PRs is unreviewed, defect-bearing, and incident-prone. The real productivity question is not how much code you created, it is how much value you shipped net of the rework and incident response that code generated. That distinction is exactly the developer productivity paradox we have written about before: perceived speed goes up while measured delivery often does not, because the time saved writing code gets spent debugging, reviewing, and reworking it.

The 31% No-Review Signal

The single most alarming number in the report is not the incident rate. It is the 31% increase in PRs merging with zero human review. That figure is the review system failing silently, and it deserves its own attention because it is both a cause and a symptom.

It is a symptom because review time rose 441%. When a reviewer's queue nearly doubles overnight and each item takes longer (AI PRs tend to be larger and use unfamiliar patterns), the rational individual response is to wave through anything that looks plausible. Multiply that across a team and you get a third more PRs merged unseen.

It is a cause because every unreviewed PR is a defect filter removed from the pipeline. Review was the gate specifically designed to catch the kind of subtle, context-dependent mistakes that automated tests miss. Disabling it on a third more PRs, precisely when the code is increasingly machine-generated and increasingly voluminous, is how the bug rate goes from a 9% annual creep to a 54% jump. The review system was built for human-paced output and is now being force-fed machine-paced volume. It did not adapt. It overflowed.

The wrong fix is to demand that humans review everything anyway, which just trades the incident spike for a delivery freeze. The right fix is to stop treating all PRs as equal review candidates, which we get to below.
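
The overflow described in this section can be illustrated with a deliberately simple queue model. This is a deterministic sketch, not a model taken from the report: the team size, the daily review capacity of 10 PRs, and the arrival rates of 9 and 18 PRs per day are all hypothetical numbers chosen so that arrivals roughly double while capacity stays fixed.

```python
def simulate_review_queue(
    arrivals_per_day: int, reviews_per_day: int, days: int
) -> list[int]:
    """Return the backlog at the end of each day for a fixed-capacity queue."""
    backlog = 0
    history = []
    for _ in range(days):
        backlog += arrivals_per_day              # new PRs opened today
        backlog -= min(backlog, reviews_per_day)  # reviewers clear what they can
        history.append(backlog)
    return history


# Hypothetical team: reviewers can clear 10 PRs per day.
before = simulate_review_queue(arrivals_per_day=9, reviews_per_day=10, days=20)
after = simulate_review_queue(arrivals_per_day=18, reviews_per_day=10, days=20)

print("backlog before AI after 20 days:", before[-1])
print("backlog after AI after 20 days:", after[-1])
print("approx. days of review work queued:", after[-1] / 10)
```

Once arrivals exceed capacity, the backlog grows without bound instead of settling at a higher level, which is the kind of pressure that leads teams to merge without review. Real review queues are stochastic and reviewers slow down on larger diffs, so this sketch likely understates the effect.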

AI as Amplifier, Not Transformer

Here is the finding that should change how every engineering leader reads this report: the 2025 dataset contains both large regressions and large improvements, in the same metrics, at the same time. AI did not move everyone in one direction. It amplified the direction each organization was already heading.

The Faros dataset of 67,000 developers makes this concrete. Organizations that already had strong delivery rigor (real review discipline, automated quality gates, mature testing) saw 50% fewer incidents after adopting AI. Organizations flooding a pipeline that already leaked defects saw roughly 2x more customer-facing incidents. Same tools, opposite outcomes, and the deciding variable was the maturity of the delivery system the AI was dropped into.

This is the difference between an amplifier and a transformer. A transformer changes the signal. An amplifier makes whatever signal you feed it louder. AI is the latter. If you feed it a disciplined delivery process, it makes that discipline more productive. If you feed it a broken one, it makes the breakage more frequent and more visible. The teams getting burned are not the ones using AI wrong at the keyboard. They are the ones who never built the downstream system that AI volume requires.

That amplification also shows up in the codebase itself, not just in incidents. A separate analysis found that AI-assisted development drives code churn and cloning, with a measurable share of recently committed code being rewritten or duplicated rather than reused. That churn is the same whiplash viewed from inside the repository: volume up, durable value not keeping pace.

Organization profile	Incident outcome after AI	Why
Strong review + automated gates	50% fewer incidents	Higher throughput safely converted to shipped value
Weak review + few automated gates	~2x more incidents	Higher throughput floods an already-leaky pipeline

Why the Classic Four DORA Metrics Miss This

The original DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) were defined for a world where humans wrote the code. AI broke an assumption baked into them: that throughput is scarce and therefore a reasonable proxy for productivity. AI makes throughput cheap, and a cheap input is a bad proxy.

Deployment frequency is the clearest casualty. It rewards shipping often, and AI makes shipping often trivial, so a team can post record deployment frequency while quietly rewriting a growing share of what it just shipped. The metric looks elite. The real productivity is flat or negative once you net out the rework.

This is why DORA added a fifth metric in 2025: Rework Rate, which measures how much new code modifies or replaces code committed in the recent past. Rework Rate is designed to catch exactly the quality debt that AI generates and that the classic four metrics paper over. A high deployment frequency paired with a high rework rate is not a high-performing team. It is a team spinning, shipping volume that it has to unship and reship.
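
As a rough illustration of the idea, the sketch below computes a rework share from a list of line-level changes. It is not DORA's official formula, which this post does not spell out: the `LineChange` record, the `rework_rate` function, and the 21-day window are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class LineChange:
    changed_on: date
    # Date the replaced line was originally committed; None for a brand-new line.
    replaced_line_committed_on: Optional[date]


def rework_rate(changes: list[LineChange], window_days: int = 21) -> float:
    """Share of changed lines that modify code committed within the window.

    ASSUMPTION: the 21-day default window is illustrative, not a DORA value.
    """
    if not changes:
        return 0.0
    window = timedelta(days=window_days)
    reworked = sum(
        1
        for c in changes
        if c.replaced_line_committed_on is not None
        and c.changed_on - c.replaced_line_committed_on <= window
    )
    return reworked / len(changes)


today = date(2026, 6, 15)
sample = [
    LineChange(today, None),                         # new code
    LineChange(today, today - timedelta(days=5)),    # rewrites last week's code
    LineChange(today, today - timedelta(days=200)),  # touches old code
    LineChange(today, today - timedelta(days=10)),   # rewrites recent code
]
print(f"rework rate: {rework_rate(sample):.0%}")
```

In practice the line ages would come from version-control history such as blame data. Whatever the exact definition, the useful signal is the trend of this share alongside deployment frequency rather than either number alone.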

DORA also restructured the whole framework in 2025, replacing the familiar low, medium, high, and elite performer clusters with team archetypes. The cluster model implied a single ladder of performance that everyone climbs the same way. The 2025 data shattered that, because the same AI adoption sent some teams up and others down. Archetypes acknowledge that different organizational profiles get different outcomes from the same tools, which is the amplifier finding encoded into the framework itself.

What the Data Says To Do

The report is unusually prescriptive once you read past the headlines, because the failure modes are structural and structural problems have structural fixes. Three changes follow directly from the numbers.

Right-Size Review for AI-Authored PRs

The 441% review-time blowup and the 31% no-review jump are the same problem: a uniform review process applied to a doubled, increasingly machine-generated PR stream. The fix is to stop reviewing all PRs the same way. Route low-risk changes (dependency bumps, generated boilerplate, well-tested refactors) to automated checks and lightweight approval. An AI code reviewer is the natural automated check for that first pass, and which AI code review tool actually fits comes down to the recall-versus-noise trade-off, because a high-noise bot just rebuilds the review load you were trying to shed. Reserve deep human review for high-risk changes: anything touching auth, payments, data migrations, or core business logic. The goal is to spend your scarce, expensive reviewer attention where it actually prevents incidents, instead of spreading it so thin that a third more PRs get none at all.
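
One way to encode that routing is a small classifier that runs when a PR is opened. The sketch below is illustrative only: the path prefixes, label names, tier names, and the 400-line threshold are hypothetical and would need to match your own repository layout and risk model.

```python
from dataclasses import dataclass, field

# HYPOTHETICAL path prefixes for areas the text calls high risk.
HIGH_RISK_PREFIXES = ("auth/", "payments/", "migrations/", "core/")
# HYPOTHETICAL labels marking the low-risk categories named in the text.
LOW_RISK_LABELS = {"dependency-bump", "generated", "refactor-tested"}


@dataclass
class PullRequest:
    changed_paths: list[str]
    labels: set[str] = field(default_factory=set)
    lines_changed: int = 0


def review_tier(pr: PullRequest, max_light_lines: int = 400) -> str:
    """Pick a review tier; high-risk paths always win over low-risk labels."""
    if any(path.startswith(HIGH_RISK_PREFIXES) for path in pr.changed_paths):
        return "deep-human-review"
    if pr.labels & LOW_RISK_LABELS and pr.lines_changed <= max_light_lines:
        return "automated-checks"
    return "standard-review"


print(review_tier(PullRequest(["requirements.txt"], {"dependency-bump"}, 12)))
print(review_tier(PullRequest(["auth/login.py"], {"dependency-bump"}, 5)))
print(review_tier(PullRequest(["api/handlers.py"], set(), 80)))
```

Checking the high-risk paths first is a deliberate choice: a label such as `dependency-bump` should never be able to downgrade a change that touches authentication or payments.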

Scale Automated Quality Gates With Volume

Humans do not scale to a 98% throughput increase. Automated gates do. Static analysis, comprehensive test suites, policy-as-code checks, and security scanning all absorb doubled PR counts without doubling headcount, and they catch the routine defects that were drowning your reviewers. The mistake is treating automated gates as a nice-to-have you bolt on later. In an AI-volume world they are the primary defect filter, and human review becomes the targeted second layer. The teams in the Faros data that saw 50% fewer incidents are the ones that already had this layer built. This is also where strong agent scaffolding beats raw model upgrades: the system around the model, including the verification gates, determines real-world reliability far more than the model version does.

Track Incidents Per User, Not Per PR

Per-PR metrics are now misleading because PR count is inflated by AI. A 243% rise in incidents per PR sounds catastrophic, and it is serious, but PR-denominated rates move whenever PR volume moves, so they distort the size of a regression and obscure the only thing that matters: customer impact. Track incidents and severity per active user or per unit of business value delivered. That denominator does not inflate with AI volume, so it tells you whether your customers are actually experiencing more failures, which is the question executives and customers actually care about. Pairing that with continuous production monitoring for quality drift closes the loop, because the incidents this data is warning about surface in production behavior, not in pre-merge metrics.
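
The difference between the two denominators is easy to see with a small calculation. All figures below are made up for illustration and are not from the DORA or Faros data: a hypothetical team whose PR count nearly doubles between two quarters while its user base stays flat.

```python
def rate_per(numerator: float, denominator: float, scale: float = 1.0) -> float:
    """Return scale * numerator / denominator, rejecting empty denominators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return scale * numerator / denominator


def pct_change(old: float, new: float) -> float:
    return (new - old) / old


# HYPOTHETICAL quarterly figures for one team.
before = {"incidents": 10, "prs": 1000, "active_users": 50_000}
after = {"incidents": 14, "prs": 1980, "active_users": 50_000}

per_pr_change = pct_change(
    rate_per(before["incidents"], before["prs"]),
    rate_per(after["incidents"], after["prs"]),
)
per_user_change = pct_change(
    rate_per(before["incidents"], before["active_users"], scale=10_000),
    rate_per(after["incidents"], after["active_users"], scale=10_000),
)

print(f"incidents per PR: {per_pr_change:+.1%}")
print(f"incidents per 10k active users: {per_user_change:+.1%}")
```

In this invented quarter the per-PR rate improves even though customers see 40% more incidents, because the denominator grew faster than the failures did. The per-user rate is unaffected by how many PRs were opened, which is what makes it the more stable signal.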

The Practitioner Verdict

The honest read of the 2025 DORA report is uncomfortable for most engineering organizations: adopting AI coding tools without redesigning your delivery system actively worsens production stability, and that is the majority case, not the edge case. A 54% bug increase, a 243% rise in incidents per PR, and a third more PRs merging unreviewed are not the price of progress. They are the cost of dropping machine-paced output into human-paced gates and hoping it works out.

The good news is symmetrical. The same data shows that teams with rigor turned AI into a 50% reduction in incidents, because for them higher throughput was a benefit they could safely absorb. The deciding factor was never the model. It was whether the delivery system was built to handle the volume the model produces. AI is an amplifier. It will make your delivery process louder in whichever direction it already points.

This is the work Particula Tech runs as a delivery-system audit: we map where AI-driven volume is overrunning your review and quality gates, identify the specific gates that broke under the new throughput, and redesign them (right-sized review, automated gates scaled to volume, and incident metrics that survive PR inflation) so that the throughput gain becomes a shipped-value gain instead of an incident spike. For the broader strategy on tooling, CI, and the systems that decide whether AI development is net positive, our AI development tools pillar is the place to start.

The teams that win the next two years will not be the ones that adopted AI fastest. They will be the ones that rebuilt their delivery system to deserve the speed.

FAQ

Quick answers to the questions this post tends to raise.
