AI generates roughly 41% of new code, and the maintenance bill is now measurable. GitClear's 211M-line study (2020-2024) found code churn rose from 3.1% to 5.7% (+39% in AI-heavy projects), copy-pasted lines climbed from 8.3% to 12.3% and for the first time exceeded moved/refactored lines, and the refactored share collapsed from 25% to under 10% (a 60% decline). AI-authored PRs carry 1.7x more issues (10.83 vs 6.45 per PR), tech debt rises 30-41% post-adoption, and unmanaged AI code drives maintenance cost to roughly 4x by year two. Gate on duplication and churn, not just test pass.
GitClear analyzed 211 million changed lines of code across repositories from Google, Microsoft, Meta, and others, spanning 2020 to 2024. Over that window, AI coding assistants went from novelty to default, and roughly 41% of new code is now AI-generated. The study tracked what happened to the codebases underneath. Code churn, the share of lines reverted or rewritten within two weeks of being committed, climbed from 3.1% to 5.7%. In AI-heavy projects it rose 39%. For the first time in the dataset's history, copy-pasted lines outnumbered moved or refactored lines.
That last sentence is the one to sit with. Refactoring, the act of reshaping existing code, used to outpace cloning. Now cloning wins. A codebase that copies more than it consolidates is a codebase quietly accreting duplication, and duplication is the raw material of tech debt. The velocity is real. So is the bill that arrives in year two.
This is the data analysis underneath every "AI made us 2x faster" claim. The speed shows up immediately and the maintenance cost shows up later, which is exactly why most teams justify adoption on velocity and only discover the churn and duplication metrics after the codebase has already absorbed them. Below is what the numbers actually say, why AI assistants produce this specific failure pattern, and the gates that let you keep the speed without the debt.
The Velocity-vs-Maintenance Gap
Every AI coding rollout I have seen is sold on the same metrics: lines shipped, tickets closed, time-to-first-PR. Those are first-order numbers, and they genuinely improve. The problem is that code quality is a second-order metric. It does not show up in this sprint's velocity chart; it shows up six months later as the rate at which changes get harder.
The METR randomized controlled trial put a number on the gap. Experienced open-source developers working on their own mature repositories were 19% slower when using AI tools, even though they expected a speedup and reported feeling faster. We unpacked that finding in detail in why AI coding tools make developers 19% slower. GitClear's churn data is the codebase-level mirror of the same effect: a 39% rise in code that gets reverted or rewritten within two weeks is, definitionally, velocity you paid for and then discarded.
The two studies measure different things (one developer time, the other code lifespan), but they point at the same gap. The speed is front-loaded. The cost is back-loaded. And the metrics most teams watch are all front-loaded, which is how a velocity win and a maintainability loss can coexist on the same dashboard without anyone noticing until the codebase tells them.
GitClear's 211M-Line Study: Churn Rose 39% in AI-Heavy Projects
GitClear's central metric is code churn: lines that are reverted or substantially rewritten within two weeks of being committed. The logic is that healthy code, code that was thought through, tends to stick. Code that gets ripped out almost immediately is a proxy for output that should not have shipped in that form. The two-week window is deliberately tight to catch genuine rework rather than normal long-term evolution.
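To make the definition concrete, here is a minimal sketch of a churn calculation under the two-week window described above. The `LineRecord` schema and the inclusive 14-day cutoff are assumptions made for illustration; this is not GitClear's actual implementation, and real line tracking across edits is considerably harder than this suggests.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

CHURN_WINDOW = timedelta(days=14)  # the two-week window from the definition above


@dataclass
class LineRecord:
    """One committed line (hypothetical schema, for illustration only)."""
    committed_at: datetime
    removed_at: Optional[datetime] = None  # when it was reverted or rewritten, if ever


def is_churned(line: LineRecord, window: timedelta = CHURN_WINDOW) -> bool:
    """A line counts as churn if it was undone within the window (inclusive here)."""
    if line.removed_at is None:
        return False
    return line.removed_at - line.committed_at <= window


def churn_rate(lines: list[LineRecord], window: timedelta = CHURN_WINDOW) -> float:
    """Share of committed lines that were undone within the window."""
    if not lines:
        return 0.0
    churned = sum(is_churned(line, window) for line in lines)
    return churned / len(lines)


t0 = datetime(2024, 1, 1)
sample = [
    LineRecord(t0),                            # still alive
    LineRecord(t0, t0 + timedelta(days=3)),    # rewritten after 3 days: churn
    LineRecord(t0, t0 + timedelta(days=40)),   # rewritten after 40 days: normal evolution
    LineRecord(t0),                            # still alive
]
print(f"{churn_rate(sample):.1%}")  # 25.0%
```

The window is the only parameter that matters here: widen it and ordinary long-term evolution starts to count as churn, which is the distinction the tight two-week cutoff is meant to preserve.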
Here is the trend across the study window:
| Year | Code churn (reverted/rewritten within 2 weeks) | Context |
|---|---|---|
| 2020 | 3.1% | Pre-assistant baseline |
| 2021 | ~3.5% | Early Copilot |
| 2022 | ~4.1% | Assistant adoption accelerating |
| 2023 | ~4.9% | ChatGPT-era coding mainstream |
| 2024 | 5.7% | ~41% of new code AI-generated |
The 3.1% to 5.7% climb is the industry average; inside AI-heavy projects specifically, churn rose 39% relative to baseline. That is not a rounding error. At the industry-average rate, roughly one in eighteen committed lines is now undone within a fortnight, up from one in thirty-two at the start of the window.
A fair objection: this is correlational. GitClear did not run a controlled experiment that withheld AI tools from one group, so the study shows churn rising alongside AI adoption, not AI adoption causing churn. Other things changed over 2020-2024 too. But the size of the effect, its concentration in AI-heavy projects, and the mechanism, which I will get to, make the alternative explanations strain. The honest framing is that this is a strong pattern, not a proof, and it lines up with everything else in the data.
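The "one in eighteen" and "one in thirty-two" figures quoted for the industry-average churn rates are simply the reciprocals of the percentages. A quick check, using only the 2020 and 2024 values from the table (the helper name is illustrative):

```python
def one_in_n(rate_percent: float) -> float:
    """Convert a percentage rate into 'one in N' form."""
    return 100.0 / rate_percent


churn_2020 = 3.1  # industry-average churn at the start of the window, in percent
churn_2024 = 5.7  # industry-average churn at the end of the window, in percent

print(f"2020: one in {one_in_n(churn_2020):.1f} lines")  # one in 32.3
print(f"2024: one in {one_in_n(churn_2024):.1f} lines")  # one in 17.5
```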
Cloning Now Beats Refactoring for the First Time
The churn number is the headline, but the duplication numbers are the more structural finding. GitClear tracks copy-pasted lines against moved/refactored lines, the latter being the signature of someone reshaping code rather than bolting more on.
Two things happened at once. Copy-pasted lines rose from 8.3% in 2021 to 12.3% in 2024, and crucially crossed above moved/refactored lines, the first time in the dataset that cloning outpaced reshaping. Meanwhile duplicate code blocks grew roughly tenfold between 2022 and 2024.
Duplication is not a cosmetic problem. Every cloned block is a change you have committed to making in N places instead of one. The first time a bug lives in five copies of the same function, the cost of the original shortcut comes due, with interest. A codebase where cloning beats refactoring is a codebase whose change cost is compounding silently, and the compounding does not appear on any sprint board.
| Metric | Baseline | 2024 | Change |
|---|---|---|---|
| Copy-pasted lines (share) | 8.3% (2021) | 12.3% | +48% |
| Refactored share of changes | 25% (2021) | <10% | ~60% decline |
| Duplicate code blocks | 1x (2022) | ~10x | ~10x growth, 2022-2024 |
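Duplicate blocks are mechanically detectable, which is part of why they make a good gate. The sketch below shows one simple way to find them: hash every sliding window of normalized lines and report windows that occur more than once. The four-line window and whitespace-only normalization are arbitrary illustrative choices, not the method GitClear uses; production clone detectors typically work on tokens rather than raw lines.

```python
import hashlib
from collections import defaultdict

WINDOW = 4  # minimum clone length in lines; an arbitrary illustrative choice


def normalize(line: str) -> str:
    """Collapse whitespace so formatting differences do not hide a clone."""
    return " ".join(line.split())


def find_clones(files: dict[str, str], window: int = WINDOW) -> dict[str, list[tuple[str, int]]]:
    """Map each repeated block's hash to the (file, 1-based start line) places it occurs."""
    seen: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for path, text in files.items():
        # Keep original line numbers while dropping blank lines.
        numbered = [(i + 1, normalize(raw)) for i, raw in enumerate(text.splitlines())]
        numbered = [(n, line) for n, line in numbered if line]
        for i in range(len(numbered) - window + 1):
            block = "\n".join(line for _, line in numbered[i:i + window])
            digest = hashlib.sha1(block.encode("utf-8")).hexdigest()
            seen[digest].append((path, numbered[i][0]))
    # Only blocks seen in more than one place are clones.
    return {digest: places for digest, places in seen.items() if len(places) > 1}


body = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
example_files = {
    "a.py": body,
    "b.py": body + "\nprint(total([1]))\n",
}
clones = find_clones(example_files)
for places in clones.values():
    print(places)
```

Because the windows overlap, a five-line clone is reported as two overlapping four-line hits; a real tool would merge those into one finding. The point is only that every hit is a place where one future change has become N changes.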
The Refactoring Collapse
The refactored share of all changes falling from 25% to under 10% is, to me, the most worrying single number in the study. Refactoring is how a codebase stays shapeable. It is the act of paying down structure so the next feature is cheap. When that share collapses by 60%, the codebase stops getting reshaped and only grows.
The mechanism is straightforward once you think about how the model and the human interact. Refactoring requires understanding the existing structure well enough to improve it. An AI assistant prompted with "add X" has no incentive to reorganize what is already there, and the developer accepting the suggestion often has not read the surrounding code closely enough to refactor it either. The path of least resistance is to add, not to reshape. Multiply that across thousands of commits and the refactored share collapses exactly as the data shows.
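The add-versus-reshape distinction can be measured per commit. As a rough sketch, an added line that matches a line removed in the same commit can be treated as moved, one that matches a line already present elsewhere as copy-pasted, and anything else as genuinely new. The function names and the exact-match rule are assumptions for illustration, not GitClear's classification; exact matching will also over-count trivial lines such as a bare `return`.

```python
from collections import Counter


def classify_added_lines(added: list[str], removed: list[str], existing: list[str]) -> Counter:
    """Label each added line as 'moved', 'copy_pasted', or 'added' (simplified heuristic)."""
    removed_pool = Counter(line.strip() for line in removed if line.strip())
    existing_set = {line.strip() for line in existing if line.strip()}
    counts: Counter = Counter()
    for raw in added:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if removed_pool[line] > 0:
            removed_pool[line] -= 1      # each removed line can explain one move
            counts["moved"] += 1
        elif line in existing_set:
            counts["copy_pasted"] += 1   # same text already lives elsewhere
        else:
            counts["added"] += 1
    return counts


def shares(counts: Counter) -> dict[str, float]:
    """Convert raw counts into shares of all classified lines."""
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


added = ["def area(w, h):", "    return w * h", "    log.info('saving')", "x = compute()"]
removed = ["def area(w, h):", "    return w * h"]
existing = ["    log.info('saving')", "y = 1"]

counts = classify_added_lines(added, removed, existing)
print(shares(counts))  # {'moved': 0.5, 'copy_pasted': 0.25, 'added': 0.25}
```

Tracked over time, the signal to watch is the one the study highlights: the `copy_pasted` share rising above the `moved` share.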
This is the same failure pattern, at the codebase scale, that shows up at the agent scale on harder benchmarks. When a coding agent edits one file to make a test pass and never touches the consumers it just broke, it is optimizing for a green check, not a coherent change. We documented that dynamic in agent scaffolding beats model upgrades on SWE-Bench: the model is capable of the better change, but the default loop does not ask for it. Code churn and the refactoring collapse are the same incentive problem expressed in commit history rather than benchmark scores.
The Quality Signal: 1.7x More Issues Per PR
The churn and duplication metrics describe what is happening to the codebase structurally. The pull-request data describes what reviewers actually catch. Analysis of AI-authored versus human-authored PRs found AI-authored pull requests carry 1.7x more issues: 10.83 issues per PR versus 6.45.
The compounding estimates follow from there. Industry analysis puts the post-adoption increase in tech debt at 30-41% and estimates that unmanaged AI code drives maintenance cost to roughly 4x by the second year. Those second-year numbers are projections, not measured outcomes from a five-year study, so treat them as directional rather than precise. But the direction is consistent across every dataset: more issues per PR, more duplication, more churn, less refactoring, and a maintenance curve that bends the wrong way over time.
The phrase that explains all of it is "it compiles." AI-generated code clears the bar that gets attention (a passing test or a clean run) while quietly failing the bars that do not: reuse, structure, and reviewability. Code that works on the first try is exactly the code that ships without scrutiny, which is why the issue count per PR is higher even though the code "works."
| PR source | Issues per PR | Relative |
|---|---|---|
| Human-authored | 6.45 | 1.0x |
| AI-authored | 10.83 | 1.7x |
The Year-Two Cost and Why It Hides
The reason this debt is dangerous is that it is invisible during the period when you are deciding whether AI coding is working. Year one is all upside on the metrics most teams watch. The lines ship, the tickets close, the velocity chart goes up and to the right. The churn, the duplication, and the suppressed refactoring are accumulating, but they do not yet hurt because the debt is still small enough relative to the codebase to be absorbed.
Year two is when the maintenance multiplier, estimated at roughly 4x for unmanaged AI code, starts to bite. Now every change touches a duplicated block that has drifted out of sync. Now the refactoring that was deferred is a major undertaking instead of a routine cleanup. Now the 1.7x issue rate compounds across a larger surface. The velocity that justified adoption is eroding, and the cause is a year of decisions that looked free at the time.
This is the same shape as the cost runaways we see elsewhere in AI systems, where the meter spins quietly until the invoice lands. The discipline that prevents it is identical: instrument the leading indicator before it becomes a lagging cost. For teams already worried their existing systems have accumulated this kind of hidden debt, the diagnostic approach in how to audit your AI for bugs, bias, and performance issues applies directly to AI-generated code: you measure the debt before you can manage it. At Particula Tech, the code-quality gates we build for clients start exactly here, with churn and duplication baselines, because you cannot govern a number you are not watching.
Policy Implication: Gate on Duplication and Churn, Not Just Test Pass
The fix is not to ban the tools. The productivity gains on boilerplate, scaffolding, tests, and exploration are real, and prohibition cedes them to competitors who keep the tools and add the guardrails. The fix is to change what you gate on. A passing test suite is necessary but not sufficient, because every metric in this study sits underneath a green build.
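The practical gate set starts with duplication and works outward: a hard clone threshold in CI, a tracked churn baseline, review on AI-heavy PRs, and protected refactoring time so the refactored-to-added ratio stays off the floor.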
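A minimal sketch of what gating on duplication and churn alongside test results could look like in CI. The `PRMetrics` fields, the `GateThresholds` defaults, and the function names are all hypothetical; the threshold values in particular are placeholders to calibrate against your own baseline, not numbers taken from the study.

```python
from dataclasses import dataclass


@dataclass
class PRMetrics:
    """Metrics computed for one pull request (hypothetical schema)."""
    tests_passed: bool
    duplicated_share: float  # share of added lines inside detected clones, 0..1
    churn_rate: float        # trailing two-week churn for the touched area, 0..1


@dataclass
class GateThresholds:
    """Placeholder limits; calibrate against your own baseline."""
    max_duplicated_share: float = 0.10
    max_churn_rate: float = 0.05


def evaluate_gates(metrics: PRMetrics, limits: GateThresholds) -> list[str]:
    """Return a list of human-readable failures; an empty list means the PR passes."""
    failures = []
    if not metrics.tests_passed:
        failures.append("test suite failed")
    if metrics.duplicated_share > limits.max_duplicated_share:
        failures.append(
            f"duplicated share {metrics.duplicated_share:.0%} exceeds "
            f"{limits.max_duplicated_share:.0%}"
        )
    if metrics.churn_rate > limits.max_churn_rate:
        failures.append(
            f"churn rate {metrics.churn_rate:.0%} exceeds {limits.max_churn_rate:.0%}"
        )
    return failures


def exit_code(failures: list[str]) -> int:
    """Non-zero when any gate failed, so the CI job blocks the merge."""
    return 1 if failures else 0


# A PR with a green build that still fails the duplication gate.
metrics = PRMetrics(tests_passed=True, duplicated_share=0.14, churn_rate=0.03)
failures = evaluate_gates(metrics, GateThresholds())
for failure in failures:
    print("GATE FAILED:", failure)
# In a CI script, pass exit_code(failures) to sys.exit().
```

The design choice worth noting is that the test result is one gate among several rather than the final verdict: a green build with 14% duplicated lines still fails.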
The deeper point is that these are process gates, not tool restrictions. The same teams that win with AI coding are the ones who picked their harness and their review discipline as deliberately as their model. That framing carries straight into tool selection itself: in Cursor vs Claude Code for 2026, the differentiator is rarely the underlying model and almost always the workflow and guardrails wrapped around it. The same is true here. The tools are not the variable that determines whether you accumulate debt. Your gates are.
For the broader picture of where AI coding tools fit in a production engineering stack, our AI development tools pillar covers the surrounding decisions: which agent, which sandbox, which observability layer, and how the pieces compose into a workflow that ships fast without shipping debt.
What the Data Tells Engineering Leaders
Three takeaways from the numbers.
The speed is real and so is the debt; they are not in tension if you measure both. AI generates roughly 41% of new code and the velocity gain on isolated tasks is genuine. The churn (+39% in AI-heavy projects), the duplication (cloning now exceeds refactoring), and the issue rate (1.7x per PR) are equally genuine. A team that only watches velocity will conclude AI coding is an unambiguous win and discover the year-two maintenance curve too late.
Duplication is the metric to gate on first. Of everything in the GitClear study, duplication is the most mechanically detectable and the most directly tied to future cost. Cloned lines now exceed refactored lines for the first time, and duplicate blocks grew tenfold in two years. A hard duplication threshold in CI catches the failure mode at the moment it is cheapest to fix.
Govern, do not prohibit. The teams that lose treat AI output as free because it compiles. The teams that win treat it as draft code held to the same maintainability bar as anything else: review on AI-heavy PRs, an enforced clone threshold, and protected refactoring time so the refactored-to-added ratio stays off the floor. The data does not say stop using AI to write code. It says stop pretending the output is finished the moment the test goes green.
The 41% number is going up, not down. The question for 2026 is not whether AI writes your code; it already writes a large share of it. The question is whether you are measuring what it costs you, or only what it saves you.
Frequently Asked Questions
On maintainability signals, the data says AI-generated code is worse than human-written code, on average and at current tooling defaults. GitClear's 211M-line study found code churn (lines reverted or rewritten within two weeks) rose 39% in AI-heavy projects, copy-pasted lines now exceed refactored lines for the first time, and the refactored share fell 60%. Separate analysis shows AI-authored pull requests carry 1.7x more issues (10.83 vs 6.45 per PR). The code often works on first run, which is exactly why it ships, but it duplicates instead of reusing, and it adds rather than restructures. The problem is not that models can't write good code; it is that the default loop optimizes for a passing test, not for a clean codebase.



