Everyone knows AI speeds up coding. Developers feel faster. Companies buy licenses. Vendors show impressive demos. Experts predict massive productivity gains.
Researchers at METR tested this with real developers on real code. Sixteen experienced contributors completed 246 tasks on mature open-source projects—average 23,000 GitHub stars, 1.1 million lines of code, rigorous review standards. Half the tasks allowed AI tools, half didn't.
Tasks with AI took 19% longer.
Developers predicted AI would make them 24% faster. ML experts predicted 38% speedup. Economics experts predicted 39%. Everyone was wrong.
Why This Matters
We're terrible at knowing what makes us productive.
Developers used these tools for 30-50 hours and still believed they were 20% faster when they were 19% slower. Their intuition was backward. Expert predictions missed by roughly 60 percentage points: a forecast 38-39% speedup against an observed 19% slowdown.
This gap should concern you if you're deploying AI tools. Decisions based on feelings rather than measurement waste money and time.
What Happened
The study measured time to complete well-defined tasks with requirements fixed before randomization. You can't game this by expanding scope or writing verbose code.
Developers used Cursor Pro and Claude 3.5/3.7 Sonnet—frontier tools from early 2025. Researchers labeled 143 hours of screen recordings to see where time went.
When AI was allowed, developers spent less time writing code (37% → 27%) and reading (24% → 16%). They spent more time prompting AI (8%), waiting for generations (4%), reviewing output (9%), and idle (11%).
They accepted less than 44% of what AI generated, and a majority of developers (56%) reported making major changes to clean up the code they did accept.
The workflow: stop working, write a prompt, wait, review carefully, reject it about half the time, clean up what you accept, resume work. The overhead of managing the AI outweighed the time it saved on writing code.
Why AI Slowed Developers Down
Experience and complexity worked against AI. These developers averaged 1,500 commits to their repositories. They knew the patterns, edge cases, and gotchas. AI couldn't match that tacit knowledge.
The repositories averaged 10 years old with 1.1 million lines of code. One developer reported AI "failed to properly apply edits and started editing random other parts of the file." Another: "AI introduced as many errors as it fixed."
Software at this scale contains knowledge never written down: why patterns exist, which backwards-compatibility constraints matter, which edge cases are important. One developer: "We know the data that will interact with the code, but the model doesn't."
AI acts like a new contributor. On mature projects with rigorous standards, new contributors slow things down. AI never stops being new.
Low reliability created waste. Every rejected suggestion still cost time: crafting the prompt, reviewing the wrong output, testing it before rejecting it, and context-switching between working and managing the AI.
One developer: "I wasted at least an hour trying to solve the issue with AI" before implementing it directly.
Developers misjudged AI's impact. Even after extensive use, they believed AI made them faster when it made them slower. This led them to keep using AI on tasks where working directly would have been faster.
When AI Actually Helps
AI helps most with unfamiliar tasks and new domains. One developer noted Cursor was "super helpful figuring out how to write a frontend test. I didn't know how to do this before." Another working with Git hooks for the first time: "Without AI the implementation would've taken me 3 additional hours."
The pattern: AI helps when you're learning. It hurts when you're expert.
Sixty-nine percent of developers continued using Cursor after the study ended. They're getting value—some report it feels less effortful, others find it useful for specific subtasks. The tools aren't useless. They're just not doing what everyone thinks they're doing.
What This Means for You
If you're deploying AI coding tools: Stop relying on developer sentiment. Measure task completion time on comparable work. Track separately for junior versus senior developers, familiar versus unfamiliar codebases, greenfield versus legacy systems.
Expect the benefit to vary wildly. Results showing 50% speedup on toy problems don't predict performance on complex production systems.
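A minimal sketch of that kind of tracking, assuming you log one row per completed task. The column names, the inline sample rows, and the choice of pandas are my assumptions, not anything from the study, and the comparison is a plain mean rather than the study's regression-based estimate.

```python
# Minimal sketch: compare completion time with and without AI, split by the
# segments above. Column names and sample rows are hypothetical; replace the
# inline data with your own task log. This is a plain mean comparison, not
# the regression-based estimate the METR study uses.
import pandas as pd

tasks = pd.DataFrame([
    {"developer_level": "senior", "codebase_familiarity": "high", "ai_allowed": True,  "hours": 3.5},
    {"developer_level": "senior", "codebase_familiarity": "high", "ai_allowed": False, "hours": 2.9},
    {"developer_level": "junior", "codebase_familiarity": "low",  "ai_allowed": True,  "hours": 5.0},
    {"developer_level": "junior", "codebase_familiarity": "low",  "ai_allowed": False, "hours": 6.2},
    # ...add many more rows; you need dozens per segment for a usable signal
])

# Mean completion time and task count per segment, with and without AI.
summary = (
    tasks
    .groupby(["developer_level", "codebase_familiarity", "ai_allowed"])["hours"]
    .agg(["mean", "count"])
    .unstack("ai_allowed")
)
print(summary)

# Relative change per segment: positive means AI-allowed tasks took longer.
speed_change = summary[("mean", True)] / summary[("mean", False)] - 1
print(speed_change.round(2))
```

The segmentation matters: the 19% figure comes from experienced developers on code they know intimately, and the same measurement on juniors working in unfamiliar codebases may show the opposite sign.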
If you're an experienced developer: Your intuition about AI productivity can be completely wrong even after extensive use. Test deliberately: randomize half your tasks to allow AI, track completion time, compare.
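That test can be as small as a coin flip and a log file. A minimal sketch, assuming a hypothetical task_log.csv with task, ai_allowed, and minutes columns that you fill in by hand:

```python
# Minimal self-experiment sketch. The file name and column names are
# hypothetical; adapt them to however you already track tasks.
import csv
import random
import statistics
import sys

LOG = "task_log.csv"  # expected columns: task, ai_allowed (yes/no), minutes


def assign() -> None:
    """Flip a coin when you pick up a new task, then honor the result."""
    print("AI allowed" if random.random() < 0.5 else "No AI")


def summarize() -> None:
    """Compare mean completion time for AI-allowed vs. no-AI tasks."""
    try:
        with open(LOG, newline="") as f:
            rows = list(csv.DictReader(f))
    except FileNotFoundError:
        print(f"No log found at {LOG}")
        return
    with_ai = [float(r["minutes"]) for r in rows if r["ai_allowed"] == "yes"]
    without = [float(r["minutes"]) for r in rows if r["ai_allowed"] == "no"]
    if not with_ai or not without:
        print("Not enough tasks in both groups yet.")
        return
    m_ai, m_no = statistics.mean(with_ai), statistics.mean(without)
    print(f"AI allowed: {m_ai:6.1f} min average over {len(with_ai)} tasks")
    print(f"No AI:      {m_no:6.1f} min average over {len(without)} tasks")
    print(f"Relative change with AI: {m_ai / m_no - 1:+.0%}")


if __name__ == "__main__":
    assign() if "assign" in sys.argv else summarize()
```

One caution drawn from the study's design: compare only tasks of similar size and familiarity, and expect to need dozens of them before the difference means anything.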
Consider that AI might help most on tasks you do least. If you're expert at what you're building, AI might slow you down.
If you're forecasting AI impact: Expert forecasts missed by even more than developer intuitions did. Benchmark results don't predict real-world impact. Impressive performance on coding challenges doesn't transfer to messy production codebases with accumulated context.
Plan for a large gap between lab performance and field performance.
The Broader Pattern
This reveals something bigger than AI productivity. It shows how consistently we misjudge our tools.
We've seen this before: Agile sprints that felt faster but didn't ship more features. Standing desks that felt healthier despite mixed evidence. Open offices that felt collaborative but destroyed deep work.
Things that feel productive often aren't. The more emotionally invested we are in a tool, the worse our judgment becomes.
AI coding tools feel productive: code appears on screen, the IDE is doing something, it looks like progress. But that feeling doesn't correlate with speed.
The only way to know is measurement. Rigorous, controlled measurement with fixed outcomes. Not surveys. Not vibes. Not vendor benchmarks.
It's annoying. It takes time.
But given how expensive engineering time is, and how wrong our intuitions can be (experts predicted a roughly 40% speedup; the reality was a 19% slowdown), measurement might be the most valuable work you can do.
Start With the Outcome
The principle: start with the outcome you want, not the one AI can most easily deliver.
If you want to complete features faster, measure completion time. Not lines of code. Not developer satisfaction. Not whether AI generated something impressive.
Time to complete well-defined, quality-approved work.
Then test whether AI helps or hurts in your context. With your developers. On your codebase. For your tasks.
Don't assume results transfer. Don't trust your intuition. Don't believe vendor promises.
Measure.
The gap between what you think is happening and what's actually happening costs 19% of your engineering capacity.
Source: Becker, J., Rush, N., Barnes, B., & Rein, D. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv:2507.09089v2
