Hacker News | narush's comments

Qualitatively, we don't see a drop in PR quality between the AI-allowed and AI-disallowed conditions in the study; the devs who participate are generally excellent, know their repositories' standards super well, and aren't really into the 'put up a bad PR' vibe -- the median review time on the PRs in the study is about a minute.

Developers do spend time totally differently, though -- this is a great callout! On page 10 of the paper [1], you can see a breakdown of how developers spend time with AI vs. without: in general, when these devs have AI, they spend a smaller % of time writing code, and a larger % of time working with AI (which... makes sense).

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf


Noting that most of our statistical power comes from the number of tasks that developers complete; it's 246 total completed issues over the course of this study -- developers do about 15 issues each (7.5 with AI and 7.5 without AI) on average.
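For intuition on why the issue count is what matters, here's a rough simulation -- the effect size, noise model, and per-arm split are assumptions for illustration, not the paper's actual statistical model:

```python
# Rough power-style simulation: with ~246 completed issues split between
# conditions (~123 per arm), how often does a simple mean comparison pick
# up an assumed ~19% slowdown? Noise level is made up, not from the paper.
import random

random.seed(0)

def ai_arm_mean_exceeds_control(n_per_arm=123, slowdown=1.19, noise_sd=0.5):
    # Log-normal-ish task times: baseline scale 1.0, AI arm scaled up.
    no_ai = [random.lognormvariate(0.0, noise_sd) for _ in range(n_per_arm)]
    ai = [slowdown * random.lognormvariate(0.0, noise_sd)
          for _ in range(n_per_arm)]
    return sum(ai) / n_per_arm > sum(no_ai) / n_per_arm

hits = sum(ai_arm_mean_exceeds_control() for _ in range(500))
print(f"AI arm mean exceeded control in {hits}/500 simulations")
```

This only checks mean ordering, not a significance test, but it shows why hundreds of task-level observations give usable power even with very noisy per-task times.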


Did you compare the variance within individuals (due to treatment) to the variance between individuals (due to other stuff)?
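To make the question concrete, here's a toy sketch of the two variance components I mean -- the developers and hours below are entirely made up:

```python
# Hypothetical sketch: compare within-developer variance (treatment effect
# plus task noise) to between-developer variance, via a simple one-way
# decomposition. Task times in hours are invented for three fake devs.
import statistics

times = {
    "dev_a": {"ai": [2.4, 3.1, 2.8], "no_ai": [2.0, 2.6, 2.3]},
    "dev_b": {"ai": [5.0, 4.2, 4.6], "no_ai": [3.9, 4.4, 4.1]},
    "dev_c": {"ai": [1.2, 1.5, 1.1], "no_ai": [1.0, 1.3, 0.9]},
}

# Between-developer variance: variance of each developer's overall mean.
dev_means = [statistics.mean(d["ai"] + d["no_ai"]) for d in times.values()]
between_var = statistics.variance(dev_means)

# Within-developer variance: pooled squared deviation of each task time
# from that developer's own mean.
sq_devs = []
for d in times.values():
    all_t = d["ai"] + d["no_ai"]
    m = statistics.mean(all_t)
    sq_devs.extend((t - m) ** 2 for t in all_t)
within_var = sum(sq_devs) / (len(sq_devs) - 1)

print(f"between-dev variance: {between_var:.3f}")
print(f"within-dev variance:  {within_var:.3f}")
```

If the between-developer component dominates like this, a within-subject design (each dev as their own control) buys a lot of precision.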


Hey Simon -- thanks for the detailed read of the paper - I'm a big fan of your OS projects!

Noting a few important points here:

1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.

2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, the only concern most external reviewers had about experience was about prompting -- as prompting was considered the primary skill. In general, the standard wisdom was (and is) that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.

3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (not because AI got better, but because the no-AI baseline got worse). In other words, we're sorta between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!

4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.

5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.

In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).

I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!

(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)


Figure 6, which breaks down the time spent on different tasks, is very informative -- it suggests:

- 15% less active coding
- 5% less testing
- 8% less research and reading
- 4% more idle time
- 20% more AI interaction time

The 28% less coding/testing/research is why developers reported 20% less work. You might be spending 20% more time overall "working" while you are really idle 5% more time and feel like you've worked less because you were drinking coffee and eating a sandwich between waiting for the AI and reading AI output.

I think the AI skill boost comes from having workflows that let you shave off half that git-ops time and cut an extra 5% off coding. If you also cut the idle/waiting time, do more prompting of parallel agents, and do a bit more testing, then you really are a 2x dev.


> You might be spending 20% more time overall "working" while you are really idle 5% more time and feel like you've worked less because you were drinking coffee and eating a sandwich between waiting for the AI and reading AI output.

This is going to be interesting long-term. Realistically people don't spend anywhere close to 100% of time working and they take breaks after intense periods of work. So the real benefit calculation needs to include: outcome itself, time spent interacting with the app, overlap of tasks while agents are running, time spent doing work over a long period of time, any skill degradation, LLM skills, etc. It's going to take a long time before we have real answers to most of those, much less their interactions.


I just realized the figure shows the time breakdown as a percentage of total time. It would be more useful to show absolute time (hours) for those side-by-side comparisons, since the implied hours would boost the AI bars' height by ~18%.
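To illustrate the percent-vs-absolute point, a quick back-of-envelope conversion -- the category shares and the 2-hour baseline task are made up; the ~19% slowdown is the study's headline number:

```python
# Back-of-envelope: percent-of-total bars hide absolute time differences,
# because the AI-condition total is larger. All shares are hypothetical.
baseline_hours = 2.0
ai_hours = baseline_hours * 1.19  # study's headline ~19% slowdown

shares = {  # hypothetical fraction of total time per category
    "coding":  {"no_ai": 0.45, "ai": 0.30},
    "idle":    {"no_ai": 0.05, "ai": 0.09},
    "ai_chat": {"no_ai": 0.00, "ai": 0.20},
}

for cat, s in shares.items():
    no_ai_abs = s["no_ai"] * baseline_hours
    ai_abs = s["ai"] * ai_hours
    print(f"{cat:8s} no-AI {no_ai_abs:.2f} h vs AI {ai_abs:.2f} h")
```

With these made-up shares, a category whose percentage share shrinks can still take nearly as long in absolute hours, which is exactly what equal-height percentage bars obscure.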


There's additional breakdown per-minute in the appendix -- see appendix E.4!


Thanks for the detailed reply! I need to spend a bunch more time with this I think - above was initial hunches from skimming the paper.


Sounds great. Looking forward to hearing more detailed thoughts -- my email's in the paper :)


Really interesting paper, and thanks for the follow-on points.

The over-optimism is indeed a really important takeaway, and agreed that it's not tool-dependent.


> Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.

I think one would have to compare the difficulty level of tasks.

I speculate that on easy tasks, LLMs can do a great job based on their training data alone, so you'd experience a speedup regardless of your prompt engineering skill level. But on large codebases and for complex tasks, an LLM cannot stand on its own legs, and the differentiator becomes the quality of the prompt.

I think you'd need not only expert programmers, but expert programmers who have become expert prompt engineers (you would need some kind of extensive system prompt describing how the large codebase works), and those don't really exist yet, I think.


Were participants given time to customize their Cursor settings? In my experience tool/convention mismatch kills Cursor's productivity - once it gets going with a wrong library or doesn't use project's functions I will almost always reject code and re-prompt. But, especially for large projects, having a well-crafted repo prompt mitigates most of these issues.


With today's state of LLMs and agents, they're still not good for all tasks. It took me a couple of weeks to correctly adjust what I can ask and what I can expect. As a result, I don't use Claude Code for everything, and I think I'm now able to better pick the right task, and the right size of task, to give it. These adjustments depend on what you are doing, and on the complexity and maturity of the project at play.

Very often, I have entire tasks that I can't offload to the Agent. I won't say I'm 20x more productive, it's probably more in the range of 15% to 20% (but I can't measure that obviously).


Using devs working in their own repository is certainly understandable, but it might also explain the results in part. Personally I barely use AI for my own code, while on the other hand, when working on some one-off script or an unfamiliar codebase, I get a lot more value from it.


Your next study should be very experienced devs working in new or early-life repos, where AI shines for refactoring and structured code suggestions, not to mention documentation and tests.

It’s much more useful getting something off the ground than maintaining a huge codebase.


Did each developer do a large enough mix of AI/non-AI tasks, in varying orders, that you have any hints in your data whether the "AI penalty" grew or shrunk over time?


You can see this analysis in the factor analysis of "Below-average use of AI tools" (C.2.7) in the paper [1], which we mark as an unclear effect.

TLDR: over the first 8 issues, developers do not appear to get majorly less slowed down.

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf


Thanks, that's great!

But: if all developers did 136 AI-assisted issues, why only analyze excluding the 1st 8, rather than, say, the first 68 (half)?


Sorry, this is the first 8 issues per-developer!


Our largest funding was through The Audacious Project -- you can see an announcement here: https://metr.org/blog/2024-10-09-new-support-through-the-aud...

Per our website, “To date, April 2025, we have not accepted compensation from AI companies for the evaluations we have conducted.” You can check out the footnote on this page: https://metr.org/donate


This is really disingenuous when you also say that OpenAI and Anthropic have provided you with access and compute credits (on https://metr.org/about).

Not all payment is cash. Compute credits are still by all means compensation.


Those are compute credits that are directly spent on the experiment itself. It's no more "compensation" than a chemistry researcher being "compensated" with test tubes.


> Those are compute credits that are directly spent on the experiment itself.

You're extrapolating; it doesn't say that anywhere.

> It's no more "compensation" than a chemistry researcher being "compensated" with test tubes.

Yes, that's compensation too. Thanks for contributing another example. Here's another one: it's no more compensation than a software engineer being compensated with a new computer.

Actually the situation here is way worse than your example. Unless the chemistry researcher is commissioned by Big Test Tube Corp. to conduct research on the outcome of using their test tubes, there's no conflict of interest here. But there is an obvious conflict of interest on AI research being financed by credits given by AI companies to use their own AI tools.


While it would be an ethical concern if they _hadn't_ disclosed it, it's not compensation; it was used _as part of the study_.


Are you willing to be compensated with compute credits for your job?

Such companies spit out "credits" all over the place in order to gain traction and establish themselves. I remember when cloud providers gave VPS credits to startups like they were peanuts. To me, it really means absolutely nothing.


I wouldn't do my job for $10, but if somehow someone did pay me $10 to do something, I wouldn't claim I wasn't compensated.

In-kind compensation is still compensation.


> Are you willing to be compensated with compute credits for your job?

Well, yes? I use compute for some personal projects so I would be absolutely fine if a part of my compensation was in compute credits.

As a company, even more so.


Is it "really" disingenuous, or is it just a misinterpretation of what it means to be "compensated for"? Seems more like quibbling to me.


I was actually being kind by saying it's disingenuous. I think it's an outright lie.


Hey HN, study author here. I'm a long-time HN user -- and I'll be in the comments today to answer questions/comments when possible!

If you're short on time, I'd recommend just reading the linked blogpost or the announcement thread here [1], rather than the full paper.

[1] https://x.com/METR_Evals/status/1943360399220388093


Hey I just wanted to say this is one of the better studies I've seen - not clickbaity, very forthright about what is being claimed, and presented in such an easy-to-digest format. Thanks so much for doing this.


Thanks for the kind words!


I'll just say that the methodology of the paper and the professionalism with which you are answering us here is top notch. Great work.


Thank you!


(I read the post but not paper.)

Did you measure subjective fatigue as one way to explain the misperception that AI was faster? As a developer-turned-manager I like AI because it's easier when my brain is tired.


We attempted to! We explore this more in the section Trading speed for ease (C.2.5) in the paper (https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf).

TLDR: mixed evidence that AI makes the work less effortful, from both quantitative and qualitative reports. Unclear effect.


It's good to know that Claude 3.7 isn't enough to build Skynet!


Was any attention paid to whether the tickets being implemented with AI assistance were an appropriate use case for AI?

If the instruction is just "implement this ticket with AI", then that's very realistic in that it's how management often tries to operate, but it's also likely to be quite suboptimal. There are ways to use AI that help a lot, and other ways that hurt more than it helps.

If your developers had sufficient experience with AI to tell the difference, then they might have compensated for that, but reading the paper I didn't see any indication of that.


The instructions given to developers were not just "implement with AI" -- rather, they could use AI if they deemed it would be helpful, but indeed did _not need to use AI if they didn't think it would be helpful_. In about ~16% of labeled screen recordings where developers were allowed to use AI, they chose to use no AI at all!

That being said, we can't rule out that the experiment drove them to use more AI than they would have outside of the experiment (in a way that made them less productive). You can see more in section "Experimentally driven overuse of AI (C.2.1)" [1]

[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf


Could you either release the dataset (raw but anonymized) for independent statistical evaluation, or at least add each dev's absolute time per task to the paper? I'm curious what the absolute times of each dev with/without AI were, and whether the one guy with lots of Cursor experience was actually faster than the rest or just a slow typer getting a big boost out of LLMs.

Also, cool work -- very happy to see actually good evaluations instead of just vibes, or observational studies that don't account for the Hawthorne effect.


Yep, sorry, meant to post this somewhere but forgot in final-paper-polishing-sprint yesterday!

We'll be releasing anonymized data and some basic analysis code to replicate core results within the next few weeks (probably next week, depending).

Our GitHub is here (http://github.com/METR/) -- or you can follow us (https://x.com/metr_evals) and we'll probably tweet about it.


Cool, thanks a lot. Btw, I have a very tiny (50-100 audience) podcast where we try to give context to what we call the "muck" of AI discourse (trying to ground claims in what we would call objectively observable facts/evidence, and then _separately_ giving our own biased takes). If you would be interested to come on it and chat, my contact email is in my profile.


podcast link?


Does this reproduce for early/mid-career engineers who aren't at the top of their game?


How these results transfer to other settings is an excellent question. Previous literature would suggest speedup -- but I'd be excited to run a very similar methodology in those settings. It's already challenging as models + tools have changed!


I’ve replicated the OthelloGPT results mentioned in this paper personally -- and it definitely felt like the next-move-only accuracy metric was not everything. Indeed, the authors of the original paper knew this, and so further validated the world model by intervening in the model’s forward pass to directly manipulate the world model (and check the resulting change in valid move predictions).

I’d also recommend checking out Neel Nanda’s work on OthelloGPT, where he demonstrated the world model was actually linear: https://arxiv.org/abs/2309.00941


Hey Jeff, thanks for the kind words. I'd love to learn more about your experience and transition through that pain point -- shoot me an email at nate @ sagacollab . com if you want to chat.


Woof. Always the words you stare at the most that are wrong... will update that demo video when I get the chance, but might be a bit :)

Good thing Pyoneer generates test cases for the formulas it generates! No need to trust my spelling abilities -- your Excel file is the ultimate source of truth.


Thanks for the feedback. I've updated the landing page to prominently display this language -- see the "How it works" section.


Yep - this is a one-way process! You can think of it like an eject from Excel, in the best case.

The devs we've worked with so far have the goal of replacing the Excel process -- inheriting it from the team that runs it manually, and automating it fully in Python. From then on, changes to the process would run through a more traditional software-development lifecycle, as you would be editing code.

For these devs, this is a feature, not a bug! In Excel, version control, testing, and review are pretty much non-existent...

Cool username btw...


Ah, thank you. One idea (btw, love what you are doing): have you considered "deferring" the Python generation process, so there is an intermediate (possibly in-mem) layer that gives you CRUD access to the underlying DB (ahem, Excel)? Then you could target this to any lang/runtime/backend with performance tradeoffs etc. Bit like a language server?

