>a project manager vibe-coded the change without thinking it through at all
The PMs vibe-coding and having no idea what they're doing isn't even the main issue (although it is pretty bad).
The main issue is: how are the actual engineers supposed to "review" the slop? They probably report to the same PM, or sit below them in the org chart, and might be evaluated by them. Not just at MS, but at any company.
Such a conflict of interest would be detrimental to quality anywhere. You wouldn't build a bridge like this, and you shouldn't build software like this either.
Maybe running additional inference on all sessions to detect OpenClaw usage would cost more than the detection would save in the first place (which is the original goal). I also suspect the Claude Code team is just a regular software team without immediate access to ML pipelines (or the competence to run them) that would let them quickly build proper abuse-detection systems with extensive testing (to avoid false positives, which people would also complain about), and they're under pressure from management to do something right now, so a regex is all they can manage within those constraints.
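For reference, a regex-only check looks something like this. The pattern and session format are hypothetical, a minimal sketch of what a team under pressure could ship fast, not Anthropic's actual detection logic:

    import re

    # Hypothetical marker; real detection would need far more signals.
    SUSPICIOUS = re.compile(r"openclaw", re.IGNORECASE)

    def flag_session(transcript: str) -> bool:
        # Cheap substring match: no extra inference cost, but it flags
        # anyone who merely *mentions* the tool (a false positive).
        return bool(SUSPICIOUS.search(transcript))

    print(flag_session("User-Agent: OpenClaw/1.2"))        # True
    print(flag_session("What do you think of OpenClaw?"))  # True, wrongly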
The benchmark is strange: it reports single-run results (the author acknowledges this is unreliable) and uses older models like GPT-4o or Opus 4 (even though the benchmark is from 2026).
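To put a number on the single-run issue: with a hypothetical 50-task benchmark and a model whose true per-task success rate is 60%, individual runs swing by several points from sampling noise alone:

    import random

    random.seed(0)
    TASKS = 50          # hypothetical benchmark size
    TRUE_RATE = 0.6     # hypothetical per-task success probability

    scores = [
        sum(random.random() < TRUE_RATE for _ in range(TASKS)) / TASKS
        for _ in range(10)
    ]
    print(sorted(f"{s:.0%}" for s in scores))  # single-run scores scatter
                                               # over a ~15-point range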
>The short answer is that variable names are one of the things that confuses LLMs rather than helps them. Unlike with humans, names undermine a model's efforts to keep track of state over larger scales. Models confuse similarly named variables in different parts of the codebase easily
So I wonder: doesn't this apply to function names too, which the author keeps? I've seen LLMs use the wrong functions/classes as well.
I think a proper harness, an LSP, and tests already solve everything Vera is trying to solve. They mostly cite research from 2021, before coding harnesses and agentic loops were a thing, back when people were basically trying to one-shot with relatively weak models (by modern standards).
The only way the author could have come up with that rationale is if he doesn't understand what a token is, what attention is, and how coding agents work.
Tokens combine multiple characters into a single vector. Attention computes similarity scores between vectors. This means you'd want each variable name to be a single token, so the LLM can instantly tell that two occurrences refer to the same variable. If everything is numbered, the attention mechanism will attend every first parameter to every first parameter in every function, which means the numbering scheme would have to be randomized instead of always starting at zero.
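You can see the token side of this directly. A minimal sketch using tiktoken's cl100k_base encoding (the exact splits vary by tokenizer, and the names are just examples):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for name in ["tax_rate", "retry_limit", "v0", "v1"]:
        ids = enc.encode(name)
        print(f"{name!r}: {len(ids)} token(s) -> {ids}")

    # A numbered name like "v0" tokenizes identically in every function
    # it appears in, so attention sees the same keys across unrelated
    # scopes; distinct descriptive names at least stay distinguishable.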
Coding agents are now capable of using tools, including text search, which means the ability to grep for a specific variable name is extremely helpful. By using numbering, the author of the language has given himself the burden of relying entirely on LSPs rather than on innate model abilities that operate at the text level.
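A toy illustration of why that matters for search; the two-file "codebase" is made up:

    # Hypothetical two-file codebase as plain strings.
    codebase = {
        "billing.py": "def charge(v0): return v0 * tax_rate",
        "report.py":  "def render(v0): return len(v0)",
    }

    def search(needle: str) -> list[str]:
        return [path for path, src in codebase.items() if needle in src]

    print(search("tax_rate"))  # ['billing.py'] -- unique name, precise hit
    print(search("v0"))        # ['billing.py', 'report.py'] -- numbered
                               # names match everywhere, so search tells
                               # the agent nothing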
So yeah, on a textual level, the language is designed for an era of LLMs that has been obsolete for a long time.
We're planning to do the same thing: buy something like 8xH100 and run all coding there. The CTO has almost agreed to find the budget for it, but I need to make sure there are no risks before we buy (i.e. that it's a viable/usable setup for professional AI-assisted coding).
Can you share which models you run and find best-performing for this setup? That would help a lot. I already run a smaller AI server in the office, but only 32b models fit there. I have experience optimizing inference; I'm just interested in which models you think are great on 8xH100 for coding, and I'll figure out the details of how to fit them :)
8x H100 80GB doesn't give you enough memory to run the latest 1T+ parameter models (especially at the context-window lengths needed to be competitive with the frontier models).
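Back-of-envelope arithmetic (weights only, ignoring KV cache and activations):

    GPUS, VRAM_GB = 8, 80
    total_vram = GPUS * VRAM_GB            # 640 GB

    params = 1e12                          # a 1T-parameter model
    for bits, label in [(16, "bf16"), (8, "int8"), (4, "int4")]:
        weights_gb = params * bits / 8 / 1e9
        print(f"{label}: ~{weights_gb:,.0f} GB weights, "
              f"fits in {total_vram} GB: {weights_gb < total_vram}")

    # bf16 ~2,000 GB and int8 ~1,000 GB don't fit; int4 ~500 GB squeezes
    # in, but long-context KV cache then competes for the remaining ~140 GB.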
I've been using Kimi 2.5/2.6 for the past 2 weeks and it's really not far off the OpenAI and Claude models. I am a coder, so it's not all vibes, but I'm definitely more in "spec to code" mode than "edit this file for me" mode, and it copes just fine. It needs a bit more supervision than the frontier models, but it's also significantly cheaper. If I were Anthropic I'd be shitting myself; their prices are going to 10x over the next 2 years.
Check out Verda: you can rent whatever super-powerful GPU clusters you need in 10-minute increments. Deploy any open-weight model using SGLang and away you go.
> What if an executive makes a wrong business decision
I jokingly tell students, "We all know executives are gonna make bad decisions no matter what the data says. Might as well give them the random numbers more quickly."
So what if some numbers in the report are in actuality an order of magnitude or two outside what you'd consider reasonable, because something went wrong, but the AI agent reports something that looks normal?
Interesting; my assumption used to be that models over-edit when they're run with optimizations in the attention blocks (quantization, Gated DeltaNet, sliding windows, etc.), i.e. they can't always reconstruct the original code precisely and end up reinventing some bits. Couldn't that be one of the reasons too?
From what I understand, ~30b is enough "intelligence" to make coding/reasoning etc. work in general. Above ~30b it's less about intelligence and more about memorization. Larger models fail less and one-shot more often because they can memorize more APIs (documentation, examples, etc.). Also, in my experience, if a task is ambiguous, Sonnet has a better "intuition" for my intent, probably also because of memorization: it has "access" to more repositories in its compressed knowledge, so it can infer my intent more accurately.
- a project manager vibe-coded the change without thinking it through at all
- the PR was reviewed by an LLM
- an actual engineer gave LGTM without really reviewing the changes, trusting the LLM
Did I get this right?