>a project manager vibe-coded the change without thinking it through at all
The PMs vibe-coding and having no idea what they're doing isn't even the main issue (although it is pretty bad).
The main issue is: how are the actual engineers supposed to "review" the slop? They probably report to the same PM, or sit below them in the org chart, and might be evaluated by them. Not just at MS, but at any company.
Such a conflict of interest would be detrimental to quality anywhere. You wouldn't build a bridge like this, and you shouldn't build software like this either.
Maybe running additional inference on all sessions to detect OpenClaw usage would cost more than the detection would save in the first place (which is the original goal). I also suspect the Claude Code team is just a regular software team without immediate access to ML pipelines (or the competence to run them) that would let them quickly build proper abuse-detection systems with extensive testing (to avoid false positives, which people would also complain about), and they're under pressure from management to do something right now, so a regex is all they can manage within those constraints.
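For reference, a regex-only check looks something like this. The pattern and session format are hypothetical, a minimal sketch of what a team under pressure could ship fast, not Anthropic's actual detection logic:

    import re

    # Hypothetical marker; real detection would need far more signals.
    SUSPICIOUS = re.compile(r"openclaw", re.IGNORECASE)

    def flag_session(transcript: str) -> bool:
        # Cheap substring match: no extra inference cost, but it flags
        # anyone who merely *mentions* the tool (a false positive).
        return bool(SUSPICIOUS.search(transcript))

    print(flag_session("User-Agent: OpenClaw/1.2"))        # True
    print(flag_session("What do you think of OpenClaw?"))  # True, wrongly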
The benchmark is strange: it reports single-run results (the author acknowledges this is unreliable) and uses older models like GPT-4o or Opus 4 (even though the benchmark is from 2026).
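To put a number on the single-run issue: with a hypothetical 50-task benchmark and a model whose true per-task success rate is 60%, individual runs swing by several points from sampling noise alone:

    import random

    random.seed(0)
    TASKS = 50          # hypothetical benchmark size
    TRUE_RATE = 0.6     # hypothetical per-task success probability

    scores = [
        sum(random.random() < TRUE_RATE for _ in range(TASKS)) / TASKS
        for _ in range(10)
    ]
    print(sorted(f"{s:.0%}" for s in scores))  # single-run scores scatter
                                               # over a ~15-point range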
>The short answer is that variable names are one of the things that confuses LLMs rather than helps them. Unlike with humans, names undermine a model's efforts to keep track of state over larger scales. Models confuse similarly named variables in different parts of the codebase easily
So I wonder: doesn't this apply to function names too, which the author keeps? I've seen LLMs use the wrong functions/classes as well.
I think a proper harness, an LSP, and tests already solve everything Vera is trying to solve. They mostly cite research from 2021, before coding harnesses and agentic loops were a thing, back when people were basically trying to one-shot with relatively weak models (by modern standards).
The only way the author could have come up with that rationale is if he doesn't understand what a token is, what attention is, and how coding agents work.
Tokens combine multiple characters into a single vector. Attention computes similarity scores between vectors. This means you'd want each variable name to be a single token, so the LLM can instantly tell that two occurrences refer to the same variable. If everything is numbered, the attention mechanism will attend every first parameter to every first parameter in every function, which means the numbering scheme would have to be randomized instead of always starting at zero.
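You can see the token side of this directly. A minimal sketch using tiktoken's cl100k_base encoding (the exact splits vary by tokenizer, and the names are just examples):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for name in ["tax_rate", "retry_limit", "v0", "v1"]:
        ids = enc.encode(name)
        print(f"{name!r}: {len(ids)} token(s) -> {ids}")

    # A numbered name like "v0" tokenizes identically in every function
    # it appears in, so attention sees the same keys across unrelated
    # scopes; distinct descriptive names at least stay distinguishable.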
Coding agents are now capable of using tools, including text search, which means the ability to grep for a specific variable name is extremely helpful. By using numbering, the author of the language has given himself the burden of relying entirely on LSPs rather than on innate model abilities that operate at the text level.
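A toy illustration of why that matters for search; the two-file "codebase" is made up:

    # Hypothetical two-file codebase as plain strings.
    codebase = {
        "billing.py": "def charge(v0): return v0 * tax_rate",
        "report.py":  "def render(v0): return len(v0)",
    }

    def search(needle: str) -> list[str]:
        return [path for path, src in codebase.items() if needle in src]

    print(search("tax_rate"))  # ['billing.py'] -- unique name, precise hit
    print(search("v0"))        # ['billing.py', 'report.py'] -- numbered
                               # names match everywhere, so search tells
                               # the agent nothing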
So yeah, on a textual level, the language is designed for an era of LLMs that has been obsolete for a long time.
We're planning to do the same thing: buy something like 8xH100 and run all coding there. The CTO has almost agreed to find the budget for it, but I need to make sure there are no risks before we buy (i.e. that it's a viable/usable setup for professional AI-assisted coding).
Can you share which models you run and find best-performing for this setup? That would help a lot. I already run a smaller AI server in the office, but only 32b models fit there. I have experience optimizing inference; I'm just interested in which models you think are great on 8xH100 for coding, and I'll figure out the details of how to fit them :)
8x H100 80GB doesn't give you enough memory to run the latest 1T+ parameter models (especially at the context-window lengths needed to be competitive with the frontier models).
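Back-of-envelope arithmetic (weights only, ignoring KV cache and activations):

    GPUS, VRAM_GB = 8, 80
    total_vram = GPUS * VRAM_GB            # 640 GB

    params = 1e12                          # a 1T-parameter model
    for bits, label in [(16, "bf16"), (8, "int8"), (4, "int4")]:
        weights_gb = params * bits / 8 / 1e9
        print(f"{label}: ~{weights_gb:,.0f} GB weights, "
              f"fits in {total_vram} GB: {weights_gb < total_vram}")

    # bf16 ~2,000 GB and int8 ~1,000 GB don't fit; int4 ~500 GB squeezes
    # in, but long-context KV cache then competes for the remaining ~140 GB.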
I've been using Kimi 2.5/2.6 for the past 2 weeks and it's really not far off the OpenAI and Claude models. I am a coder, so it's not all vibes, but I'm definitely more in "spec to code" mode than "edit this file for me" mode, and it copes just fine. It needs a bit more supervision than the frontier models, but it's also significantly cheaper. If I were Anthropic I'd be shitting myself; their prices are going to 10x over the next 2 years.
Check out Verda: you can rent whatever super-powerful GPU clusters you need in 10-minute increments. Deploy any open-weight model using SGLang and away you go.
> What if an executive makes a wrong business decision
I jokingly tell students, "We all know executives are gonna make bad decisions no matter what the data says. Might as well give them the random numbers more quickly."
So what if some numbers in the report are in actuality an order of magnitude or two outside what you'd consider reasonable, because something went wrong, but the AI agent reports something that looks normal?
Interesting; my assumption used to be that models over-edit when they're run with optimizations in the attention blocks (quantization, Gated DeltaNet, sliding windows, etc.), i.e. they can't always reconstruct the original code precisely and end up reinventing some bits. Couldn't that be one of the reasons too?
From what I understand, ~30b is enough "intelligence" to make coding/reasoning etc. work in general. Above ~30b it's less about intelligence and more about memorization. Larger models fail less and one-shot more often because they can memorize more APIs (documentation, examples, etc.). Also, in my experience, if a task is ambiguous, Sonnet has a better "intuition" for my intent, probably also because of memorization: it has "access" to more repositories in its compressed knowledge, so it can infer my intent more accurately.
- a project manager vibe-coded the change without thinking it through at all
- the PR was reviewed by an LLM
- an actual engineer gave LGTM without really reviewing the changes, trusting the LLM
Did I get this right?