So the word is actually semantically very close to "bug"! I guess we could still be using it, but the word's just too long for something that is one of the most used terms in software development.
At this point, picking that specific word is not at all a random quirk, since it uses the word literally the way it was originally intended.
GitHub is a long-running business with a mature software stack, running into scaling issues as it moves to Azure and becomes Microsoft-ified. Claude is a new company in a new market with an extremely fast-growing userbase, running relatively novel AI infrastructure with a business model they are still figuring out.
Not trying to argue with you, but GitHub (the core product) seems to have been in maintenance mode since the acquisition.
I couldn’t find any public data on GitHub, but Google Trends shows a sharp increase starting in December.
That could be due in part to people complaining about the outages, but more people than ever are writing code with AI.
Hence the parallel to Eternal September – code volume is up, quality is down, and programming is never going to return to how it was (difficult for “normal” people to interface with).
This is anecdotal, but I've found more success when solving the task first and then returning it as JSON in a separate LLM call[0].
Running a single non-reasoning LLM call from source data (text/image/audio in your diagram) to structured JSON seems fragile with the current state of LLMs.
You're essentially asking the model to do two tasks in one pass: parse the input and then format the output. It's amazing that it works a lot of the time, but it's reasonable to assume it won't all of the time.
(As a human, when I'm filling out a complex form, I'll often jump around the document)
Curious how the benchmarks change when you add an intermediary representation, either via reasoning or an additional LLM call. I'd also love to see a comparison with BAML[1].
[0] In my experience, we were using structured outputs as part of an agentic state machine, where the JSON contained code snippets (html/js/py/etc.). In the cases where we first prompted the model for the code and then wrapped it in JSON, we saw much higher quality/success than when asking for JSON straight away.
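A minimal sketch of that two-pass approach (all names here are illustrative, not from any particular SDK). `callLLM` stands in for whatever client you use; injecting it keeps the pipeline testable without a live model:

```swift
import Foundation

// Pass 1: let the model focus entirely on solving the task.
// Pass 2: a separate, narrower call that only wraps the answer in JSON.
func solveThenWrap(task: String,
                   callLLM: (String) -> String) throws -> [String: Any] {
    let solution = callLLM("Solve this task and return only the answer:\n\(task)")
    let wrapped = callLLM("Return this as JSON of the form {\"answer\": \"...\"}:\n\(solution)")
    guard let data = wrapped.data(using: .utf8),
          let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    else {
        // Second call returned something that isn't a JSON object.
        throw NSError(domain: "StructuredOutput", code: 1)
    }
    return json
}
```

The point is the separation: the first prompt carries no formatting constraints, and the JSON-only second call is simple enough to fail far less often.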
I think it was tor.com that last year had a story where the newbie hired for the corporate HR dept ended up being the last human left after all others were replaced.
Can you elaborate on “boots the app on a simulator or macOS, runs UI automation to verify behavior”
Does this handle screen captures similar to Playwright for web?
I built an app with Codex recently (to control codex/cc remotely, funnily enough) and without any skills/plugins, it was booting the simulator and running tests to verify something(?)
It seemed mostly to ensure that the app didn’t crash in certain scenarios, but it could by no means “see” what was on the screen.
I still had to do all the manual validation myself, mostly around perf/touch targets.
Curious if your tool does that or if there’s another solution out there?
Not the author but it likely means running automated UI tests in the sim, yes. This involves running the app and programmatically selecting and sending interaction events.
Your previous experience was probably the agent running regular unit tests, which obviously don’t need a UI environment but mostly *do* need an iOS runtime, which is why it needs to boot the simulator.
An idiosyncrasy of the way unit tests are executed in Xcode is that they run inside the actual app target, so while running unit tests you’ll also see the usual app initialisation and background tasks running at the same time. It’s a good idea to use compiler directives or launch arguments to disable the usual app setup in the App or App Delegate. Why this isn’t a built-in option is beyond me, but it’s definitely confusing behaviour when you’re just running isolated tests!
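A sketch of the launch-argument workaround described above. `-skip-app-setup` is a hypothetical argument you would add to your test scheme; the `XCTestCase` lookup detects an injected test runtime:

```swift
import Foundation

// Returns false when the process is running under a test scheme,
// so the App / App Delegate can skip analytics, networking, etc.
func shouldRunAppSetup(arguments: [String] = ProcessInfo.processInfo.arguments) -> Bool {
    // Hypothetical flag set in the scheme's "Arguments Passed On Launch".
    if arguments.contains("-skip-app-setup") { return false }
    // XCTest injects this class into the app process when tests run.
    if NSClassFromString("XCTestCase") != nil { return false }
    return true
}

// In your App init or application(_:didFinishLaunchingWithOptions:):
// if shouldRunAppSetup() { configureAnalytics(); startBackgroundWork() }
```

A compiler directive (`#if DEBUG` or a custom flag on the test configuration) achieves the same thing at build time instead of run time.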
> GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
This might be great if it translates to agentic engineering and not just benchmarks.
It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not fewer.
Maybe more interesting is that they’ve used Codex to improve model inference latency. IIRC this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.
With Opus it’s hard to tell what was due to the tokenizer changes. Maybe using more tokens for the same prompt means the model effectively thinks more?
I had always assumed there was some previous use of the term, neat!
[0]https://en.wikipedia.org/wiki/Gremlin