So the word is actually semantically very close to "bug"! I guess we could still be using it, but the word's just too long for something that is one of the most used terms in software development.
At this point, picking that specific word is not at all a random quirk, since it uses the word literally the way it was originally intended.
GitHub is a long-running business with a mature software stack, running into scaling issues as it moves to Azure and becomes Microsoft-ified. Claude is a new company in a new market with an extremely fast-growing userbase, running relatively novel AI infrastructure with a business model they are still figuring out.
Not trying to argue with you, but GitHub (the core product) seems to have been in maintenance mode since the acquisition.
I couldn’t find any public data on GitHub, but Google Trends shows a sharp increase starting in December.
That could be due in part to people complaining about the outages, but more people than ever are writing code with AI.
Hence the parallel to Eternal September – code volume is up, quality is down, and programming is never going to return to how it was (difficult for “normal” people to interface with).
This is anecdotal, but I've found more success when solving the task first and then returning it as JSON in a separate LLM call[0].
Running a single non-reasoning LLM call from source data (text/image/audio in your diagram) to structured JSON seems fragile with the current state of LLMs.
You're essentially asking the model to do two tasks in one pass: parse the input and then format the output. It's amazing that it works a lot of the time, but it's reasonable to assume it won't all of the time.
(As a human, when I'm filling out a complex form, I'll often jump around the document)
Curious how the benchmarks change when you add an intermediary representation, either via reasoning or an additional LLM call. I'd also love to see a comparison with BAML[1].
[0] In my experience, we were using structured outputs as part of an agentic state machine, where the JSON contained code snippets (html/js/py/etc.). In the cases where we first prompted the model for the code and then wrapped it in JSON, we saw much higher quality/success than when asking for JSON straight away.
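A minimal sketch of that two-pass approach (all names here are illustrative, not from any particular SDK). `callLLM` stands in for whatever client you use; injecting it keeps the pipeline testable without a live model:

```swift
import Foundation

// Pass 1: let the model focus entirely on solving the task.
// Pass 2: a separate, narrower call that only wraps the answer in JSON.
func solveThenWrap(task: String,
                   callLLM: (String) -> String) throws -> [String: Any] {
    let solution = callLLM("Solve this task and return only the answer:\n\(task)")
    let wrapped = callLLM("Return this as JSON of the form {\"answer\": \"...\"}:\n\(solution)")
    guard let data = wrapped.data(using: .utf8),
          let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    else {
        // Second call returned something that isn't a JSON object.
        throw NSError(domain: "StructuredOutput", code: 1)
    }
    return json
}
```

The point is the separation: the first prompt carries no formatting constraints, and the JSON-only second call is simple enough to fail far less often.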
I think it was tor.com that last year had a story where the newbie hired for the corporate HR dept ended up being the last human left after all others were replaced.
Can you elaborate on “boots the app on a simulator or macOS, runs UI automation to verify behavior”
Does this handle screen captures similar to Playwright for web?
I built an app with Codex recently (to control codex/cc remotely, funnily enough) and without any skills/plugins, it was booting the simulator and running tests to verify something(?)
It seemed mostly to ensure that the app didn’t crash in certain scenarios, but it could by no means “see” what was on the screen.
I still had to do all the manual validation myself, mostly around perf/touch targets.
Curious if your tool does that or if there’s another solution out there?
Not the author but it likely means running automated UI tests in the sim, yes. This involves running the app and programmatically selecting and sending interaction events.
Your previous experience was probably the agent running regular unit tests, which obviously don’t need a UI environment but mostly *do* need an iOS runtime, which is why it needs to boot the simulator.
An idiosyncrasy of the way unit tests are executed in Xcode is that they run inside the actual app target, so while running unit tests you’ll also see the usual app initialisation and background tasks running at the same time. It’s a good idea to use compiler directives or launch arguments to disable the usual app setup in the App or App Delegate. Why this isn’t a built-in option is beyond me, but it’s definitely confusing behaviour when you’re just running isolated tests!
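A sketch of the launch-argument workaround described above. `-skip-app-setup` is a hypothetical argument you would add to your test scheme; the `XCTestCase` lookup detects an injected test runtime:

```swift
import Foundation

// Returns false when the process is running under a test scheme,
// so the App / App Delegate can skip analytics, networking, etc.
func shouldRunAppSetup(arguments: [String] = ProcessInfo.processInfo.arguments) -> Bool {
    // Hypothetical flag set in the scheme's "Arguments Passed On Launch".
    if arguments.contains("-skip-app-setup") { return false }
    // XCTest injects this class into the app process when tests run.
    if NSClassFromString("XCTestCase") != nil { return false }
    return true
}

// In your App init or application(_:didFinishLaunchingWithOptions:):
// if shouldRunAppSetup() { configureAnalytics(); startBackgroundWork() }
```

A compiler directive (`#if DEBUG` or a custom flag on the test configuration) achieves the same thing at build time instead of run time.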
> GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.
This might be great if it translates to agentic engineering and not just benchmarks.
It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not fewer.
Maybe more interesting is that they’ve used Codex to improve model inference latency. IIRC this is a new (expectedly larger) pretrain, so it’s presumably slower to serve.
With Opus it’s hard to tell what was due to the tokenizer changes. Maybe using more tokens for the same prompt means the model effectively thinks more?
I had always assumed there was some previous use of the term, neat!
[0]https://en.wikipedia.org/wiki/Gremlin