
TIL gremlins weren’t just a convenient explanation for mysterious mechanical failures in airplanes; that usage is the origin of the term ‘gremlin’ itself[0].

I had always assumed there was some previous use of the term, neat!

[0]https://en.wikipedia.org/wiki/Gremlin


So the word is actually semantically very close to "bug"! I guess we could still be using it, but the word's just too long for something that is one of the most used terms in software development.

At this point, picking that specific word is not a random quirk at all; it's using the word literally as it was originally intended.


Wow, fascinating. I’d have thought they were a lot older.

Between GitHub and Claude, it seems Eternal December[0][1] is upon us.

[0]I say December, because that's around the time the models got good enough that non-AI folks started to notice.

[1]https://en.wikipedia.org/wiki/Eternal_September


GitHub is a long-running business with a mature software stack, running into scaling issues while it moves to Azure and becomes Microsoft-ified. Claude is a new company in a new market with an extremely fast-growing userbase, running relatively novel AI infrastructure with a business model they are still figuring out.

I don't really blame Anthropic here.


Not trying to argue with you, but GitHub (the core product) seems to have been in maintenance mode since the acquisition.

I couldn’t find any public usage data for GitHub, but Google Trends shows a sharp increase in search interest starting in December.

That could be due in part to people complaining about the outages, but more people than ever are writing code with AI.

Hence the parallel to Eternal September: code volume is up, quality is down, and programming is never going to return to how it was (i.e. something “normal” people found difficult to interface with).


I use the OpenAI team plan whenever it's down, because it's down so much lol

This is anecdotal, but I've found more success when solving the task first and then returning it as JSON in a separate LLM call[0].

Running a single non-reasoning LLM call from source data (text/image/audio in your diagram) to structured JSON seems fragile with the current state of LLMs.

You're essentially asking the model to do two tasks in one pass: parse the input and then format the output. It's amazing that it works a lot of the time, but it's reasonable to assume it won't all of the time.

(As a human, when I'm filling out a complex form, I'll often jump around the document)

Curious how the benchmarks change when you add an intermediary representation, either via reasoning or an additional LLM call. I'd also love to see a comparison with BAML[1].

[0]In my experience we were using structured outputs as part of an agentic state machine, where the JSON contained code snippets (html/js/py/etc.). In the cases where we first prompted the model for the code, and then wrapped it in JSON, we saw much higher quality/success than asking for JSON straightaway.

[1]https://boundaryml.com/
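For illustration, here's a rough sketch of that two-pass approach in Swift, using a hypothetical callLLM helper (not any specific vendor SDK); the prompts and the "code"/"summary" keys are just examples:

    import Foundation

    // Hypothetical helper: sends a prompt to whichever model/SDK you use
    // and returns the raw text response. Not a real vendor API.
    func callLLM(_ prompt: String) async throws -> String {
        fatalError("wire this up to your model provider")
    }

    // Pass 1: solve the task with no format constraints.
    // Pass 2: a separate call only wraps the previous answer as JSON.
    func extractAsJSON(from source: String) async throws -> [String: Any] {
        let answer = try await callLLM(
            "Write the code snippet requested below.\n\n\(source)")
        let wrapped = try await callLLM(
            "Return the following as JSON with keys \"code\" and \"summary\":\n\(answer)")
        let object = try JSONSerialization.jsonObject(with: Data(wrapped.utf8))
        return object as? [String: Any] ?? [:]
    }

In practice, the second call tends to be much easier for the model, because it no longer has to reason about the problem, only about the schema.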


More like "people who wear shoes will forget how to run"[0]

[0]https://www.youtube.com/watch?v=7jrnj-7YKZE


It's so awesome they built this; I've wondered what an LLM trained only on pre-1950s data would look like!

Now we just need a voice model with the "transatlantic accent" -- ideally with the early 20th century radio effect


At what point is liability the only "job" left for humans?

I think it was tor.com that ran a story last year where a newbie hired into the corporate HR dept ended up being the last human left after all the others were replaced.

Ah, here we go, courtesy of google-ml: '"Human Resources" by Adrian Tchaikovsky, published on Reactor[...] https://reactormag.com/human-resources-adrian-tchaikovsky/ '


This is neat!

Can you elaborate on “boots the app on a simulator or macOS, runs UI automation to verify behavior”?

Does this handle screen captures similar to Playwright for web?

I built an app with Codex recently (to control codex/cc remotely, funnily enough) and without any skills/plugins, it was booting the simulator and running tests to verify something(?)

It seemed mostly to ensure that the app didn’t crash in certain scenarios, but it could by no means “see” what was on the screen.

I still had to do all the manual validation myself, mostly around perf/touch targets.

Curious if your tool does that or if there’s another solution out there?


Not the author but it likely means running automated UI tests in the sim, yes. This involves running the app and programmatically selecting and sending interaction events.

Your previous experience was probably the agent running regular unit tests, which obviously don’t need a UI environment but mostly *do* need an iOS runtime, which is why it needs to boot the simulator.

An idiosyncrasy of the way unit tests are executed in Xcode is that they run inside the actual host app, so while running unit tests you’ll also see any app initialisation and background tasks running at the same time. It’s quite a good idea to use compiler directives or launch arguments to disable the usual app setup in the App or App Delegate. Why this isn’t a built-in option is beyond me, but it’s definitely confusing behaviour when you’re just running isolated tests!
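A rough sketch of that kind of guard (a minimal SwiftUI example; MyApp and startBackgroundWork are placeholder names, and one common variant checks XCTest's environment variable in addition to a custom launch argument):

    import SwiftUI

    @main
    struct MyApp: App {
        init() {
            let info = ProcessInfo.processInfo
            // Skip the usual startup work when running under XCTest
            // (environment check) or when a test scheme passes a flag.
            let isTesting = info.environment["XCTestConfigurationFilePath"] != nil
                || info.arguments.contains("-skipAppSetup")
            if !isTesting {
                startBackgroundWork() // placeholder for real app setup
            }
        }

        var body: some Scene {
            WindowGroup { Text("Hello") }
        }

        private func startBackgroundWork() {
            // network calls, analytics, background tasks, etc.
        }
    }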


Thanks! It's been over a decade since I've used Xcode/launched a native iOS app, so I wasn't sure what the capabilities were.

Looks like built-in UI testing was launched at WWDC 2015, so I missed it by a year!


Curious how this looks for red/green colorblind folks?

Do they see everything beyond the initial green as a shade of blue?

--Edit--

My red/green colorblind father just got back to me with this result:

> Your boundary is at hue 175, bluer than 68% of the population. For you, turquoise is green.


> GPT‑5.5 improves on GPT‑5.4’s scores while using fewer tokens.

This might be great if it translates to agentic engineering and not just benchmarks.

It seems some of the gains from Opus 4.6 to 4.7 required more tokens, not less.

Maybe more interesting is that they’ve used Codex to improve model inference latency. IIRC this is a new (and expectedly larger) pretrain, so it’s presumably slower to serve.


With Opus it’s hard to tell what was due to the tokenizer changes. Maybe using more tokens for the same prompt means the model effectively thinks more?

They say latency is the same as 5.4, and 5.5 is served on GB200 NVL72, so I assume 5.4 was served on Hopper.

Looks like analog clocks work well enough now; however, it still struggles with left-handed people.

Overall, quite impressed with its continuity and agentic (i.e. research) features.

