ollin's comments | Hacker News

- The OpenBSD one is 'TCP packets with invalid SACK options could crash the kernel' https://cdn.openbsd.org/pub/OpenBSD/patches/7.8/common/025_s...

- One (patched) Linux kernel bug is 'UaF when sys_futex_requeue() is used with different flags' https://github.com/torvalds/linux/commit/e2f78c7ec1655fedd94...

These links are from the more detailed 'Assessing Claude Mythos Preview’s cybersecurity capabilities' post released today (https://red.anthropic.com/2026/mythos-preview/), which expands on some of the public/fixed issues (like the OpenBSD one) and includes hashes for several unreleased reports and PoCs.
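
For anyone wanting to check those hash commitments once a report or PoC is released: hash the published file and compare against the pre-published value. A minimal sketch, assuming SHA-256 and a hypothetical filename (the post's actual algorithm and values are what you'd compare against):

    import hashlib

    # Hypothetical filename; compare the digest against the hash
    # pre-published in the Anthropic post.
    with open("mythos_report.pdf", "rb") as f:
        print(hashlib.sha256(f.read()).hexdigest())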


That OpenBSD one is exactly the kind of bug that easily slips past a human, especially since the code worked perfectly under regular circumstances.

Looks like they'd been approaching folks with their findings for at least a few weeks before this article.


Not entirely unrelated: Linux also had a remote SACK issue ~6 years back.

So if this Mythos is just an expensive combination of better RL and the original source material, that should hopefully point to where we might see an uptick in this kind of work (as opposed to a novel class of attack vectors).


My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

OpenAI had a whole post about this, where they recommended switching to SWE-bench Pro as a better (but still imperfect) benchmark:

https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions

> SWE-bench problems are sourced from open-source repositories many model providers use for training purposes. In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix

> improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time

> We’re building new, uncontaminated evaluations to better track coding capabilities, and we think this is an important area to focus on for the wider research community. Until we have those, OpenAI recommends reporting results for SWE-bench Pro.


> My impression was entirely the opposite; the unsolved subset of SWE-bench verified problems are memorizable (solutions are pulled from public GitHub repos) and the evaluators are often so brittle or disconnected from the problem statement that the only way to pass is to regurgitate a memorized solution.

Anthropic accounts for this:

> To detect memorization, we use a Claude-based auditor that compares each model-generated patch against the gold patch and assigns a [0, 1] memorization probability. The auditor weighs concrete signals—verbatim code reproduction when alternative approaches exist, distinctive comment text matching ground truth, and more—and is instructed to discount overlap that any competent solver would produce given the problem constraints.
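
For intuition, here's a crude, hypothetical stand-in for just one of those signals (verbatim overlap between the generated patch and the gold patch); the actual auditor is a Claude-based judge, not string matching:

    import difflib

    def overlap_ratio(generated_patch: str, gold_patch: str) -> float:
        # Crude proxy for the "verbatim code reproduction" signal; high
        # overlap is only suspicious when alternative fixes exist.
        return difflib.SequenceMatcher(None, generated_patch, gold_patch).ratio()

    print(overlap_ratio("-foo()\n+bar(x)\n", "-foo()\n+bar(x)\n"))  # 1.0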


I stand corrected.


Here's the developer thread I found, with lots of other reports of "Unable to Verify App - An internet connection is required to verify the trust of the developer": https://developer.apple.com/forums/thread/818403

Although https://developer.apple.com/system-status/ was green for most of the 3-4 hour outage, the page now at least acknowledges two minutes of downtime:

    App Store Connect - Resolved Outage
    Today, 12:04 AM - 12:06 AM
    All users were affected
    Users experienced a problem with this service.
Not a great developer experience.


Can't risk those precious 9s of uptime.


The still photo (with 富士康科技, i.e. Foxconn Technology, photoshopped out) is the second image of the "In Houston, workers assemble advanced AI servers" photo carousel https://www.apple.com/newsroom/images/2026/02/apple-accelera...


A lot of people mentioned this! The "dreamlike" comparison is common as well. In both cases, you have a network of neurons rendering an image approximating the real world :) so it sort of makes sense.

Regarding the specific boiling-textures effect: there's a tradeoff in recurrent world models between jittering (constantly regenerating fine details to avoid accumulating error) and drifting (propagating fine details as-is, even when that leads to accumulating error and a simplified/oversaturated/implausible result). The forest trail world is tuned way towards jittering (you can pause with `p` and step frame-by-frame with `.` to see this). So if the effect resembles LSD, it's possible that LSD applies some similar random jitter/perturbation to the neurons within your visual cortex.
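
A toy sketch of that tradeoff (not the actual forest-trail model, which runs as fragment shaders; `step` stands in for whatever learned next-frame predictor you have):

    import numpy as np

    rng = np.random.default_rng(0)

    def rollout(step, frame, n_steps, jitter=0.0):
        # jitter=0: propagate fine detail as-is, so errors compound (drift).
        # jitter>0: constantly re-randomize fine detail, trading shimmer
        # for less accumulated error.
        for _ in range(n_steps):
            frame = step(frame) + jitter * rng.normal(size=frame.shape)
        return frame

In practice the perturbation lives inside the model's sampling step rather than being added to pixels, but the knob is the same.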


Yup, similar concepts! Just at two opposite extremes of the compute/scaling spectrum.

- That forest trail world is ~5 million parameters, trained on 15 minutes of video, scoped to run on a five-year-old iPhone through a twenty-year-old API (WebGL GPGPU, i.e. OpenGL fragment shaders). It's the smallest '3D' world model I'm aware of.

- Genie 3 is (most likely) ~100 billion parameters trained on millions of hours of video and running across multiple TPUs. I would be shocked if it's not the largest-scale world model available to the public.

There are lots of neat intermediate-scale world models being developed as well (e.g. LingBot-World https://github.com/robbyant/lingbot-world, Waypoint 1 https://huggingface.co/blog/waypoint-1) so I expect we'll be able to play something of Genie quality locally on gaming GPUs within a year or two.


Really great to see this released! Some interesting videos from early-access users:

- https://youtu.be/15KtGNgpVnE?si=rgQ0PSRniRGcvN31&t=197 walking through various cities

- https://x.com/fofrAI/status/2016936855607136506 helicopter / flight sim

- https://x.com/venturetwins/status/2016919922727850333 space station, https://x.com/venturetwins/status/2016920340602278368 Dunkin' Donuts

- https://youtu.be/lALGud1Ynhc?si=10ERYyMFHiwL8rQ7&t=207 simulating a laptop computer, moving the mouse

- https://x.com/emollick/status/2016919989865840906 otter airline pilot with a duck on its head walking through a Rothko-inspired airport


These are extremely impressive from a technological progression standpoint, and at the same time not at all compelling, in the same way AI images and LLM prose are and are not.

It's neat I guess that I can use a few words and generate the equivalent of an Unreal 5 asset flip and play around in it. Also I will never do that, much less pay some ongoing compute cost for each second I'm doing it.


Exactly. People are getting so excited that all this stuff is possible, and forgetting that we are burning through innumerable finite resources just to prove something is possible.

They were too concerned with whether or not they could, they never stopped to think if they should.


Yeah, the future I see from this is just shitty walking video games that maybe look nice but have ridiculous input lag, stuttery frame rates, and no compelling gameplay loop or story. Oh, and another tool to fill up Facebook with more fake videos to make people angry. Oh well, I guess this is what we've decided to direct all our energy towards.


I was lucky enough to be an early tester. Here's a brief video walking through the process of creating worlds, showing examples: walking on the moon (with a NASA photo as part of the prompt), being in 221B Baker Street with Holmes and Watson, wandering through a night market in Taipei as a giant boba milk tea (note how the stalls are different and sell different foods), and also exploring the setting of my award-nominated tabletop RPG.

https://www.youtube.com/watch?v=FyTHcmWPuJE

It's an experimental research prototype, but it also feels like a hint of the future. Feel free to ask any questions.


I liked that first one, and I hope someone creates one going back to the dinosaur age; I want to see that.


One step closer to the science-based dinosaur MMO we were promised.


Tim is awesome.

Ironically, he covered PixVerse's world model last week and it came close to your ask: https://youtu.be/SAjKSRRJstQ?si=dqybCnaPvMmhpOnV&t=371

(Earlier in the video, it shows him prompting it live.)

World models are popping up everywhere, from almost every frontier lab.


Any thoughts about Project Genie?


On a technical level, this looks like the same diffusion transformer world model design that was shown in the Genie 3 post (text/memory/d-pad input, video output, 60sec max context, 720p, sub-10FPS control latency due to 4-frame temporal compression). I expect the public release uses a cheaper step-distilled / quantized version. The limitations seen in Genie 3 (high control latency, gradual loss of detail and drift towards videogamey behavior, 60s max rollout length) are still present. The editing/sharing tools, latency, cost, etc. can probably improve over time with this same model checkpoint, but new features like audio input/output, higher resolution, precise controls, etc. likely won't happen until the next major version.
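
Back-of-envelope on the control latency (the output frame rate here is assumed; the 4-frame compression figure is from the Genie 3 post):

    # With 4x temporal compression, the model emits one latent per 4
    # output frames, so a new input can't affect the video until the
    # next latent boundary.
    fps = 24                 # assumed output frame rate
    frames_per_latent = 4    # temporal compression factor
    print(fps / frames_per_latent)  # 6.0 control updates/sec, i.e. "sub-10FPS control"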

From a product perspective, I still don't have a good sense of what the market for WMs will look like. There's a tension between serious commercial applications (robotics, VFX, gamedev, etc., where you want way, way higher fidelity and very precise controllability) and the current short-form-demos-for-consumer-entertainment applications (where you want the inference to be cheap enough to be ad-supported and simple/intuitive to use). Framing Genie as a "prototype" inside their most expensive AI plan makes a lot of sense while GDM figures out how to target the product commercially.

On a personal level, since I'm also working on world models (albeit very small local ones https://news.ycombinator.com/item?id=43798757), my main thought is "oh boy, lots of work to do". If everyone starts expecting Genie 3 quality, local WMs need to become a lot better :)


Z-Image is another open-weight image-generation model by Alibaba [1]. Z-Image Turbo was released around the same time as (non-Klein) FLUX.2 and received a generally warmer community response [2], since Z-Image Turbo was faster, also high-quality, and reportedly better at generating NSFW material. The base (non-Turbo) version of Z-Image is not yet released.

[1] https://tongyi-mai.github.io/Z-Image-blog/

[2] https://www.reddit.com/r/StableDiffusion/comments/1p9uu69/no...


Z-Image is roughly as censored as Flux 2, from my very limited testing. It got popular because Flux 2 is just really big and slow. It is, however, great at editing, has an amazing breadth of built-in knowledge, and has great prompt adherence.

Z-Image got popular because the people stuck with 12GB video cards could still use it, and, hell, probably train on it, at least once the base version comes out. I think most people disparaging Flux 2 never tried it, as they wouldn't want to deal with how slowly it would work on their system, if they even realized that they could run it.


Ahh I see, and Klein is basically a response to Z-Image Turbo, i.e. another 4-8B sized model that fits comfortably on a consumer GPU.

It’ll be interesting to see how the NSFW catering plays out for the Chinese labs. I was joking a couple months ago to someone that Seedream 4’s talents at undressing were an attempt to sow discord, and it was interesting that it flew under the radar.

Post-Grok going full gooner pedo, I wonder if Grok will take the heat alone moving forward.


They are underselling Z-Image Turbo somewhat. It's arguably the best overall model for local image generation for several reasons including prompt adherence, overall output quality and realism, and freedom from censorship, even though it's also one of the smallest at 6B parameters.

ZIT is not far short of revolutionary. It is kind of surreal to contemplate how much high-quality imagery can be extracted from a model that fits on a single DVD and runs extremely quickly on consumer-grade GPUs.
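
The DVD arithmetic, for the curious (precision is assumed; these are rough weight-file sizes, ignoring container overhead):

    # Rough weight-file sizes for a 6B-parameter model:
    params = 6e9
    for name, bytes_per_param in [("fp16", 2), ("fp8", 1), ("fp4", 0.5)]:
        print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
    # fp16: ~12 GB, fp8: ~6 GB, fp4: ~3 GB. A single-layer DVD holds
    # ~4.7 GB, so "fits on a DVD" implies aggressive (~4-bit-ish)
    # quantization, or a dual-layer disc at 8-bit.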


Hold on now. Z-Image Turbo has gotten a lot of hype, but it's worse than Qwen Image and Flux 2 (the full-sized version) at all of those things, other than perhaps making images look like they were shot on a cell phone camera. Once you get away from photographic portraits of people, it quickly shows just how little it can do.

It is, however, small and quick.


Not in my experience. Flux 2 is much larger and heavily censored, and Qwen-Image is just plain not as good. You can fool me into thinking that Z-Image Turbo output isn't AI, while that's rarely the case with Qwen.

Look at the images I posted elsewhere in this section. They are crappy excuses for pogo sticks, but they absolutely do NOT look like they came from a cell phone.

Also see vunderba's page at https://genai-showdown.specr.net/ . Even when Z-Image Turbo fails a test, it still looks great most of the time.

Edit re: your other comment -- don't make the mistake of confusing censorship with lack of training data. Z-Image will try to render whatever you ask for, but at the end of the day it's a very small model that will fail once you start asking for things it simply wasn't trained on. They didn't train it with much NSFW material, so it has some rather... unorthodox anatomical ideas.


Everything you said is exactly the truth.

However... I’m already expecting the blowback when a Z-Image release doesn’t wow people like the Turbo finetune does. SDXL hasn’t been out two years yet; seems like a decade.

We’ll see. I’m hopeful that Z works as expected and sets the new watermark. I’m just not sure it does it right out of the gate.


>Post-Grok going full gooner pedo

Almost afraid to ask, but anytime grok or x or musk comes up I am never sure if there is some reality based thing, or some “I just need to hate this” thing. Sometimes they’re the same thing, other times they aren’t.

I can guess here that, because Grok likely uses WAN, someone wrote some gross prompts and then pretended this is an issue unique to Grok, for effect?


A few days ago people were replying to every image on Twitter saying "Grok, put him/her/it in a bikini" and Grok would just do it. It was minimum effort, maximum damage trolling and people loved it.


Ah. So, see, this is exactly why I need to check apparently.

Personally, I go between “I don’t care at all” and “well it’s not ideal” on AI generations. It’s already too late, but the barrier of entry is a lot lower than it was.

But I’m applying a good faith argument where GP does not seem to have intended one.


Reducing it to "some people put people in bikinis for a couple days for the lulz" is... not quite what happened.

You may note I am no shrinking violet, nor do I lack perspective, as evidenced by my notes on Seedream. And fortuitously, I mentioned it before being dismissed as bad faith: I could not have foreseen needing to call it out as credentials until now.

I don't think it's kind to accuse others of bad faith, as evidenced by me not passing judgement on the description given by the person you're replying to.

I do admit it made my stomach churn a little bit to see how quickly people will other. Not on you, I'm sure I've done this too. It's stark when you're on the other side of it.


Nah it's been happening for months and involved kids, over and over, albeit for the same reasoning, lulz & totally based. I am a bit surprised that you thought this was just a PG-rated stunt on X for a couple days, it's been in the news for weeks, including on HN.


I see absolutely no citations. Can you point to anything that shows a specific Grok issue vs generally people doing icky things with photo generation software?

Because, as I remember, you said “post-pedo Grok”.


You can Google whatever you need yourself at this point, you told the world I was operating in bad faith based off one sentence from a stranger. You ignored my reply to you. And now you are engaging with me on another reply as if my claim was Grok is uniquely capable of this, when I in fact said the opposite, and the interesting part of the discussion was me pointing out all can do this. Have a good day!


“Post-pedo grok”

Just admit you’re very accustomed to shitting on X, Grok, whatever Musk is associated with, as a reinforcement of your political ideology.

Your comments weren’t about AI, they were about Grok, and then you were incapable of defending that claim.


I am of no party or clique. Why would Elon be doing moderation anyway? He has better things to do. If anything, it sounded understaffed and thus taken advantage of by ne’er-do-wells. You can check whether I’m pivoting by noting that I noted in my original post that every model can do this, and that Grok being focused on was a strange aberration.

I feel pathetic defending myself to someone who keeps reading my mind in the blandest way possible, then accuses me of wrongthought I must have had, based on things I never said. Hard to believe you’re living up to your ideals in this moment if you’re a fellow advocate for truth seekers and great men. I respect interlocution, but not repeated personal attacks based on thoughts projected and things unsaid. That’s not truth seeking behavior.


https://madebyoll.in

I write about on-device generative models (particularly world models). Past posts have been reasonably well-received on HN (https://news.ycombinator.com/from?site=madebyoll.in).


Yeah the issue reads as if someone asked Claude Code "find the most serious performance issue in the VSCode rendering loop" and then copied the response directly into GitHub (without profiling or testing anything).

