Hacker News | coder68's comments

I have not delved into the theory yet, but it seems that the smaller open-source models do this already to an extent. They have fewer parameters, but spend much more time/tokens reasoning, as a way to close the performance gap. If you look at "tokens per problem" on https://swe-rebench.com/ it seems to be the case, at least.

20 years is quite an optimistic timeline. Of course, we will use agents to solve the problems of agents!

Are there plans to release a QAT model? Similar to what was done for Gemma 3. That would be nice to see!

120B would be great to have if you have it stashed away somewhere. GPT-OSS-120B still stands as one of the best (and fastest) open-weights models out there. A direct competitor in the same size range would be awesome. The closest recent release was Qwen3.5-122B-A10B.

Nemotron 3 Super was released recently. That's a direct competitor to gpt-oss-120b. https://developer.nvidia.com/blog/introducing-nemotron-3-sup...

In terms of ability, maybe, in terms of speed, it's not even close. Check out the Prompt Processing speeds between them: https://kyuz0.github.io/amd-strix-halo-toolboxes/

gpt-oss-120b is over 600 tokens/s PP for all but one backend.

nemotron-3-super is at best 260 tokens/s PP.

Comparing token generation, it's again roughly 50 tokens/sec vs 15 tokens/sec.

That really bogs down agentic tooling. Something needs to be categorically better to justify cutting output speed in half or worse, not just playing in the margins.


In my case, with vLLM on dual RTX Pro 6000:

gpt-oss-120b: (unknown prefill), ~175 tok/s generation. I don't remember the exact prefill speed, but it was certainly below 10k.

Nemotron-3-Super: 14070 tok/s prefill, ~194.5 tok/s generation. (Tested fresh after reload, no caching, I have a screenshot.)

Nemotron-3-Super was run using NVFP4 and speculative decoding via MTP, 5 tokens at a time, as described in the Nvidia cookbook: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemo...
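For anyone unfamiliar with the MTP-style speculative decoding mentioned above: the rough idea is that a cheap draft predictor proposes several tokens ahead, and the full model verifies them in one pass, keeping the longest prefix it agrees with. A minimal greedy sketch (the function name and the toy predictors are made up for illustration; this is the intuition, not Nvidia's actual implementation):

```python
def speculative_decode_step(draft_next, target_next, context, k=5):
    """One round of greedy speculative decoding.

    draft_next / target_next: callables mapping a token list to the
    next token (stand-ins for a cheap draft head and the full model).
    Returns the tokens accepted this round.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target model verifies; keep the longest agreeing prefix.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if expected == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: emit the target's own token and stop.
            accepted.append(expected)
            break
    else:
        # All k accepted: the verification pass yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

When the draft agrees with the target, each verification pass yields up to k+1 tokens for roughly the cost of one target forward pass, which is where the generation speedup comes from.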


I gave it a whirl but was unenthused. I'll try it again, but so far have not really enjoyed any of the nvidia models, though they are best in class for execution speed.

I'll pipe in here as someone working on an agentic harness project using mastra as the harness.

Nemotron-3-Super is, without question, my favorite model now for my agentic use cases. The closest model I would compare it to, in vibe and feel, is the Qwen family, but this thing has an ability to hold attention through complicated (often noisy) agentic environments, and I sometimes find myself checking that I'm not on a frontier model.

I now just rent a Dual B6000 on a full-time basis for myself for all my stuff; this is the backbone of my "base" agentic workload, and I only step up to stronger models in rare situations in my pipelines.

The biggest thing with this model, I've found, is just making sure my environment is set up correctly; the temps and templates need to be exactly right. I've had hit-or-miss results with OpenRouter. But running this model on a B6000 from Vast with the native NVFP4 weights from Nvidia, it's really good (2,500 peak tokens/sec on that setup with batching; about 100 tok/s for a single request at 250k context). :)

I can run it on a single B6000 up to about 120k context reliably, but this thing really SCREAMS on a dual B6000. (I'm close to just ordering a couple for myself, it's working so well.)

Good luck. (Sometimes I feel like the crazy guy in the woods for loving this model so much; I'm not sure why more people aren't jumping on it.)


> I'm not sure why more people aren't jumping on it

Simple: most of the people you’re talking to aren’t setting these things up. They’re running off-the-shelf software and setups and calling it a day. Most of them aren’t working with custom harnesses or even tweaking temperature or templates.


I’d be very interested in trying it if you could spare the time to write up how to tune it well. If not, thanks for the input anyway.

The good news is local models have significantly improved. If it all goes down today, you can still run e.g. Qwen 3.5 at home, and it's "good enough" for most workloads.

With a gaming GPU you can run Qwen3.5-35B-A3B. I use 122B-A10B on my local rig (1x6000 Pro), and 397B-A17B on my 2x6000 Pro server (some spillover into CPU/RAM). It's pricey now but probably within a few years it'll become very affordable.


We all probably need to touch grass a bit. Our industry is really out of touch with reality right now, although the looming impact of AI is probably quite real.


It does? There is a fast drop followed by a long decay; it is exponential, in fact. The cooling rate is proportional to the temperature difference, so the drop is sharpest at the very beginning, when the object is hottest.
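The "fast drop, then long tail" shape is exactly what Newton's law of cooling predicts: dT/dt = -k(T - T_env), so T(t) = T_env + (T0 - T_env)·e^(-kt). A minimal numeric sketch (the constants here are made up for illustration):

```python
import math

def newton_cooling(T0, T_env, k, t):
    """Temperature at time t under Newton's law of cooling:
    dT/dt = -k * (T - T_env)  =>  T(t) = T_env + (T0 - T_env) * exp(-k*t)."""
    return T_env + (T0 - T_env) * math.exp(-k * t)

# The drop per unit time is largest at the start, when T - T_env is biggest.
T0, T_env, k = 90.0, 20.0, 0.1  # arbitrary example values
drops = [newton_cooling(T0, T_env, k, t) - newton_cooling(T0, T_env, k, t + 1)
         for t in range(5)]
# drops is strictly decreasing: the sharpest fall happens at t = 0.
```

So a single exponential already produces a steep initial drop, because the per-step temperature loss shrinks geometrically as the object approaches ambient temperature.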


I mean, that initial drop doesn't look like it is part of the same exponential decay.


Is there interest in benchmarking the proprietary LLMs for translation? Curious as I often use Gemini 3 Flash, but I have no idea how good it is for my language family. I prefer open models (in fact the smaller the better for offline), but it'd be useful to know how well the Big Three do.


We did some benchmarking of them internally, but not sure if we'll publish the detailed results. Just in case, keep an eye on https://huggingface.co/spaces/facebook/bouquet: if we release the evaluation results, they will be there.


Thanks! Super interested in LLMs for translation :D glad to see you folks doing this work.


Even working in "tech" but not FAANG, this is so true: 10 days is still the norm at many white-collar businesses in your first year of employment, sometimes 15 days if they're generous.


The tradeoff with many EU countries would be that they enjoy their leisure time a lot more and sooner than Americans. Americans make more and save more statistically, but they spend it on cars, houses, and medical care, and generally have way less free time. So I think it's a wash.

