Hacker News | coder68's comments

I have not delved into the theory yet, but it seems that the smaller open-source models do this already to an extent. They have fewer parameters, but spend much more time/tokens reasoning, as a way to close the performance gap. If you look at "tokens per problem" on https://swe-rebench.com/ it seems to be the case, at least.

20 years is quite an optimistic timeline. Of course, we will use agents to solve the problems of agents!

Are there plans to release a QAT model? Similar to what was done for Gemma 3. That would be nice to see!

120B would be great to have if you have it stashed away somewhere. GPT-OSS-120B still stands as one of the best (and fastest) open-weights models out there. A direct competitor in the same size range would be awesome. The closest recent release was Qwen3.5-122B-A10B.

Nemotron 3 Super was released recently. That's a direct competitor to gpt-oss-120b. https://developer.nvidia.com/blog/introducing-nemotron-3-sup...

In terms of ability, maybe, in terms of speed, it's not even close. Check out the Prompt Processing speeds between them: https://kyuz0.github.io/amd-strix-halo-toolboxes/

gpt-oss-120b is over 600 tokens/s PP for all but one backend.

nemotron-3-super is at best 260 tokens/s PP.

Comparing token generation, it's again roughly 50 tokens/sec vs 15 tokens/sec.

That really bogs down agentic tooling. Something needs to be categorically better to justify cutting output speed in half or worse, not just playing in the margins.


In my case, with vLLM on dual RTX Pro 6000:

gpt-oss-120b: (unknown prefill), ~175 tok/s generation. I don't remember the exact prefill speed, but it was certainly below 10k.

Nemotron-3-Super: 14070 tok/s prefill, ~194.5 tok/s generation. (Tested fresh after reload, no caching, I have a screenshot.)

Nemotron-3-Super was run using NVFP4 and speculative decoding via MTP, 5 tokens at a time, as described in the Nvidia cookbook: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemo...
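For anyone unfamiliar with the MTP-style speculative decoding mentioned above: the rough idea is that a cheap draft predictor proposes several tokens ahead, and the full model verifies them in one pass, keeping the longest prefix it agrees with. A minimal greedy sketch (the function name and the toy predictors are made up for illustration; this is the intuition, not Nvidia's actual implementation):

```python
def speculative_decode_step(draft_next, target_next, context, k=5):
    """One round of greedy speculative decoding.

    draft_next / target_next: callables mapping a token list to the
    next token (stand-ins for a cheap draft head and the full model).
    Returns the tokens accepted this round.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target model verifies; keep the longest agreeing prefix.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if expected == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # First mismatch: emit the target's own token and stop.
            accepted.append(expected)
            break
    else:
        # All k accepted: the verification pass yields one bonus token.
        accepted.append(target_next(ctx))
    return accepted
```

When the draft agrees with the target, each verification pass yields up to k+1 tokens for roughly the cost of one target forward pass, which is where the generation speedup comes from.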


I gave it a whirl but was unenthused. I'll try it again, but so far have not really enjoyed any of the nvidia models, though they are best in class for execution speed.

I'll pipe in here as someone working on an agentic harness project using mastra as the harness.

Nemotron-3-Super is, without question, my favorite model now for my agentic use cases. The closest model I would compare it to, in vibe and feel, is the Qwen family, but this thing has an ability to hold attention through complicated (often noisy) agentic environments, and I sometimes find myself checking that I'm not on a frontier model.

I now just rent a Dual B6000 on a full-time basis for myself for all my stuff; this is the backbone of my "base" agentic workload, and I only step up to stronger models in rare situations in my pipelines.

The biggest thing with this model, I've found, is just making sure my environment is set up correctly; the temps and templates need to be exactly right. I've had hit-or-miss results with OpenRouter. But running this model on a B6000 from Vast with the native NVFP4 weights from Nvidia, it's really good (2,500 peak tokens/sec on that setup with batching; about 100 tok/s for a single request at 250k context). :)

I can run it on a single B6000 up to about 120k context reliably, but this thing really SCREAMS on a dual B6000. (I'm close to just ordering a couple for myself, it's working so well.)

Good luck. (Sometimes I feel like the crazy guy in the woods for loving this model so much; I'm not sure why more people aren't jumping on it.)


> I'm not sure why more people aren't jumping on it

Simple: most of the people you’re talking to aren’t setting these things up. They’re running off-the-shelf software and setups and calling it a day. Most of them aren’t working with custom harnesses or even tweaking temperature or templates.


I’d be very interested in trying it if you could spare the time to write up how to tune it well. If not, thanks for the input anyway.

The good news is local models have significantly improved. If it all goes down today, you can still run e.g. Qwen 3.5 at home, and it's "good enough" for most workloads.

With a gaming GPU you can run Qwen3.5-35B-A3B. I use 122B-A10B on my local rig (1x6000 Pro), and 397B-A17B on my 2x6000 Pro server (some spillover into CPU/RAM). It's pricey now but probably within a few years it'll become very affordable.


We all probably need to touch grass a bit. Our industry is really out of touch with reality right now, although the looming impact of AI is probably quite real.


It does? There is a fast drop followed by a long decay; it is exponential, in fact. The cooling rate is proportional to the temperature difference, so the drop is sharpest at the very beginning, when the object is hottest.
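The "fast drop, then long tail" shape is exactly what Newton's law of cooling predicts: dT/dt = -k(T - T_env), so T(t) = T_env + (T0 - T_env)·e^(-kt). A minimal numeric sketch (the constants here are made up for illustration):

```python
import math

def newton_cooling(T0, T_env, k, t):
    """Temperature at time t under Newton's law of cooling:
    dT/dt = -k * (T - T_env)  =>  T(t) = T_env + (T0 - T_env) * exp(-k*t)."""
    return T_env + (T0 - T_env) * math.exp(-k * t)

# The drop per unit time is largest at the start, when T - T_env is biggest.
T0, T_env, k = 90.0, 20.0, 0.1  # arbitrary example values
drops = [newton_cooling(T0, T_env, k, t) - newton_cooling(T0, T_env, k, t + 1)
         for t in range(5)]
# drops is strictly decreasing: the sharpest fall happens at t = 0.
```

So a single exponential already produces a steep initial drop, because the per-step temperature loss shrinks geometrically as the object approaches ambient temperature.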


I mean, that initial drop doesn't look like it is part of the same exponential decay.


Is there interest in benchmarking the proprietary LLMs for translation? Curious as I often use Gemini 3 Flash, but I have no idea how good it is for my language family. I prefer open models (in fact the smaller the better for offline), but it'd be useful to know how well the Big Three do.


We did some benchmarking of them internally, but not sure if we'll publish the detailed results. Just in case, keep an eye on https://huggingface.co/spaces/facebook/bouquet: if we release the evaluation results, they will be there.


Thanks! Super interested in LLMs for translation :D glad to see you folks doing this work.


Even working in "tech" but not FAANG, this is so true: 10 days is still the norm at many white-collar businesses in your first year of employment, sometimes 15 days if they're generous.


The tradeoff with many EU countries would be that they enjoy their leisure time a lot more and sooner than Americans. Americans make more and save more statistically, but they spend it on cars, houses, and medical care, and generally have way less free time. So I think it's a wash.

