fy20's comments | Hacker News

Running it on a Macbook Pro M5 48GB:

        llama-server \
        -hf unsloth/Qwen3.6-27B-GGUF:UD-Q6_K_XL \
        -c 128000 \
        --parallel 1 \
        --flash-attn on \
        --no-context-shift \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        --min-p 0.0 \
        --presence-penalty 0.0 \
        --reasoning on \
        --jinja \
        --chat-template-kwargs "{\"preserve_thinking\": true}" \
        --spec-type ngram-simple \
        --draft-max 64 \
        --timeout 1800
Does anyone have tips for optimising prompt processing, as that's the slowest part? It takes a few minutes before OpenCode, with ~20k tokens of initial context, gets its first response, but subsequent responses are pretty fast thanks to prompt caching.
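
For what it's worth, the main knob I've seen suggested for prefill speed is larger batch sizes, i.e. appending something like this to the command above (untested on my side; llama-server's -b/--batch-size and -ub/--ubatch-size default to 2048 and 512):

        -b 4096 -ub 2048

Larger batches trade extra memory during prefill for throughput, so it may or may not help on Metal; I'd be curious whether it moves the needle on a ~20k-token prompt.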

https://github.com/jundot/omlx

note: 27b is going to be slow; use the 35b MoE if you want decent token/sec speed.


Honestly, I haven't dug around to figure out whether there's a hardware reason for it, but prompt processing has always been a lot slower for me on Macs in general. I mostly use MLX on my 24GB M4 Pro though, so I'll pull llama.cpp onto it as well to see what the prefill is like.

I've gotten around 16 t/s generation with 4-bit and mxfp4 quants of that model. The 3090 I mentioned has a little over 900 GB/s of memory bandwidth, while those Macs are, I think, around 270 GB/s. If my understanding is correct, Macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s depending on context size).

Also, if you want to tinker with it a bit more, do run a quick experiment with the cache quants removed; IIRC quantising the KV cache adds a small overhead during prefill.

I would be very interested to know your prefill and generation numbers.
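
If it helps, llama-bench reports prompt processing (pp) and generation (tg) throughput separately, so a sketch like this (the model path is a placeholder) would give both the numbers above and the KV-quant overhead in one go:

        # baseline with the default f16 KV cache
        llama-bench -m model.gguf -fa 1 -p 4096 -n 128

        # same run with the quantised KV cache, to see the prefill overhead
        llama-bench -m model.gguf -fa 1 -p 4096 -n 128 -ctk q8_0 -ctv q8_0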


Interesting that the memory in the laptop is upgradeable or bring-your-own, whereas in the Framework Desktop it is soldered. How does that work?

The desktops use AMD Strix Halo chips, which need soldered-on memory to reach the required bandwidth.

I think the reason is twofold:

- If you pay for unlimited trips will you choose the Ferrari or the old VW? Both are waiting outside your door, ready to go.

- Providers that let you choose models don't really differentiate pricing between lower-tier models. On my grandfathered Cursor plan I pay 1x request to use Composer 2 or 2x requests to use Opus 4.6. Until pricing is differentiated enough that people can say "OK, yes, Opus is smarter, but paying 10x more when Haiku would do the same isn't worth it", it won't happen.


Agreed on both points. We’re dealing with a cost/benefit analysis, and to this point, coders have been subsidized, coerced…maybe even mandated into using the most expensive option as if it was a limitless resource. Clearly not true, and so of course we’re going to see nerfing of the tools over time.

Obviously we’re a long way away from being able to rationally evaluate whether the value of X tokens in model Y is better than model Z, let alone better in terms of developer cost, but that’s kind of where we need to get to, otherwise the model providers are selling magic beans rated in ineffable units of magicalness. The only rational behavior in such a world is to gorge yourself.


They changed that recently; you need to be paying €10/mo for it now. The free tier and/or access on the basic Twitter plan are gone.

That doesn't make it better! It did somehow slow down the regulatory response because politicians are dumb, though.

It means X can identify users at least, so they are probably quite a bit less likely to do that.

You’ve obviously never attempted to complete a purchase while working under a regulatory body, required to test the theory.

What difference does that make?

It's much funnier now: by putting it behind a paywall, they're explicitly saying "it's okay for you to do this, you just have to purchase a license first".

Security through enshittification. Nice.

Our company (~25 engineers) uses it across the entire engineering and product orgs, and yes, we are quite deep into agentic coding. We use their cloud agents for a lot of things, e.g. automated investigations of alarms, handling most customer support issues that end up reaching engineering, pre-processing Linear tickets before humans triage them, and Bugbot for PR reviews with learned knowledge. Recently, though, it has felt like they are pulling the rug out from under our legacy plan, so we may end up switching.

Everyone thinks Apple is the target, but they are actually one of the better companies on this front. You can buy first-party replacement parts, and tools are available. If you take a look at Chinese, or sometimes even Samsung, phones, it's basically impossible to get replacement parts, and if you do, other parts like the glass back may also need replacing, as it's impossible to remove without breaking it.

Isn't this just because they will be refreshed soon? Rumours point to around June. I'd imagine Apple stops making the old hardware a few months before a refresh and then just sells off the old stock. Maybe they had shorter contracted orders and/or demand is higher than expected.

Last week I got my (customised) M5 MacBook Pro that I ordered during launch week; that's not really any longer than expected when ordering a new model.


There was some article today claiming the updated versions won't launch before October.

The work going into local models seems to be targeting lower RAM/VRAM requirements, which will definitely help.

For example, Gemma 4 32B, which you can run on an off-the-shelf laptop, is at around the same intelligence level as, or even higher than, the SOTA models from two years ago (e.g. GPT-4o). By the time memory prices come down, we'll probably have something as smart as Opus 4.7 that can be run locally.

Bigger models of course have more embedded knowledge, but a smaller model just knowing that it should make a tool call to do a web search can bypass a lot of that.
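
As a rough illustration (a sketch, not a recipe): llama-server started with --jinja exposes an OpenAI-compatible /v1/chat/completions endpoint, and you can hand a small local model a tool definition like the hypothetical web_search below; the model only has to decide to call it, and your agent does the actual searching:

        curl http://localhost:8080/v1/chat/completions \
          -H "Content-Type: application/json" \
          -d '{
            "messages": [{"role": "user", "content": "Summarise the latest llama.cpp release notes"}],
            "tools": [{
              "type": "function",
              "function": {
                "name": "web_search",
                "description": "Search the web and return result snippets",
                "parameters": {
                  "type": "object",
                  "properties": {"query": {"type": "string"}},
                  "required": ["query"]
                }
              }
            }]
          }'

If a ~30B local model reliably emits that web_search call, it can answer questions far outside its training data, which is the point above about embedded knowledge mattering less.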


It feels like a bit of history is missing... If ollama was founded 3 years before llama.cpp was released, what engine did they use then? When did they transition?

I don't think that is the case. Llama.cpp appeared within weeks after Meta released LLaMA to select researchers (from whom it then made its way out to the public). Three years before that, nobody knew the name LLaMA. I'm sure that llama.cpp existed first.

> within weeks

One week, really, if we consider the "public" availability.

Llama announced: February 24, 2023

Weights leaked: March 3, 2023

Llama.cpp: March 10, 2023

(Ollama 0.0.1: Jul 8, 2023)


They spent several years in stealth mode, but the initial release was built on llama.cpp.

Ollama v0.0.1 "Fast inference server written in Go, powered by llama.cpp" https://github.com/ollama/ollama/tree/v0.0.1


> They spent several years in stealth mode

doing what?

trying to build themselves what llama.cpp ended up doing for them?


I asked myself the same question. Another commenter mentioned above that they started with some Kubernetes infrastructure thing and pivoted later.

How is your physical activity? I used to get really tired at work after lunch, and after I started regularly going to the gym it fixed that. My energy levels throughout the day are now a lot more stable. Didn't fix my insomnia though :-)

Yeah, barely anything except living in a walkable city and walking every day. Possibly it would be ideal to go to the gym around 7pm or so after work and get some energy there; I'd probably need to try it, but I feel too tired and lazy after work, heh.
