
It's amazing how consciousness remains a mystery given all the scientific progress over the last 100 years

Is it surprising? It seems likely you could build a complete working model of the universe with no provision for consciousness at all. As far as modern science goes, it's an intractable problem

You can do it with no provision for molecules too.

It doesn't seem likely to me that, in just a couple hundred years, humans have developed such a thorough understanding of every natural process as all that.

Consciousness had millions of years headstart. Give it time.

We don't know that.

Consciousness might have actually started today at 7am and, before that, we were all automatons without subjective experience of the world, just going through the motions.

You might say that's impossible, because yesterday you were conscious and you know that, but you can't prove it to anyone.

Epistemologically, this is not a problem that can be solved with "give it time".


Why reading cozy sci-fi in the age of AI feels like an act of resistance


LLMs on device are the future. It's more secure, it solves the problem of inference demand outstripping data-center supply, and it would use less electricity. It's just a matter of getting the performance good enough. Most users don't need frontier model performance.

I disagree with every sentence of this.

> solves the problem of too much demand for inference

False, it creates consumer demand for inference chips, which will be badly utilised.

> also would use less electricity

What makes you think that? (MAYBE you can save power on cooling. But not if the data center is close to a natural heat sink)

> It's just a matter of getting the performance good enough.

The performance limitations are inherent to the limited compute and memory.

> Most users don't need frontier model performance.

What makes you think that?


> False, it creates consumer demand for inference chips, which will be badly utilised.

I think the opposite is true. Local inference doesn't have to go over the wire and through a bunch of firewalls and what have you. The performance from just regular consumer hardware with local, smaller models is already decent. You're utilizing the hardware you already have.

> The performance limitations are inherent to the limited compute and memory.

When you plug a local LLM and inference engine into an agent that is built around the assumption of using a cloud/frontier model, then that's true.

But agents can be built around local assumptions and more specific workflows and problems. That also includes the model orchestration and model choice per task (or even tool).

The Jevons Paradox comes into play with cloud models. But when you have fewer resources you are forced to move to more deterministic workflows. That includes tighter control over what the agent can do at any point in time, but also per-project/session workflows where you generate intermediate programs/scripts instead of letting the agent just do whatever it wants.

I'll give you an example:

When you ask a cloud based agent to do something and it wants more information, it will often do a series of tool calls to gather what it thinks it needs before proceeding. Very often you can front load that part, by first writing a testable program that gathers most of the necessary information up front and only then moving into an agentic workflow.

This approach can produce a bunch of .json, .md files or it can move things into a structured database or you can use embeddings or what have you.

This can save you a lot of inference, make things more reusable and you don't need a model that is as capable if its context is already available and tailored to a specific task.
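A minimal sketch of that front-loading step, with all names hypothetical: a small, testable script that snapshots project context into a .json file once, so the agent starts from it instead of burning inference on discovery tool calls.

```python
import json
from pathlib import Path

def gather_context(root: str, exts=(".py", ".md")) -> dict:
    """Collect a cheap, deterministic snapshot of a project: file paths,
    sizes, and each file's first line as a summary hint."""
    root_path = Path(root)
    files = []
    for p in sorted(root_path.rglob("*")):
        if p.is_file() and p.suffix in exts:
            head = p.read_text(errors="ignore").splitlines()[:1]
            files.append({
                "path": p.relative_to(root_path).as_posix(),
                "bytes": p.stat().st_size,
                "first_line": head[0] if head else "",
            })
    return {"root": root, "file_count": len(files), "files": files}

def write_context(root: str, out_file: str = "context.json") -> Path:
    """Write the snapshot next to the project; the agent's system prompt
    can then point at context.json instead of rediscovering everything."""
    out = Path(root) / out_file
    out.write_text(json.dumps(gather_context(root), indent=2))
    return out
```

Because it's a plain script, you can test it and rerun it deterministically, which is exactly the property the agent's own ad-hoc tool calls lack.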


Parallel inference at large compute scales in superlinear ways. There is no way to beat the reduction in memory transfers that a data-center inference setup provides with hardware that fits in anything you could call a home. It is much more energy efficient to process huge batches of parallel requests than to have one or a handful of queries running on an accelerator.
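A toy back-of-envelope model of why batching wins (assumed numbers, ignoring KV-cache and attention traffic): each decode step streams roughly all weights through memory once, and a batch of concurrent requests shares that single pass.

```python
def decode_bytes_per_token(n_params: float, bytes_per_weight: float, batch: int) -> float:
    """Rough memory traffic per generated token during decode: each step
    streams (roughly) all model weights through memory once, and a batch
    of concurrent requests shares that single pass, so per-token traffic
    drops about linearly with batch size. KV-cache traffic is ignored."""
    return n_params * bytes_per_weight / batch

# Assumed numbers: a 70B-parameter model at 2 bytes/weight (bf16).
single = decode_bytes_per_token(70e9, 2, batch=1)    # one local user
batched = decode_bytes_per_token(70e9, 2, batch=64)  # 64 users per server
print(single / batched)  # prints 64.0: 64x less weight traffic per token
```

Since decode is typically memory-bandwidth bound, that traffic reduction translates fairly directly into energy per token, which is the point being made above.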

Aren't data centers extremely energy inefficient due to network latency, memory bottlenecks and so on? I mean, the models that run on them are extremely powerful compared to what you can run on consumer hardware, but I wouldn't call them efficient...

I'm sorry to jump into this conversation, but the energy cost of running a model is some orders of magnitude higher (it requires large amounts of specialised computing power) than that of the entire network stack of all the nodes involved in the internet traffic of a particular request.

Meaning: those 5000 tokens consume tiny amounts of energy being moved around from the data center to your PC, but enormous amounts of energy being generated in the first place. An equivalent webpage with the same amount of text as those tokens would be perceived as instant in any network configuration. It's just a few kilobytes of text, much smaller than most background graphics. The two things can't be compared at all.

However, just last week there have been huge improvements on the hardware required to run some particular models, thanks to some very clever quantisation. This lowers the memory required 6x in our home hardware, which is great.

In the end, we spent more energy playing videogames over the last two decades than on all this AI craze, and it was never a problem. We can surely run models locally, and heat our homes in winter.


> What makes you think that?

The fact that today's and yesterday's models are quite capable of handling mundane tasks, and even companies behind frontier models are investing heavily in strategies to manage context instead of blindly plowing through problems with brute-force generalist models.

But let's flip this around: what on earth even suggests to you that most users need frontier models?


Everybody has difficult decisions to make in their daily lives and in their work.

Having access to a model that is drawing from good sources and takes time to think instead of hallucinating a response is important in many domains of life.


> False, it creates consumer demand for inference chips, which will be badly utilised.

There are so many CPUs, GPUs, RAM sticks and SSDs that are underutilized. I have some in my closet at 5% load at peak times. Why would inference chips be special once they become commodity hardware?


That's the point: they're better utilized in the cloud.

"consumer demand for inference chips, which will be badly utilised"

why do you assume it will be badly utilised? Can't be worse than what we have now, which is chips already badly utilised by Windows bloatware


> What makes you think that?

Looking at actual users of LLMs


While not everybody is a professional in YOUR domain, many people are professionals in SOME domain. And even outside of that, they deserve a smart conversation partner, for example on topics like health and politics.

I very recently installed llama.cpp on my consumer-grade M4 MBP, and I've been having loads of fun poking and prodding the local models. There's now a ChatGPT style interface baked into llama.cpp, which is very handy for quick experimentation. (I'm not entirely sure what Ollama would get me that llama.cpp doesn't, happy to hear suggestions!)

There are some surprisingly decent models that happily fit even into a mere 16 gigs of RAM. The recent Qwen 3.5 9B model is pretty good, though it did trip all over itself to avoid telling me what happened on Tiananmen Square in 1989. (But then I tried something called "Qwen3.5-9B-Uncensored-HauhauCS-Aggressive", which veers so hard the other way that it will happily write up a detailed plan for your upcoming invasion of Belgium, so I guess it all balances out?)


Qwen3.5 has tool calling, so you can give it a wikipedia tool which it uses to know what happened in Tiananmen Square without issues =)

That's very cool! I think giving it some research tools might be a nifty thing to try next. This is a fairly new area for me, so pointers or suggestions are welcome, even basic ones. :)

Worth adding that I had reasoning on for the Tiananmen question, so I could see the prep for the answer, and it had a pretty strong current of "This is a sensitive question to PRC authorities and I must not answer, or even hint at an answer". I'm not sure if a research tool would be sufficient to overcome that censorship, though I guess I'll find out!


Basically, ask any coding agent to create a simple tool-calling harness for a local model and it'll most likely one-shot it.

Getting the local weather using a free API like met.no is a good first tool to use.
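A hedged sketch of what that first tool could look like (the tool name, harness details and User-Agent string are my assumptions, not anything from llama.cpp itself): an OpenAI-style tool schema that llama-server's /v1/chat/completions endpoint can accept, plus the met.no call behind it. met.no's free locationforecast API needs no key, but its terms of service require a descriptive User-Agent header.

```python
import json
import urllib.request

# OpenAI-style tool schema; llama.cpp's llama-server accepts this format
# on /v1/chat/completions when the loaded model supports tool calling.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current forecast for a latitude/longitude via met.no",
        "parameters": {
            "type": "object",
            "properties": {
                "lat": {"type": "number"},
                "lon": {"type": "number"},
            },
            "required": ["lat", "lon"],
        },
    },
}

def met_no_url(lat: float, lon: float) -> str:
    # met.no's free locationforecast endpoint: no API key needed.
    return ("https://api.met.no/weatherapi/locationforecast/2.0/compact"
            f"?lat={lat}&lon={lon}")

def get_weather(lat: float, lon: float) -> dict:
    # met.no requires a descriptive User-Agent; the one below is a
    # placeholder -- substitute your own app name and contact address.
    req = urllib.request.Request(
        met_no_url(lat, lon),
        headers={"User-Agent": "my-local-agent/0.1 you@example.com"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The harness loop passes `tools=[WEATHER_TOOL]` with each chat request; when the model's reply contains a `tool_calls` entry named `get_weather`, you run the function and append the JSON result as a `tool` message before asking the model to continue.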


Thanks!

I'd recommend it too, because the knowledge cutoff of all the open weight Chinese models (M2.7, Qwen3.5, GLM-5 etc) is earlier than you'd think, so giving it web search (I use `ddgr` with a skill) helps a surprising amount

Yep, having a "stupid" central model with multiple tools is IMO the key to efficient agentic systems.

It needs to be just smart enough to use the tools and distill the responses into something usable. And one of the tools can be "ask claude/codex/gemini" so the local model itself doesn't actually need to do much.


> Yep, having a "stupid" central model with multiple tools is IMO the key to efficient agentic systems.

That doesn't fix the "you don't know what you don't know" problem which is huge with smaller models. A bigger model with more world knowledge really is a lot smarter in practice, though at a huge cost in efficiency.


I've always wondered where the inflection point lies between, on the one hand, training the model on all kinds of data such as Wikipedia/encyclopedias, versus pointing the system prompt at your local versions of those data sources, perhaps even through a search-like API/tool.

Is there already some research or experimentation done into this area?


The training gives you a very lossy version of the original data (the smaller the model, the lossier it is; very small models will ultimately output gibberish and word salad that only loosely makes some sort of sense) but it's the right format for generalization. So you actually want both, they're highly complementary.

That's the key, it just needs to be smart enough to 1) know it doesn't know and 2) "know a guy" as they say =) (call a tool for the exact information)

Picking a model that's juuust smart enough to know it doesn't know is the key.


Oh does llama.cpp use MLX or whatever? I had this question, wonder if you know? A search suggests it doesn’t but I don’t really understand.

>Oh does llama.cpp use MLX or whatever?

No. It runs on MacOS but uses Metal instead of MLX.


ANE-powered inference (at least for prefill, which is a key bottleneck on pre-M5 platforms) is also in the works, per https://github.com/ggml-org/llama.cpp/issues/10453#issuecomm...

Is that better or worse?

Depends.

MLX is faster because it has better integration with Apple hardware. On the other hand GGUF is a far more popular format so there will be more programs and model variety.

So its kinda like having a very specific diet that you swear is better for you but you can only order food from a few restaurants.


But you can always fall back to GGUF while waiting for the world to build a few more MLX restaurants. Or something like that; the analogy is a bit stretched.

Yeah I'm terrible with analogies.

llama.cpp uses GGML which uses Metal directly.

Have you played around with any of the Hermes models? They are supposed to be some of the best at non-refusal while staying sane.

Interesting! Unfortunately, the smallest Hermes 4 model I can see is 14B, which would really strain the limits of my little laptop. The only way I might get acceptable performance would be to run it extremely quantised, but then I probably wouldn't see much improvement over the 9B Qwen.

Cool, I always wanted to invade Belgium. Maybe if my plan is good, I could run a successful gofundme?

Hey, if Margaret Thatcher's son can give it a go, why not you? Believe in yourself and reach for those dreams. *sparkle emoji*

I have journaled digitally for the last 5 years with this expectation.

Recently I built a graphRAG app with Qwen 3.5 4b for small tasks like classifying what type of question I am asking or the entity extraction process itself, as graphRAG depends on extracted triplets (entity1, relationship_to, entity2). I used Qwen 3.5 27b for actually answering my questions.

It works pretty well. I have to be a bit patient but that’s it. So in that particular use case, I would agree.

I used MLX and my M1 64GB device. I found that MLX definitely works faster when it comes to extracting entities and triplets in batches.
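As a rough illustration of the small-model side of that split (the prompt wording and output format here are my own, not the commenter's setup): the cheap model only has to emit line-oriented triplets, and a forgiving parser drops malformed lines instead of failing the whole batch.

```python
import re

# Hypothetical extraction prompt for the small model; answering questions
# over the resulting graph is left to the larger model.
TRIPLET_PROMPT = """Extract (entity1, relationship, entity2) triplets from the
note below. Output one triplet per line as: (entity1 | relationship | entity2)

Note:
{note}
"""

# Matches "(e1 | rel | e2)" with flexible whitespace around the parts.
TRIPLET_RE = re.compile(r"\(\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*([^|)]+?)\s*\)")

def parse_triplets(model_output: str) -> list:
    """Parse line-oriented model output into (e1, rel, e2) tuples,
    silently skipping malformed lines rather than failing the batch."""
    return [m.groups() for m in TRIPLET_RE.finditer(model_output)]
```

Tolerant parsing matters more with small models than with frontier ones, since a 4B model will occasionally botch the output format even when the extraction itself is fine.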


Did you get any insights about yourself from this process? I am thinking of doing the same

TL;DR: you don't need to do any treasure hunt on your notes by just typing stuff into the search bar. Having your own graphRAG system + LLM on your notes is basically a "Google" but then on your own notes. Any question you have: if you have a note for it, it will bubble up. The annoying thing is that false positives will also bubble up.

----

Full reaction:

Yes but perhaps not in a way you might expect. Qwen's reasoning ability isn't exactly groundbreaking. But it's good enough to weave a story, provided it has some solid facts or notes. GraphRAG is definitely a good way to get some good facts, provided your notes are valuable to you and/or contain some good facts.

So the added value is that you now have a supercharged information retrieval system on your notes, with an LLM that can stitch loose facts together reasonably well, like a librarian would. It's also very easy to spot hallucinations if you recognize your own writing well, which I do.

The second thing is that I have a hard time rereading all my notes. I write a lot of notes and don't have the time to reread any of them, so oftentimes I forget my own advice. Now that I have a supercharged information retrieval system on my notes, whenever I ask a question the graphRAG + LLM search for the most relevant notes related to it. I've found that 20% of what I wrote is incredibly useful and is stuff that I had forgotten.

And there are nuggets of wisdom in there that are quite nuanced. For me specifically, I've seen insights in how I relate to work that I should do more with. I'll probably forget most things again but I can reuse my system and at some point I'll remember what I actually need to remember. For example, one thing I read was that work doesn't feel like work for me if I get to dive in, zoom out, dive in, zoom out. Because in the way I work as a person: that means I'm always resting and always have energy for the task that I'm doing. Another thing that it got me to do was to reboot a small meditation practice by using implementation intentions (e.g. "if I wake up then I meditate for at least a brief amount of time").

What also helps is to have a bit of a back and forth with your notes and then copy/paste the whole conversation in Claude to see if Claude has anything in its training data that might give some extra insight. It could also be that it just helps with firing off 10 search queries and finds a blog post that is useful to the conversation that you've had with your local LLM.


"Most users don't need frontier model performance": unfortunately, this is not the case.

It depends. If they're using a small/medium local model as a 1:1 ChatGPT replacement as-is, they'll have a bad time. Even ChatGPT refers to external services to get more data.

But a local model + good harness with a robust toolset will work for people more often than not.

The model itself doesn't need to know who was the president of Zambia in 1968, because it has a tool it can use to check it from Wikipedia.
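A minimal sketch of such a tool (function names are hypothetical), using Wikipedia's real REST summary endpoint: the model calls it with a title and gets back the article's lead paragraph to ground its answer.

```python
import json
import urllib.parse
import urllib.request

def wiki_summary_url(title: str, lang: str = "en") -> str:
    # Wikipedia's REST summary endpoint returns JSON whose `extract`
    # field holds the article's lead paragraph.
    slug = urllib.parse.quote(title.replace(" ", "_"))
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{slug}"

def wiki_lookup(title: str) -> str:
    """The tool the local model calls: fetch the lead paragraph so the
    model answers from live data instead of stale or memorized weights."""
    req = urllib.request.Request(
        wiki_summary_url(title),
        headers={"User-Agent": "local-llm-tool/0.1"},  # placeholder UA
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("extract", "")
```

Exposed through the same OpenAI-style tool schema a llama-server harness uses, this lets a model answer the Zambia-in-1968 kind of question without having memorized it.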


You can install the complete text of Wikipedia locally too.

They've usually been intended for ereader/off-grid/post-zombie-apocalypse situations, but I'd guess someone is already working on an LLM-friendly way to install it.

It'd be interesting to know the tradeoffs. The Tiananmen Square example suggests why you'd maybe want the knowledge facts to come from a separate source.


The Wikipedia folks are now working on implementing a language-independent representation for their encyclopedic content - one that's intended to be rigorously compositional and semantics-aware, loosely comparable to Universal Meaning Representation (UMR) as known in the linguistics domain, that - if successful - may end up interacting in very interesting ways with multi-language capable LLMs. Very early experiments (nowhere near as capable as UMR as of yet, but experimenting with the underlying software infrastructure) are at https://abstract.wikipedia.org , whilst a direct comparison of the projected design is given by https://commons.wikimedia.org/wiki/File:Abstract_Wikipedia_N... https://elemwala.toolforge.org/static/nlgsig-nov2025.html

Any citations? Because that was my impression, too. I want frontier model performance for my coding assistant, but "most users" could do with smaller/faster models.

ChatGPT free falls back to GPT-5.2 Mini after a few interactions.


Have you used GPT instant or mini yourself? I think it’s pretty cynical to assume that this is “good enough for most people”, even if they don’t know the difference between that and better models.

> I think it’s pretty cynical to assume that this is “good enough for most people”

It's a deduction, not an assumption. Obviously it's "good enough" for "most people". Otherwise nobody would be using the free version of ChatGPT today.

I pay for a Claude subscription, but even then I sometimes downgrade to Sonnet or even Haiku when I need a quick answer.


> Obviously it's "good enough" for "most people". Otherwise nobody would be using the free version of ChatGPT today.

I'd say it's better than nothing, which to me is not the same thing at all as "good enough".

For example, I believe most people would be better off with half the allowable queries per day, routed to a better model, but that's not an available product.


Say more. Why do you think this?

They're awful and hallucinate a lot, I couldn't imagine using it even for prompts about TV shows, even less so for serious work. Repeating the question from the parent, have you tried those yourself? Even compared to ChatGPT Thinking, they're short of useless.

They're essentially replying based on vibes, instead of grounding their responses in extensive web searches, which is what the paid models/configurations generally do. This makes them wrong more often than they're right for anything but the most trivial requests that can be easily responded to out of memorized training data.

This is all on top of the (to me) insufferable tone of the non-thinking models, but that might well be how most users prefer to be talked to, and whether that's how these models should accordingly talk is a much more nuanced question.

Regardless of that, everybody deserves correct answers, even users on the free tier. If this makes the free tier uneconomical to serve for hours on end per user per day, then I'd much rather they limit the number of turns than dial down the quality like that.


Frontier models have much better knowledge and they usually hallucinate less. It's not about coding capabilities, it's about how much you can trust the model.

re: trust-

Have you tried the free version of ChatGPT? It is positively appalling. It’s like GPT 3.5 but prompted to write three times as much as necessary to seem useful. I wonder how many people have embarrassed themselves, lost their jobs, and been critically misinformed. All easy with state-of-the-art models but seemingly a guarantee with the bottom sub-slop tier.

Is the average person just talking to it about their day or something?


Even the paid version of ChatGPT tends to use a thousand words when ten will do.

You can try asking it the same question as Claude and compare the answers. I can guarantee you that the ChatGPT answer won't fit on a single screen on a 32" 4k monitor.

Claude's will.


I use the free version of ChatGPT (without logging in) when I need some one-off question without a huge context. Real world prompt:

  "when hostapd initializes 80211 iface over nl80211, what attributes correspond to selected standard version like ax or be?"
It works fine and avoids falling into the trap set by the misleading question. It probably works even better for more popular technologies. Yeah, it has higher failure rates, but that's not a dealbreaker for non-autonomous use cases.

If someone blindly submits chatbot output they deserve to be embarrassed and fired. But I don't think that's going to improve.

The free version of ChatGPT is insanely crippled, so that's not surprising.

> unfortunately, this is not the case

Most users are fixing grammar/spelling, summarising/converting/rewriting text, creating funny icons, and looking up simple facts; all of this is far from requiring frontier model performance.

I have a feeling that if/when Apple releases its onboard LLM/Siri improvements that can call out to the cloud if needed, the vast majority of people will be happy with what they get for free running on their phone.


“You are the smartest high school student that has ever lived and on the college track to Harvard or another Ivy League school. Write a 10 page history term paper about Tiananmen Square and the specific events that took place there. Include a bibliography and use footnotes to cite sources.”

eh, it's weird how the tech world wants to build trillions of dollars' worth of data centers for... what, escaping the permanent underclass?

I think the "need" you speak of is a bit of a loaded statement.


"Hey dingus, set timer for 30 minutes"



Complaining about downvotes is futile and also against HN guidelines.

I'm not complaining "about downvotes" LOL I'm explaining why some people will be replaced by LLMs because of their own "context window" length.

I’ve been using google search AI and Gemini, which I find generally pretty good. In the past week, Gemini and Search AI have been bringing in various details of previous searches I’ve done and Search AI conversations I’ve had and it’s extremely gross and creepy.

I was looking for details about cars and it started interjecting how the safety would affect my children by name in a conversation where I never mention my children. I was asking details about Thunderbolt and modern Ryzen processors and a fresh Gemini chat brought in details about a completely unrelated project I work on. I’ve always thought local LLMs would be important, but whatever Google did in the past few weeks has made that even more clear.


It's Personal Intelligence in the Gemini settings. I just turned that off last night when it was doing similar things.

> solves the problem of too much demand for inference compared to data center supply

Maybe in the distant future when device compute capacity has increased by multiples and efficiency improvements have made smaller LLMs better.

The current data center buildouts are using GPU clusters and hybrid compute servers that are so much more powerful than anything you can run at home that they’re not in the same league. Even among the open models that you can run at home if you’re willing to spend $40K on hardware, the prefill and token generation speeds are so slow compared to SOTA served models that you really have to be dedicated to avoiding the cloud to run these.

We won’t be in a data center crunch forever. I would not be surprised if we have a period of data center oversupply after this rush to build out capacity.

However at the current rate of progress I don’t see local compute catching up to hosted models in quality and usability (speed) before data center capacity catches up to demand. This is coming from someone who spends more than is reasonable on local compute hardware.


Depending on the use case, the future is already here.

For example, last week I built a real-time voice AI running locally on iPhone 15.

One use case is for people learning to speak English. The STT is quite good and the small LLM is enough for basic conversation.

https://github.com/fikrikarim/volocal


That’s awesome! I’ve got a similar project for macOS/iOS using the Apple Intelligence models and on-device STT Transcriber APIs. Do you think the models you’re using could be quantized further so that they could be downloaded on first run using Background Assets? Maybe we’re not there yet, but I’m interested in a better, local Siri like this with some sort of “agentic lite” capabilities.

> Do you think the models you’re using could be quantized further so that they could be downloaded on first run using Background Assets?

I first tried the Qwen 3.5 0.8B Q4_K_S and the model couldn't hold a basic conversation, though I haven't tried lower quants on the 2B.

I'm also interested in the Apple Foundation models, and that's something I plan to try next. AFAIK they're on par with Qwen-3-4B [0]. The biggest upside, as you alluded to, is that you don't need to download them, which is huge for user onboarding.

[0] https://machinelearning.apple.com/research/apple-foundation-...


Subjectively, AFM isn’t even close to Qwen. It’s one of the weakest models I’ve used. I’m not even sure how many people have Apple Intelligence enabled. But I agree, there must be a huge onboarding win long-term using (and adapting) a model that’s already optimized for your machine. I’ve learned how to navigate most of its shortcomings, but it’s not the most pleasant to work with.

Try it with mxfp8 or bf16. It's a decent model for doing tool calling, but I wouldn't recommend using it with 4 bit quantization.

Brilliant. Hope to see you in the App Store!

Oh thank you! I wasn’t sure if it was worth submitting to the app store since it was just a research preview, but I could do it if people want it.


It feels like you'll soon need a local llm to intermediate with the remote llm, like an ad blocker for browsers to stop them injecting ads or remind you not to send corporate IP out onto the Internet.

I'd like to coin the term "user agent" for this

"copilot" seems a good term

could also be considered a triage layer


Not sure about the using less electricity part. With batching, it’s more efficient to serve multiple users simultaneously.

Indeed. Data centers have so many ways and reasons to be much more energy-efficient than local compute it's not even funny.

They do, though I don’t think they max out on energy efficient technology. It’s much easier to cut a deal for cheap electricity with a regional government, much to the chagrin of the locals (who see their power bills go up).

Obviously Apple would prefer this. It would boost demand for more powerful and expensive devices and align with their privacy marketing. But they have massively fumbled Siri for a long time and then missed huge deadlines on their AI promises. Despite having billions, they have shown no competency in delivering services or accurately marketing what to expect from AI features.

> Most users don't need frontier model performance.

SSD weights offload makes it feasible to run SOTA local models on consumer or prosumer/enthusiast-class platforms, though with very low throughput (the SSD offload bandwidth is a huge bottleneck, mitigated by having a lot of RAM for caching). But if you only need SOTA performance rarely and can wait for the answer, it becomes a great option.


I see a lot of people are confused about the electricity claim so I'll elaborate on it more. The assumption I'm making here is that on device people will run smaller models, that can fit on their machines without needing to buy new computers. If everyone ran inference on their machine there would be no need for these massive datacenters which use huge quantities of electricity. It would utilize the machines they already have and the electricity they're already using.

People are comparing the cost per inference or per token or whatever and saying datacenters are more efficient, which makes obvious sense. What I'm saying is that if we eliminate the need for building out dozens of gigawatt datacenters entirely, then we would use less electricity. I feel like this makes intuitive sense. People are getting lost in the details of cost per inference and performance of different models.


But in the cloud an LLM can consult 50 websites, which is super fast for datacenters since they sit on the internet backbone; on your device you'll have to wait much longer for those websites before getting the LLM response. Am I wrong?

As things stand today, even for research tasks, the time spent by the model is much greater than the time spent fetching websites. I don't see that changing any time soon, except if some deals happen behind the scenes where agents get access to CF-guarded resources that are normally blocked from automated access.

While data centres indeed have awesome internet connectivity, don’t forget the bandwidth is shared by all clients using a particular server.

If you have a 100 Mbit/s internet connection at home and a computer in a data centre has 10 Gbit/s but is serving 200 concurrent clients, your home bandwidth is twice each client's 50 Mbit/s share.


You could argue that the only reason we have good open-weight models is that companies are trying to undermine the big dogs, spending millions to make sure they don't get too far ahead. If the bubble pops, there won't be an incentive to keep doing it.

I agree. I can totally see open source LLMs in the future turning into paying a lump sum for the model. Many will shut down. Some will turn into closed source labs.

When VCs inevitably ask their AI labs to start making money or shut down, those free open source LLMs will cease to be free.

Chinese AI labs have to release free open source models because they distill from OpenAI and Anthropic. They will always be behind. Therefore, they can't charge the same prices as OpenAI and Anthropic. Free open source is how they can get attention and how they can stay fairly close to OpenAI and Anthropic. They have to distill because they're banned from Nvidia chips and TSMC.

Before people tell me Chinese AI labs do use Nvidia chips, there is a huge difference between using older gimped Nvidia H100 (called H20) chips or sneaking around Southeast Asia for Blackwell chips and officially being allowed to buy millions of Nvidia's latest chips to build massive gigawatt data centers.


> have to release free open source models because they distill from OpenAI and Anthropic

They don't really have to, though; they just need to be good enough and cheaper (even if distilled). That being said, it is true they are gaining a lot of visibility (especially Qwen) because of being open-source(-weight).

Hardware-wise they seem likely to catch up in 3-5 years (Nvidia is kind of irrelevant; what matters is the node).


I highly doubt they can catch up in 3-5 years to Nvidia.

Chips take about 3 years to design. Do you think China will have Feynman-level AI systems in 3 years?

I think in 3 years, they'll have H200-equivalent at home.


You must have an inside line on information for 'China' -- those are bold predictions!

No inside line needed. Just look at the chip node tech.

“They will always be behind”

Car manufacturers said the same.


It did take decades to catch up to and surpass US carmakers, right?

About 2.5 decades from the start of the JVs, but they did it. Semiconductors and jet turbines are really the last two tech trees that China has yet to master.

Right. When I said "they'll always be behind", I meant in the next 5-10 years. They're gated by EUV tech. And once they have EUV tech, they need to scale up chip manufacturing.

You will always be wrong.

I've been right far more than wrong on this stuff. :)

Right. When I said "you'll always be wrong", I meant you're sometimes right.

Which might they master first?

Both are hard nuts but China is throwing massive amounts of money at the problem. They can already get performance or economy from each, they just need to figure out how to get both at the same time.

This seems to be somewhat similar to web browsers.

I could see the model becoming part of the OS.

Of course Google and Microsoft will still want you to use their models so that they can continue to spy on you.

Apple, AMD and Nvidia would sell hardware to run their own largest models.


You can have viable business model around open weight models where you offer fine tuning at a fee.

Man I really hope so. As much as I like Claude Code, I hate the company paying for it and tracking your usage, the bullshit management control, etc. I feel like I'm training my replacement. Things feel like they're tightening vs. more power and freedom.

On device I would gladly pay for good hardware - it's my machine and I'm using as I see fit like an IDE.


When local LLMs get good enough for you to use delightfully, cloud LLMs will have gotten so much smarter that you'll still use it for stuff that needs more intelligence.

True, but I'm already producing code/features faster than the company knows what to do with (even though every company says "omg we need this yesterday", etc.). Even coding before AI, it was basically the same.

Code tools that free my time up are very nice.


That's not necessarily the case. So far, commercial cloud LLMs have maintained a head-start, but there is no law of nature that prevents us from having competitive open models.

In fact the space seems to move at a rapid pace as more and more specialized models come out. There's a possible trajectory where open weight models will compete side by side or even be preferable for many use cases, just like what happened with OS's and SQL DB's.


> it also would use less electricity

How would it use less electricity? I’d like to learn more.


That's completely not true. LLM on device would use MORE electricity.

Service providers that do batch>1 inference are a lot more efficient per watt.

Local inference can only do batch=1 inference, which is very inefficient.
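A toy roofline model can show why this is the case (all numbers here are hypothetical, assuming decode is purely memory-bandwidth bound):

```python
# Toy model, not a benchmark: each decode step must stream the full
# model weights from memory once, and a batch of B concurrent requests
# yields B tokens for that single pass over the weights.

def tokens_per_second(bandwidth_gb_s, model_gb, batch):
    passes_per_second = bandwidth_gb_s / model_gb
    return passes_per_second * batch

# Hypothetical numbers: 16 GB of weights on laptop-class memory vs. an
# HBM-class server part serving 64 requests at once.
local = tokens_per_second(bandwidth_gb_s=120, model_gb=16, batch=1)
server = tokens_per_second(bandwidth_gb_s=3350, model_gb=16, batch=64)

print(f"local,  batch=1 : ~{local:.1f} tok/s")   # ~7.5 tok/s
print(f"server, batch=64: ~{server:.0f} tok/s")  # ~13400 tok/s
```

Same weight traffic per decode step, 64x the tokens out, which is where the per-watt gap comes from. (Real systems add compute limits, KV-cache traffic, and scheduling overheads this sketch ignores.)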


It will probably be a future. My guess is that for many businesses it will still make sense to have more powerful models and to run them centralized in a datacenter. Also, by batching queries you can get efficiencies at scale that might be hard to replicate locally. I can also see a hybrid approach where local models get good at handing off to cloud models for complex queries.

> For many businesses it will still make sense to have more powerful models and to run them centralized in a datacenter.

Agree, and I think of it this way: for a lot of businesses, it already makes sense to have a bunch of more powerful computers and run them centralized in a datacenter. Nevertheless, most people at most companies do most of their work on their Macbook Air or Dell whatever. I think LLMs will follow a similar pattern: local for 90% of use cases, powerful models (either on-site in a datacenter or via a service) for everything else.


> would use less electricity

Sorry to shatter your bubble, but this is patently false, LLMs are far more efficient on hardware that simultaneously serves many requests at once.

There's also the (environmental and monetary) cost of producing overpowered devices that sit idle when you're not using them, in contrast to a cloud GPU, which can be rented out to whoever needs it at a given moment, potentially at a lower cost during periods of lower demand.

Many LLM workloads aren't even that latency sensitive, so it's far easier to move them closer to renewable energy than to move that energy closer to you.


> LLMs are far more efficient on hardware that simultaneously serves many requests at once.

The LLM inference itself may be more efficient (though this may be impacted by different throughput vs. latency tradeoffs; local inference makes it easier to run with higher latency) but making the hardware is not. The cost for datacenter-class hardware is orders of magnitude higher, and repurposing existing hardware is a real gain in efficiency.


Seems doubtful. The utilisation will be super high for data center silicon whereas your PC or phone at home is mostly idle.

> your PC or phone at home is mostly idle

If you're purely repurposing hardware that you need anyway for other uses, that doesn't really matter.

(Besides, for that matter, your utilization might actually rise if you're making do with potato-class hardware that can only achieve low throughput and high latency. You'd be running inference in the background, basically at all times.)


I'm actually not sure that's true. Apart from people buying the device with or without the neural accelerator, the perf/watt could be on par or better with the big iron. The efficiency sweet-spot is usually below the peak performance point, see big.little architectures etc.

Well this is an article about running on hardware I already have in my house. In the winter that’s just a little extra electricity that converts into “free” resistive heating.

> Sorry to shatter your bubble, but this is patently false, LLMs are far more efficient on hardware that simultaneously serves many requests at once.

You might want to read this: https://arxiv.org/abs/2502.05317v2


We think so too! That's why we are building rig.ai. With how token-intensive coding tasks can be, local allows for unlimited inference. Much better fit than sending back and forth to a third party. Not to mention the privacy and security benefits.

Rig sounds cool, I just joined the waitlist! I’m building something similar although with a much narrower purpose. Excited to learn more

Tell me more! Thanks for the waitlist

Sent a LinkedIn request. I'm building a language-specific coding agent using Apple Intelligence with custom adapters. It's more a proof-of-concept at this point, but basic functionality actually works! The 4K context window is brutal, but there's a variety of techniques to work around it: tighter feedback loops, linters, LSPs, and other tools to vet generated code, plus mechanisms for on-device or web-based API discovery. My hypothesis is that if all this can work "well enough" for one language/runtime, it could be adapted for N languages/runtimes.

Have you spent more than 10 min actually running LLM on a local machine?

As it stands today, local LLMs don't work remotely as well as some people try to picture them, in almost every way -- speed, performance, cost, usability etc. The only upside is privacy.


I agree with you in the sense that if you tried to take any model right now and cram it into an iPhone, it wouldn't be a Claude-level agent.

I run 32B agents locally on a big video card, and smaller ones on CPU, but what's lacking there isn't the logic or reasoning, it's the chain of tooling that Claude Code and other stacks have built in.

Doing a lot of testing recently with my own harness, you would not believe the quality improvement you can get from a smaller LLM with really good opening context.

Even Microsoft is working on 1-bit LLMs...it sucks right now, but what about in 5 years?

But the OP is correct -- everything will have an LLM on it eventually, much sooner than people who do not understand what is going on right now would ever believe is possible.


Yes. I've spent months running Qwen2.5-8B on my barebones 16GB RAM M4 Mac mini to handle identifying sites from Google search results. It has been rock solid. I'm not even running this MLX-powered improvement on it yet.

Your idea of what people need from local LLMs and others' are different. Not everybody needs /r/myboyfriendisai-level performance.


You probably want to double check the comment I was responding to.

It isn't going to replace cloud LLMs since cloud LLMs will always be faster in throughput and smarter. Cloud and local LLMs will grow together, not replace each other.

I'm not convinced that local LLMs use less electricity either. Per token at the same level of intelligence, cloud LLMs should run circles around local LLMs in efficiency. If it doesn't, what are we paying hundreds of billions of dollars for?

I think local LLMs will continue to grow and there will be an "ChatGPT" moment for it when good enough models meet good enough hardware. We're not there yet though.

Note, this is why I'm big on investing in chip manufacturing companies. Not only are they completely maxed out due to cloud LLMs, but soon they will be doubly maxed out having to replace local computer chips with ones suited for AI inference. This is a massive transition and will fuel another chip manufacturing boom.


Yep. People were claiming DeepSeek was "almost as good as SOTA" when it came out. Local will always be one step away like fusion.

It's just wishful thinking (and hatred towards American megacorps). Old as the hills. Understandable, but not based on reality.


Don’t try to draw trend lines for an industry that has existed for <5 years.

We are 100% there already. In browser.

The WebGPU model in my browser on my M4 Pro MacBook was as good as ChatGPT 3.5 and doing 80+ tokens/s.

Local is here.


Sir, ChatGPT 3.5 is more than 3 years old, running on your bleeding edge M4 Pro hardware, and this only proves the previous commenter's point.

It works really well for "You're a helpful assistant / Hi / Hello there, how may I help you today?" Anything else (esp. in a non-EN language) and you will see the limitations yourself. Just try it.

Local RTX 5090 is actually faster than A100/H100.

It's a $4,000 GPU with 32GB of VRAM and needs a 1,000 watt PSU. It's not realistic for the masses.

If it has something like 80GB of VRAM, it'll cost $10k.

The actual local LLM chip is Apple Silicon starting at the M5 generation with matmul acceleration in the GPU. You can run a good model using an M5 Max 128GB system. Good prompt processing and token generation speeds. Good enough for many things. Apple accidentally stumbled upon a huge advantage in local LLMs through unified memory architecture.

Still not for the masses and not cheap and not great though. Going to be years to slowly enable local LLMs on general mass local computers.
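A rough sketch of the memory arithmetic behind the 128GB unified-memory point, counting weights only and ignoring KV cache, activations, and runtime overhead:

```python
def model_memory_gb(params_billions, bits_per_weight):
    # Weights only: params * (bits / 8) bytes each. The billions in the
    # parameter count and the bytes-to-GB conversion cancel out.
    return params_billions * bits_per_weight / 8

# Hypothetical model sizes and quantization levels.
for params, bits in [(8, 4), (32, 4), (70, 4), (120, 8)]:
    gb = model_memory_gb(params, bits)
    fits = "fits" if gb <= 128 else "does not fit"
    print(f"{params}B @ {bits}-bit: {gb:.0f} GB ({fits} in 128 GB unified memory)")
```

Under these assumptions even a 70B model at 4-bit (~35 GB of weights) leaves plenty of headroom in a 128GB machine, while a discrete GPU would need that much VRAM to do the same.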


Yes, it’s expensive hobby.

Crazy thing to say without other contextual information - it obviously depends on a number of factors. Do you have an apples to apples comparison at hand?

Look it up.

Looking at downvotes I feel good about SDE future in 3-5 years. We will have a swamp of "vibe-experts" who won't be able to pay 100K a month to CC. Meanwhile, people who still remember how to code in Vim will (slowly) get back to pre-COVID TC levels.

What is CC and TC? I have not heard these abbreviations (except for CC to mean credit card or carbon copy, neither of which is what I think you mean here).

I figured it out from context clues

CC: Claude Code

TC: total comp(ensation)


Thank you for clarifying! (I had no idea it needs to be explained, sorry.)

[flagged]


Yea I get that there will always be demand for local waifus. I never said local LLMs won't be a thing. I even said it will be a huge thing. Just won't replace cloud.

> It's just a matter of getting the performance good enough.

Who will pay for the ongoing development of (near-)SoTA local models? The good open-weight models are all developed by for-profit companies - you know how that story will end.


Apple via customers paying for the whole solution ( eg a laptop that can run decent local models )?

I think Apple had something in the region of 143 billion in revenue in the last quarter.

Not saying it will happen - just that there are a variety of business models out there and in the end it all depends on where consumers put their money.


LLM in silicon is the future. It won't be long until you can just plug an LLM chip into your computer and talk to it at 100x the speed of current LLMs. Capability will be lower but their speed will make up for it.

You can always delegate sub agents to cloud based infrastructure for things that need more intelligence. But the future indeed is to keep the core interaction loop on the local device always ready for your input.

A lot of stuff that we ask of these models isn't all that hard. Summarize this, parse that, call this tool, look that up, etc. 99.999% really isn't about implementing complex algorithms, solving important math problems, working your way through a benchmark of leet programming exercises, etc. You also really don't need these models to know everything. It's nice if it can hallucinate a decent answer to most questions. But the smarter way is to look up the right answer and then summarize it. Good enough goes a long way. Speed and latency are becoming a key selling point. You need enough capability locally to know when to escalate to something slower and more costly.
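A minimal sketch of that escalation idea; `local_generate` and `cloud_generate` are hypothetical stand-ins, not real APIs:

```python
# Hypothetical local-first router: answer with the small on-device model
# when it is confident, escalate to a cloud model otherwise.

def local_generate(prompt: str) -> dict:
    # Stand-in for an on-device model; pretend it is only confident on
    # short, simple prompts.
    if len(prompt.split()) < 50:
        return {"text": f"local answer: {prompt}", "confidence": 0.9}
    return {"text": "", "confidence": 0.2}

def cloud_generate(prompt: str) -> dict:
    # Stand-in for a slower, smarter cloud call.
    return {"text": f"cloud answer: {prompt}", "confidence": 0.99}

def answer(prompt: str, threshold: float = 0.5):
    result = local_generate(prompt)
    if result["confidence"] >= threshold:
        return "local", result["text"]
    return "cloud", cloud_generate(prompt)["text"]

print(answer("Summarize this paragraph")[0])  # local
print(answer(" ".join(["word"] * 80))[0])     # cloud
```

The interesting design question is the confidence signal itself (real systems might use token logprobs, a small classifier, or task type rather than prompt length).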

This will drive an overdue increase in the memory size of phones and laptops. Laptops especially have been stuck at the same common base level of 8-16GB for about 15 years now. Apple still sells laptops with just 8GB (their new Neo). I had a 16GB MacBook Pro in 2012; at the time that wasn't even that special. My current one has 48GB, enough for some of the nicer models. You can get as much as 256GB today.


> This will drive an overdue increase in memory size of phones and laptops.

DRAM costs are still skyrocketing, so no, I don't think so. It's more likely that we'll bring back wear-resistant persistent memory as formerly seen with Intel Optane.


Standard pig cycle in economics. Production capacity eventually goes up to meet demand and prices come down again. RAM has been going through cycles like this for decades. People seem to have no memory whatsoever of previous cycles every time it happens. Just wait a few years for it to become cheap again.

I'm expecting someone to come up with an LLM version of the Coral USB Accelerator: https://www.coral.ai/products/accelerator

Just plug in a stick in your USB-C port or add an M.2 or PCIe board and you'll get dramatically faster AI inference.


I think there are drastic differences between computer vision models and LLMs that you’re not considering. LLMs are huge relative to vision models, and require gobs of fast memory. For this reason a little USB dongle isn’t going to cut it.

Put another way, there already exist add-in boards like this, and they’re called GPUs.


GPUs are still software programmable.

An "LLM chip" does not need that and so can be much more efficient.


Sure, but that’s somewhat orthogonal to the point I was making, which is that LLMs are huge in size. Even in the case of a custom “LLM chip,” you’ll need huge amounts of very fast storage of some sort (likely DRAM), which places constraints on the size, power consumption, and cost of such a device. This device, if it existed, would not in any way resemble the Coral TPU product that the GP was referencing; I think in fact it would be closer in size, price, and form factor to a GPU.
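A toy calculation makes the size constraint concrete, assuming (worst case) that an accelerator without its own large memory had to stream the weights over its link every decode step. All figures are hypothetical round numbers:

```python
def link_tokens_per_second(model_gb, link_gbit_s):
    # If the accelerator can't hold the weights, each token costs a full
    # transfer of the model over the link.
    link_gb_s = link_gbit_s / 8
    return link_gb_s / model_gb

# Hypothetical: 4 GB of quantized weights over a USB 3.x link (10 Gbit/s)
# vs. weights resident in GPU-class HBM (~3350 GB/s = 26800 Gbit/s).
print(link_tokens_per_second(4, 10))     # streamed over USB: well under 1 tok/s
print(link_tokens_per_second(4, 26800))  # resident in HBM: hundreds of tok/s
```

Which is why such a device ends up looking like a GPU: the weights have to live in large, fast memory right next to the compute, and that memory dominates the size, power, and cost.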

It's more secure, but it would make supply much much worse.

Data centers use GPU batching, much higher utilisation rates, and more efficient hardware. It's borderline two orders of magnitude more efficient than your desktop.


That also means sending every user a copy of the model that you spend billions training. The current model (running the models at the vendor side) makes it much easier to protect that investment

This might be how Apple will start to see even more sales, the M series processors are so far ahead of anything else, local LLMs could be their main selling point.

I don't think this idea could work. There's this common misconception that our brains control our bodies, like how software can control hardware. The fact is that our brains are intrinsically connected to the rest of our body: via the central nervous system, sensory, and motor neurons. You can't just swap out our brains. It's integrated with the rest of our body in a fundamental way. If you cloned someone, the neuronal connections between the CNS and organs would not be the same, because these interconnections develop over a lifetime and are not predetermined at birth.

It also feels super unethical to me. Reminds me of "Never let me go" by Kazuo Ishiguro.


Consider also that even reattaching nerves that are supposed to be there is not exactly a walk in the park. Look into finger reattachment surgery and post operation care. Think pain, tingling, a year or more of physiotherapy.. and that's in the best case that it actually works and you don't end up with a "dead" finger. Now, imagine that for your whole body.

Yeah... Non-sentient monkey "organ sacks" as a replacement for animal testing sounds great, but those organs aren't going to function or even develop the same without a brain. At best, I think this could only be another step to filter out unsafe compounds between testing on cells and testing on whole animals. Potentially with misleading results, I imagine.

If the clone's muscles have been electrically stimulated whilst it grows, you could imagine a small device at the base of the brain stem that records which signals produce which physical responses.

If a similar device on the brain stem of the brain donor maps out their signal-response relationships, you could presumably build a translation layer that sits between the donor brain and clone body.

I agree that this probably wouldn't work though. This is more like science fiction than a serious suggestion.


Yeah, it's a lot of if's and billions of dollars for what MIGHT be a free lunch when it comes to organ replacements.

Seems like a smarter idea would be to spend that money on growing organs in a tank. There are tens or hundreds of millions of otherwise healthy people in need of a donor kidney or two, and if the body didn't reject them in the process that would be platinum sprinkles on a gold sandwich.


You don't think that the idea could work based on our current understandings. I do not believe that there is anything magical about humans that prevents us from eventually reverse engineering ourselves. To think otherwise is to acknowledge some sort of higher power that holds a special non-organic ingredient in the mix.

To be clear I think this type of work crosses a lot of ethical boundaries. But entire fields like gynecological surgery were the result of a person with no ethics doing horrific things to people without consent. Most early vaccine testing was done on orphans and the mentally handicapped.

This is ultimately what happens when the people who were cheered for "move fast and break things" start to get older and come face to face with the one thing money can't buy.


Reverse engineering complex biological systems is like reverse engineering an LLM. Everything depends upon and influences everything else. There are no clean modular segments; it's spaghetti all the way down.

Biology isn't something you can reverse engineer in its entirety with anything like the technology we have now.


> I do not believe that there is anything magical about humans that prevents us from eventually reverse engineering ourselves.

Nothing except a possibly unmanageable level of complexity. We don’t even really understand how LLMs do what they do.

Perhaps we can build an AI model that has an understanding of humans down to the level of detail being contemplated here, but that won’t mean we will understand that.

And even with that understanding, it doesn’t mean it’ll be possible to build a fully functioning human body without the equivalent of a brain. It’s likely to be more like a person in a vegetative state - they have a brain and measurable brain function, but no higher cognitive functions that we can detect.


The first vaccine by Pasteur was given to a child named Joseph Meister, with the explicit consent of his parents. Generally, the two greatest medical minds of that time (and also great rivals), Pasteur and Koch, followed the Hippocratic oath (except on themselves).

Not so much "magical," but this kind of comes across like tell me you never studied lab biology without telling me. It's very difficult to cut open all but the most trivial organisms without killing them, and there isn't any other way to observe living systems while they're still alive. Observing them while dead doesn't give you nearly enough information to reverse engineer anything. We've had to solve the cutting-open-without-killing problem to be able to do things like open heart surgery, but it currently requires a team of people who trained for at least a decade and consent from the subject, who generally isn't going to give that unless their life is already at stake.

If you can indeed just cross ethical boundaries, then sure, but mostly we've managed to purge the Josef Mengeles from societies with the technology to make this kind of thing feasible. The real world is at least not yet The X-Files, where shadowy doctors in secret quasi-government consortiums can do basically anything to living humans in the name of discovery.

"Eventually" does a lot of work in your statement in that I completely agree, provided humanity lasts long enough. Give us another ten millenia and I have no doubt damn near every sci-fi scenario ever dreamed up that doesn't require superluminal travel will probably be doable. But that means nothing at all to something trying to launch a business in this century.


> I do not believe that there is anything magical about humans that prevents us from eventually reverse engineering ourselves.

I agree, and I think we both agree that while it is conceptually possible to reverse engineer most of human biology to the point of eventually understanding how all selection pressures explain the information in the human genome, from your sentence I conclude also that we probably agree that we are far from that position as of today.

> To think otherwise is to acknowledge some sort of higher power that holds a special non-organic ingredient in the mix.

It's not so much a magical ingredient as it is not possessing a manual of the universe, nor any guarantees about the distribution of all activities and how humans with specific genomes experience different selection pressures. Our genome only accumulates an effective response to the full history of usual (and now novel) selection pressures, not a description nor a formula describing the dependence on all parameters in the face of selection pressures.

But what I believe the previous commenter refers to is not whether we ever asymptotically approach this ideal model of selection pressures, but rather that conventional research has long taught us that healthy body organs require an active life: without exercise the muscles would atrophy, as bed-ridden people suffer, and similarly for all kinds of activities, ideally in a mix that is representative of the real distribution of selective pressures.

> To be clear I think this type of work crosses a lot of ethical boundaries. But entire fields like gynecological surgery were the result of a person with no ethics doing horrific things to people without consent. Most early vaccine testing was done on orphans and the mentally handicapped.

Can you kindly link me up with references on the non-consensual gynecological surgeries? I happen to be very interested in the dark origins of medicine in general (since one could argue that healthcare is impossible to socialize: whenever we alleviate the afflictions of genetically inclined sufferers, randomly distributed in all populations, we simultaneously lift the selection pressure, inducing more such sufferers in the next generation). One doesn't have to be a Nazi to point that out, and unlike a Nazi (who intervenes by castration, genocide, etc.), a scientific moral stance is to simply not intervene: neither oppress nor help.

By what right do we alleviate each type of suffering in a few socialized-healthcare generations at the cost of inducing more suffering in many more future generations to come?


GP was likely referring to J. Marion Sims, who tested operations on enslaved women who couldn’t meaningfully consent (in an era when there was little or no anesthesia), with some women being operated on over a dozen times, performed ovary removal and clitorectomies on women at the behest of their fathers or husbands to treat “hysteria”, and was a Confederate sympathizer who spent the war over in Europe raising money and seeking diplomatic recognition for it.

He also developed several important surgical techniques and operated on cancer patients at a time when that was considered an absolute waste of time and resources, and that latter thing caused him to lose his position at the hospital he had founded, after which he started the first cancer hospital.


Anencephaly is a thing. Though those babies don't survive much past birth.

Most do not, which suggests that to survive to adulthood, more of a brain will be needed.

Some anencephalic babies survive for months, even years, and function mentally: https://www.dailymail.co.uk/news/article-2226647/Nickolas-Co...

I am very sceptical that you could create a clone with enough of a brain to survive and guarantee the clone will have no awareness.


While it does not seem to include inference done on your local computer, to me it feels like a precursor for doing so.

To my mind, inference at the edge is what will kill inference in the datacenter. Inference at the edge is more secure, faster, and uses less electricity. People share vulnerable and personal info in their chats, why share it with OpenAI who will use it to sell ads?

In a world where most of inference being done at the edge, what do we need all of these data centers for? You may say we need them to continue pre-training even bigger models. And yet, pre-training models has hit a performance plateau.

Inference in a data center never made sense. It's such a massive investment of resources when we're all carrying around computers in our pockets. As someone who values my privacy, I will start doing inference on device exclusively as soon as possible.


I really dig the editorial viewpoint of this article. New journalism style meets fun facts about engineering.


I read the title as "Disturbed Systems" and thought, yup, that checks out


Brandon Sanderson often says in interviews that "laying bricks" is the best job a writer can have. He also says being a software engineer is a particularly bad job for writers because you cannot do it on autopilot. I can confirm.

Back then, all jobs moved at a much slower pace. There was a lot more off time during work hours.


The number of submissions to high energy physics category on arXiv is double this year compared to the historical average. The author hypothesizes the increase is due to papers being written by LLMs.


> The surge of AI, large language models, and generated art begs fascinating questions. The industry’s progress so far is enough to force us to explore what art is and why we make it. Brandon Sanderson explores the rise of AI art, the importance of the artistic process, and why he rebels against this new technological and artistic frontier.

What It Means To Be Human | Art in the AI Era

https://www.youtube.com/watch?v=mb3uK-_QkOo


Do watch the video as it makes a compelling argument against this exact kind of thing. From a product design perspective, you're asking people to read a bunch of slop and organize it into slop piles. What's the point of that? Honestly it seems like a huge waste of everyone's time.


I think there's interesting work to be built on this data beyond just generating and sorting slop. I didn't build this because I enjoy having people read bad fiction. I built it because existing benchmarks for creative writing are genuinely bad and often measure the wrong things. The goal isn't to ask users to read low-quality output for its own sake. It's to collect real reader-side signal for a category where automated evaluation has repeatedly failed.

More broadly, crowdsourced data where human inputs are fundamentally diverse lets us study problems that static benchmarks can't touch. The recent "Artificial Hivemind" paper (Jiang et al., NeurIPS 2025 Best Paper) showed that LLMs exhibit striking mode collapse on open-ended tasks, both within models and across model families, and that current reward models are poorly calibrated to diverse human preferences. Fiction at scale is exactly the kind of data you need to diagnose and measure this. You can see where models converge on the same tropes, whether "creative" behavior actually persists or collapses into the same patterns, and how novelty degrades over time. That signal matters well beyond fiction, including domains like scientific research where convergence versus originality really matters.

