Run on an old HEDT platform with a lot of parallel attached storage (probably PCIe 4) and fetch weights from SSD. You'd ultimately be limited by the latency of these per-layer fetches, since MoE weights are small. You could reduce the latencies further by buying cheap Optane memory on the second-hand market.
The "active" count is not very meaningful except as a broad measure of sparsity, since the experts in MoE models are chosen per layer. Once you're streaming experts from disk, there's nothing that inherently requires having 49B parameters in memory at once. Of course, the less caching memory does, the higher the performance overhead of fetching from disk.
These are still more experiments than a polished release. And the throughput reduction is large compared to keeping the weights in RAM at all times, since you're bottlenecked by the SSD, which even at its fastest is much slower than RAM.
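The latency point can be made concrete with a toy model. At queue depth 1, each read pays the full device latency before any data moves, so a low-latency device can beat a higher-bandwidth one on small random reads. All numbers below are hypothetical but in the right ballpark:

```python
# Effective throughput of queue-depth-1 random reads: one request at a
# time, each paying the full device latency up front.
def effective_gbps(read_size_bytes, latency_s, seq_bandwidth_bps):
    t = latency_s + read_size_bytes / seq_bandwidth_bps
    return read_size_bytes / t / 1e9

# Hypothetical devices: NAND NVMe (~100 us latency, 7 GB/s sequential)
# vs Optane (~10 us latency, 2.5 GB/s sequential), 128 KiB reads.
nand_nvme = effective_gbps(128 * 1024, 100e-6, 7e9)
optane = effective_gbps(128 * 1024, 10e-6, 2.5e9)
```

With these assumed figures the Optane drive roughly doubles the NAND drive's effective small-read throughput despite much lower sequential bandwidth, which is why it's attractive for scattered per-expert fetches.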
> The EA people who decided to spend oodles of money working on AI ostensibly to prevent an insane thought experiment and then converted that effort into a for-profit corporation while insisting that actually giving money to the poor is bad because you'll have greater future utility by spending it on AI are worthy of scorn.
That was the e/acc folks though. Different acronym, even though they were a spinoff from the original EA folks.
Yes. That's the whole idea of adaptive/personalized learning.
There are some interesting ethical questions about disparity of outcomes, but we should aim to have a system that educates each individual as best we can (without leaving kids behind). The current system of bludgeoning every individual student to be the same is not working well.
Could it be that people with higher incomes are a lot more likely to actually care about their kids getting a good education, and to put pressure on the school to that effect?
> No one has mentioned defunding public education yet.
Public education has vast amounts of funding in the U.S. compared to other developed countries. If it does badly despite that, it's very likely that "more funding" is not the answer.
It's worth pointing out that wages in the US are vast compared to other developed countries, though, too. We outspend OECD by 35-40%, but our average national wage is also higher than OECD by 35-40%.
Labor compensation in the U.S. is also extremely unequal, which pulls the average up in a way that isn't very informative as to this particular issue. The average starving PhD would be a much better and more knowledgeable teacher to high school students in the subject she took her PhD in, than the typical high school teacher with nothing more than an Education credential. Are you sure that you need to pay such high wages to existing teachers?
>The average starving PhD would be a much better and more knowledgeable teacher to high school students in the subject she took her PhD in
i dont think this is true.
there is an art to educating (especially the ~10-15 year old range) that does not just manifest itself because you are smart: how to engage students, how to keep them engaged, how to adjust the message to the audience's level and communicate it effectively (which changes kid to kid), how to earn a kid's respect without becoming overbearing (or too friendly), and dozens of other things that your PhD in compsci or whatever does not teach you.
some of the smartest PhD holders i know would be very shitty elementary/high school teachers.
(context: i teach at the college level. it's a lot easier than teaching at the high school level.)
Yeah there's some truth to this - I find that my Ed students don't always have sophisticated understandings of their content area (though honestly I find that ENGR and BIOL students don't, either). But they do get more content area teaching than in ED.
ED as a field is 100% all-in on AI, too, so there's a lot of discussion amongst them about what skills in the field need to be automated and what has to stay artisanal. But I'm sympathetic to zozbot's claims too - I do think the reading scores would be higher if there were more comp/rhet specialists in sec. ed.
~10-13 mostly comprises the junior high range. By the time the kids are 14, they're plenty old enough to benefit from a "college-prep" educational approach. Sure, some PhDs will be better, others will be worse. But you solve that by throwing out terrible teachers and rewarding the best ones. There's no guarantee that an Education-credentialed teacher with negligible education in the actual subject they're supposed to teach would be any better.
I'm retired from engineering. I did startups / exited / joined difficult technical domains for the funsies / etc.
I have taught 5 years at a private school. I do not have a teaching credential.
Knowing the stuff you're teaching is the easiest part. And I say that despite teaching in an environment with far better behavior, student buy-in, family support, and academic accomplishment than most places.
I thought that when I launched a student team doing spacecraft design (selected for orbital flight on the basis of the quality of their mission, btw, not their age) that the hard part would be teaching kids about power budgets, radiation aging, and the thermal environment.
Turns out the hard part is helping them figure out how to navigate the social dynamics of talking to each other, organizing their work, realizing what other people know, and coping emotionally with setbacks. Kids will teach themselves the stuff if you have buy-in and the culture in the room is right.
Yes to this! What makes a great teacher is the willingness to hold kids accountable for their behavior and their work. Sure, it helps to be a subject expert, but that won't matter if you can't manage your classroom.
And parents play an equally important role. One of the best things you can do for your child's education/life is support the teacher when they call you up and say, "Your child is making poor decisions..."
> Sure, it helps to be a subject expert, but that won't matter if you can't manage your classroom.
I've known plenty of highly credentialed teachers that were very poor communicators and/or could not manage their classroom. I think the idea that this can be, or is, effectively taught as part of the "education major" is very suspect.
Indeed, the worst-performing school districts are precisely those where "classroom management" is a serious problem, versus better districts where the children come to school ready to be managed. It seems the older styles of classroom management, now out of vogue and no longer taught by universities, were more effective.
My first year of teaching high school mathematics was nearly a disaster; managing my classroom was a nightmare. Fortunately, winter break gave me an opportunity to step back and reflect honestly on why. I realized I was making a number of mistakes, made some necessary adjustments, and things went much better thereafter. I firmly believe the first year of teaching is when many teachers either rise up or give up.
Regarding managing kids...every school I've worked at (or my wife has worked at) has a mix of kids who are ready to learn and who need to be taught to learn. That includes districts in more wealthy areas and less wealthy areas.
In fact, my wife would tell you the students who cause the most problems in her classroom are from more affluent families. Why? Because they have entitled parents who don't hold their kids accountable and don't support the teacher.
Here in my state teachers in good districts start at $60,000 per year and see minimal increases due to length of service; after 20 years they might be making $75,000 per year. You ever done the math on living on $60k per year? Hard to do a lot besides support yourself on that income. I note that surrounding states (even higher-cost states) have lower salaries.
It depends a lot on the state. Some actually do pay alright. Some pay terribly (and may have serious issues finding enough staff, as a result).
Unions are similar. People cry about them being a huge problem, but they have effectively no power (as in: don't even collectively bargain for contracts) in lots of states, including many of the ones with poor school performance. In other states, they really do have quite a bit of power.
In states with lower teacher pay, most teachers without a much-higher-paid spouse take summer jobs or teach summer school. Also, none of them get as much time off in the summer as the kids do. Plus, you can't pay your mortgage with vacation days.
Given the (often ongoing) educational requirements, if you pro-rate it you still come out much below most positions with similar requirements. We absolutely under-pay teachers in virtually every public school.
My mother retired after working her entire career as a teacher, and I earned close to double her final salary my first year working in tech. She has her masters degree and I did not graduate college. And if you count the stocks I got at the end of that first year, it was over triple.
She was a special ed. teacher teaching emotionally disabled grade schoolers (including a first grader that tried to kill his grandmother with a tv power cord). There is no way that I worked harder than she did.
You sure they're not on 20 pay contracts? Everybody tells me "it must be so nice, getting summers off" and I'm like "actually I look for summer courses because I don't get paid."
Here average teacher salary is over $100k. Projected to be $120k by 2027 due to their new union contract.
Newbie teachers start around $70k last I checked, and hit six figures in 5-6 years.
This is roughly double median salaries.
That said, I think they earn every bit of it even with "summers off" and their relatively lucrative benefit packages. The work environment is utter shit. Basically zero ability to manage a classroom and get rid of any shitheads - with very little supportive parenting or administration having your back. Even if salaries were $500k/yr I wouldn't remotely consider taking such a job.
Pay itself though? Not an issue for one of the worst performing major urban school districts in the nation.
I'm planning on transitioning into teaching due to not being employable (apparently) in tech anymore. It's about the only career I can transition into. I wish I could make six-figures!
Move to Chicago and get a job in CPS - you'll be at ~$100k in 5-6 years!
The idea of it actually sounds initially fun to me, until I talk to friends who actually work those jobs. For my temperament I know better. At best I'd rage quit, at worst I'd end up in prison.
PhD holders are, on average, not starving. Some of them could make good primary/secondary school teachers, but knowing how to teach children effectively is a skill by itself. It's quite different from working as a college instructor. That's why earning a teaching credential is important (although the quality of some teacher training programs is terrible).
Why do you want to force your kid to study? Kids are naturally curious, it's likely that your kid will be curious about something. Introduce them to study and scholarship as a means of figuring these things out, so that it becomes natural to them.
Because _real_ study is boring. Watching videos on YouTube or playing "educational games" is not studying.
You need to repeatedly solve multiple practical problems to internalize the knowledge. And you'll eventually need to do stuff that you don't really like at all.
The proper response to that is still not to force your kid to study though. Instead, she should be made aware that in order to consistently do well, she is ultimately expected to gain the ability to force herself and defer the immediate gratification of watching a YouTube video (unless that YouTube video is a boring recorded college lecture that's relevant to her studies, I suppose).
This fails with a sizeable portion of adults... and you expect children to somehow understand delayed gratification? To force themselves? Have you ever met an actual child?
Just a week ago mine found and devoured a pack of cookies that were meant for supper and then cried at supper that there weren't any cookies left, choking on a hated apple instead.
Guess he won't do it again in a week if there's another pack of cookies within hand's reach? Guess again!
> Just a week ago mine found and devoured a pack of cookies that were meant for supper and then cried at supper that there weren't any cookies left
This was famously tested with the Stanford marshmallow experiment. A sizeable fraction of kids - preschool kids, no less - can in fact learn not to eat the marshmallow.
Can learn? -- I'm sure! Will they learn without any "encouragement"? -- Absolutely not.
There are millions of people struggling to get on a diet, to quit smoking, to stop doom-scrolling their choice of social media. Most of them know very well how good it would be to stop doing bad things and start doing something good instead, and yet they don't.
Children are worse, overall, because life isn't as harsh for them as it is for adults. An alcoholic who can't quit drinking is risking his marriage, his career, on top of his health, and still pulls out a stashed flask and takes a sip. The worst that can happen to a child is that their parents will yell at them? That's a very weak deterrent against doing something that offers instant gratification.
Children don't do anything beside simple and pleasing activities unless forced to. Studying is hard, not immediately rewarding and boring. No child likes to study, some are just more pliable.
> Kids are naturally curious
This curiosity extends to simple facts (like "who's the best soccer player: Messi or Ronaldo?"); it doesn't help with important subjects that need studying (like how to prove Pythagoras' theorem).
> Introduce them to study and scholarship as a means of figuring these things out, so that it becomes natural to them.
Children simply don't "work" like that. If you let children "figure things out", they'll "figure out" whether Messi or Ronaldo is the best soccer player, and they will never even ask why the sum of the squares of the catheti equals the square of the hypotenuse. Not only will they never ask that, they'll never figure out there are right triangles. Some discoveries in human knowledge took us many generations to make and some are trivial and easily discoverable. The former is difficult for people to learn. It requires effort that doesn't result in instant gratification. It's boring because often you will have to read the same thing many times over. It's frustrating because even after having read the same thing you still might not understand it. You have to be a masochist to like doing that. Generally, people who do it don't do it because they like it. They like the side effects (e.g. you may like being paid a professor's salary, and so you'd grind the subject you want to be a professor in).
So, it never becomes "natural" to anyone, especially not to the children outside of the very few very simple things one might learn in this way. Left to their own devices, maybe half of all the children would love to become YouTube influencers who make their living by streaming the video games they play. They'd never even consider becoming an accountant or an engineer. They wouldn't even know accountants or engineers exist.
I'm all for running large MoE models on unified memory systems, but developers of inference engines should do a better job of figuring out how to run larger-than-total-RAM models on such systems, streaming in sparse weights from SSD but leveraging the large unified memory as cache. This is easily supported with pure-CPU inference via mmap, but there is no obvious equivalent when using the GPU for inference.
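The CPU path mentioned above can be sketched in a few lines: map the weights file once and let the OS page cache keep hot experts resident. The file, offsets, and sizes below are made up so the example is self-contained; a real engine would take them from the model's tensor index.

```python
import mmap
import os
import tempfile

# Stand-in "weights file" so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))  # 1 MiB of fake weights

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # "Fetching" an expert is just a slice: the first touch faults its
    # pages in from disk; repeat touches hit the page cache for free.
    expert_offset, expert_size = 256 * 1024, 64 * 1024  # hypothetical
    expert_bytes = mm[expert_offset:expert_offset + expert_size]
    mm.close()
```

The OS then does the LRU caching for you, evicting cold experts under memory pressure; the missing piece, as noted above, is an equally transparent path into GPU memory.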
At least for the CPU/GPU split, llama.cpp recently added a `--fit` parameter (might default to on now?) that pairs with a `--fitc CONTEXTSIZE` parameter. That new feature will automatically look at your available VRAM and try to figure out a good CPU/GPU split for large models that leaves enough room for the context size that you request.
I use llama.cpp, and there is a way to do this - some layers to the (i)GPU, the rest to CPU. I was just trying this out with Kimi K2.5 (in preparation for trying it with Kimi K2.6) the other night. Check out the --n-cpu-moe flag in llama.cpp.
That said, my Strix Halo rig only has PCIe 4.0 for my NVMe, and I'm using a 990 Evo, which has poor sustained random reads, being DRAM-less. My effective read speeds from disk averaged around 1.6-2.0 GB/s. With unsloth's K2.5, even in IQ2_XXS at "just" 326 GB, I kept ~64 GB worth of layers in the iGPU and left the rest of memory free for KV cache + checkpoints. Even still, that was over 250 GB of weights streaming at ~2 GB/s, so I was getting 0.35 tok/s PP and 0.22 tok/s TG.
I could go a little faster with a better drive, or a little faster still by dropping in two of 'em in RAID0, but it would still be on the order of sub-1 tok/s PP (compute limited) and TG (bandwidth limited).
In a computer with 2 PCIe 5.0 SSDs, or one with a PCIe 5.0 SSD and a PCIe 4.0 SSD, it should be possible to stream weights from the SSDs at 20 GB/s or even more.
That is not a little faster but about 10 times faster than on your system, so a generation speed of a couple of tokens per second should be achievable.
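The claim can be sanity-checked with simple bandwidth arithmetic. A hedged sketch, assuming roughly 32B active parameters per token at ~2.1 bits/weight (an IQ2_XXS-class quant) and ignoring cache hits entirely; both assumptions are mine, not figures from upthread:

```python
# Bandwidth-bound estimate for token generation when active expert
# weights must stream from disk on every token.
def tg_toks_per_s(active_params, bits_per_weight, disk_gbps):
    bytes_per_token = active_params * bits_per_weight / 8
    return disk_gbps * 1e9 / bytes_per_token

slow = tg_toks_per_s(32e9, 2.1, 2.0)   # ~2 GB/s effective reads
fast = tg_toks_per_s(32e9, 2.1, 20.0)  # ~20 GB/s from PCIe 5.0 SSDs
```

Under these assumptions the 2 GB/s case lands near 0.24 tok/s, close to the 0.22 tok/s TG reported above, and scaling bandwidth 10x gives a bit over 2 tok/s.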
Nowadays even many NUCs or NUC-like mini-PCs have such SSD slots.
I have actually started working at optimizing such an inference system, so your data is helpful for comparison.
Strix Halo, to my knowledge, does not support PCIe 5.0 NVMe drives, unfortunately, despite it being Zen 5, and Zen 5 supporting the PCIe 5.0 standard.
While many other NUCs may support them, what most of them lack compared to Strix Halo is a 128 GB pool of unified LPDDR5x-8000 on a 256 bit bus and the Radeon 8060S iGPU with 40 CU of RDNA 3.5, which is roughly equivalent in processing power to a laptop 4060 or desktop 3060.
The Radeon 780M and Radeon 890M integrated graphics that come on most AMD NUCs don't hold a candle to Strix Halo's 8060S, and what little you'd gain in this narrow use case with PCIe gen 5, you'd lose a lot in the more common use cases of models that can fit into a 128 GB pool of unified memory, and there are some really nice ones.
Also, the speeds you're suggesting seem rather optimistic. Gen 5 drives, as I understand, hit peak speeds of about 28-30 GB/s (with two in RAID0, at 14-15 GB/s each), but that's peak sequential reads, which is neither reflective of sustained reads, nor the random read workloads that dominate reading model weights.
Maybe there are some Intel NUCs that compete in this space, which I'm less up to speed with, that do support PCIe 5.0. I know Panther Lake costs about as much to manufacture as Strix Halo, and while it's much more power efficient and achieves a lot more compute per Xe3 graphics core than Strix Halo achieves per RDNA 3.5 CU, the Panther Lake parts that are actually shipping come with so many fewer Xe3 cores that it's still a weaker system overall.
Maybe DGX Spark supports PCIe 5.0; I don't own one and am admittedly not as familiar with that platform either. Though it's worth mentioning that the price gap between Strix Halo and DGX Spark at launch ($2000 vs $4000) has closed a bit (many Strix Halo systems run $3000 now, vs $4700 for DGX Spark, and I think some non-Nvidia GB10 systems are a bit cheaper still).
While you are right about the advantages of Strix Halo, those advantages matter only as long as you can fit the entire model inside the 128 GB DRAM.
If you use a bigger model and your performance becomes limited by the SSD throughput, then a slower CPU and GPU will not affect the performance in an optimized implementation, where weights are streamed continuously from the SSDs and all computations are overlapped with that.
I have an ASUS NUC with Arrow Lake H and 2 SSDs, one PCIe 5.0 and one PCIe 4.0. I also have a Zen 5 desktop, which like most such desktops also has 2 SSDs, one PCIe 5.0 and one PCIe 4.0. Many Ryzen motherboards, including mine, allow multiple PCIe 4.0 SSDs, but those do not increase the throughput, because they share the same link between the I/O bridge and the CPU.
So with most cheap computers you can have 1 PCIe 5.0 SSD + 1 PCIe 4.0 SSD. With PCIe 4.0, it is easy to find SSDs that reach the maximum throughput of the interface, i.e. between 7 and 7.5 GB/s. For PCIe 5.0, the throughput depends on how expensive the SSD is and on how much power it consumes, from only around 10 GB/s up to the interface limit, i.e. around 15 GB/s.
With SSDs having different speeds, RAID0 is not appropriate, but the interleaving between weights stored on one SSD and on the other must be done in software, i.e. one third must be stored on the slower SSD and two thirds on the faster SSD.
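The one-third/two-thirds split generalizes: divide each shard in proportion to drive bandwidth, so both drives finish their portion of a read at the same time. A minimal sketch with illustrative speeds:

```python
# Split a weight shard across SSDs in proportion to their bandwidth,
# so drives with unequal speeds complete their reads together.
def split_proportional(total_bytes, bandwidths_gbps):
    total_bw = sum(bandwidths_gbps)
    return [int(total_bytes * bw / total_bw) for bw in bandwidths_gbps]

# e.g. a 15 GB/s PCIe 5.0 drive plus a 7.5 GB/s PCIe 4.0 drive:
parts = split_proportional(9 * 10**9, [15.0, 7.5])
```

With a 15 GB/s and a 7.5 GB/s drive, the fast drive gets two thirds of each shard and the slow drive one third, matching the split described above.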
A Zen 5 desktop with a discrete GPU is faster than Strix Halo when not limited by the main memory interface, but in the case when the performance is limited by the SSDs throughput I bet that even the Intel NUC can reach that limit and a faster GPU/CPU combo would not make a difference.
That sounds like a huge hassle for what I imagine must be peak speeds of low double digit tok/s PP and TG, even with effective prompt caching and self-ngram and all the other tricks, no?
If I really feel like I needed larger models locally (I don't, the 120/122B A10/12B models are awesome on my hardware), I think I'd rather just either pony up for a used M3 Ultra 512GB, wait for an M5 Ultra (hoping they bring back 512GB config on new setup), or do some old dual socket Xeon or Epyc 8/12-channel DDR4 setup where I can still get bandwidth speeds in the hundreds of GB/s.
What kinds of models are you running over 128GB, and what kind of speeds are you seeing, if you don't mind me asking?
Until now I have not run models that do not fit in 128 GB.
I have an Epyc server with 128 GB of high-throughput DRAM, which also has 2 AMD GPUs with 16 GB of DRAM each.
Until now I have experimented only with models that can fit in this memory, e.g. various medium-size Qwen and Gemma models, or gpt-oss.
But I am curious about how bigger models behave, e.g. GLM-5.1, Qwen3.5-397B-A17B, Kimi-K2.6, DeepSeek-V3.2, MiniMax-M2.7. I am also curious about how the non-quantized versions of the models with around 120B parameters behave, e.g. such versions of Nemotron and Qwen. It is said that quantization to 8 bits or even to 4 bits has negligible effects, but I want to confirm this with my own tests.
There is no way to test big models or non-quantized medium models at a reasonable cost other than with weights read from SSDs. For some tasks, it may be preferable to use a big model at a slow speed, if that means you need fewer attempts to obtain something useful. For a coding assistant, it may be possible to batch many tasks, which will progress simultaneously during a single pass over the SSD data.
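The batching idea can be put in numbers (all figures below are illustrative, not measurements): one sequential pass over the streamed weights serves every sequence in the batch, so aggregate decode throughput scales with batch size until compute becomes the bottleneck:

```python
# Aggregate decode throughput when one pass over the streamed weights
# is shared by every sequence in the batch; compute caps the scaling.
def batched_toks_per_s(single_stream_tps, batch, compute_limit_tps):
    return min(single_stream_tps * batch, compute_limit_tps)

# e.g. 0.25 tok/s single-stream, 32 concurrent tasks, assumed 50 tok/s
# compute ceiling: the batch turns an unusable rate into a usable one.
agg = batched_toks_per_s(0.25, batch=32, compute_limit_tps=50.0)
```

So even a sub-1 tok/s single-stream setup could approach usable aggregate throughput for batched, non-interactive workloads.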
For now I am studying llama.cpp in order to determine how it can be modified to achieve the maximum performance that could be reached with SSDs.
AIUI, the main obstacle to maximizing performance with SSD offload is that existing GGUF files for MoE models are not necessarily laid out so that fetching a single MoE layer-expert can be done by reading a single sequential extent off the file. It may be that the GGUF format is already flexible enough in its layout configuration that this is doable with a simple conversion; but if not, the GGUF specification would have to be extended to allow such a layout to be configured.
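To illustrate the kind of layout this would take (the function, field names, and sizes here are invented for the sketch, not actual GGUF fields), each (layer, expert) group of tensors could be written contiguously, so fetching one expert is one sequential (offset, length) read:

```python
# Toy extent table: lay out every (layer, expert) group back-to-back
# after a fixed-size header, so each fetch is a single sequential read.
def build_extents(n_layers, n_experts, expert_bytes, header_bytes=4096):
    extents = {}
    off = header_bytes
    for layer in range(n_layers):
        for expert in range(n_experts):
            extents[(layer, expert)] = (off, expert_bytes)
            off += expert_bytes
    return extents

ext = build_extents(n_layers=2, n_experts=4, expert_bytes=1024)
# Fetching expert 3 of layer 1 is a single (offset, length) read:
off, length = ext[(1, 3)]
```

A converter could emit such a table alongside the tensors, which is roughly what a "coalesced expert fetch" layout extension would need to standardize.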
You are right, which is why I do not intend to use a GGUF file but a set of files with a different layout, and this is why I need to make changes in llama.cpp.
If you have to come up with a custom format anyway, why not just make it a draft extension to GGUF layout definitions (something like "coalesced expert fetch" or the like) and submit it for inclusion in the standard? Then future models could be autoconverted to such a format.
I will consider doing this after I gather enough experience to determine which layout is best, and when I have enough benchmark data to support that.
Now I want to put two P5800Xs to use. I wonder how much tinkering would be necessary to mmap a RAID setup with them directly to the GPU. I'm not really busy with LLMs, more with graphics and systems, but this seems like a fun project to try out.