Thanks for the links! Hopefully this doesn't come across as confrontational (this is really something I would like to try myself), but I don't think a gpt2 arch will get close to gpt3.5-level intelligence? I feel like there was some boundary around gpt3.5 where the output started to feel slightly magical to me [maybe it was only the RLHF effect]. Do you think models at gpt2 size are reaching that capability now? I know sub-10B models have been getting really smart recently.
I think you'll be surprised by the lift Karpathy demonstrates from `fineweb.edu` vs `webtext` (he went back later and changed the `nanogpt` repository to use `openwebtext`, because the original dataset was different enough that it wasn't a good replication of GPT-2).
But from an architecture point of view, you might be surprised at how little has changed. Rotary and/or ALiBi position embeddings are useful, and there's been a ton of work on the inference-efficiency side (the MHA -> GQA -> MLA progression in attention), but you can fundamentally take a Llama, start it tractably small, and then make it bigger.
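To make the "how little has changed" point concrete, here's a minimal sketch of rotary position embeddings in NumPy. This is the non-interleaved ("split halves") variant; real implementations differ in pairing convention and caching, so treat the details as illustrative rather than a reference implementation.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Channel pairs (i, i + dim/2) are rotated by a position-dependent
    angle, so dot products between rotated queries and keys depend
    only on relative position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)           # per-pair frequency
    angles = np.arange(seq_len)[:, None] * freqs[None]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Position 0 is a zero rotation (the vector passes through unchanged), and because the rotation angle grows linearly with position, the attention score between two rotated vectors depends only on their distance — which is the whole trick.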
You can also get checkpoint weights for tons of models that are trivially competitive, and tune heads on them for a fraction of the cost of pretraining.
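The head-tuning idea fits in a few lines: freeze the pretrained backbone and train only a small head on its features. This is a toy NumPy sketch — `W_backbone` is a random stand-in for checkpoint weights (in practice you'd load real weights, e.g. from HuggingFace, and cache the activations), and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a frozen feature map.
# With real checkpoints you'd run data through the loaded model
# and never update these weights.
W_backbone = rng.normal(size=(16, 32))

def features(x):
    return np.tanh(x @ W_backbone)  # frozen: never updated

# Tiny binary task defined on the raw inputs
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(float)

# Train only the linear head (logistic regression via gradient descent)
w = np.zeros(32)
for _ in range(500):
    p = 1 / (1 + np.exp(-(features(X) @ w)))
    w -= 0.5 * features(X).T @ (p - y) / len(X)

acc = ((features(X) @ w > 0) == (y > 0.5)).mean()
```

Since only the 32-dimensional head is trained, the cost is a tiny fraction of touching the backbone — which is why tuning heads on open checkpoints is so cheap relative to pretraining.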
I hope I didn't inadvertently say or imply that you can make GPT-4 in a weekend; that's not true. But you can make models with highly comparable characteristics based on open software, weights, training sets, and other resources that are basically all on HuggingFace: you can know how it works.
GPT-2 is the one you can do completely by yourself, starting from knowing a little Python, in one day.