Thanks for the links! Hopefully this doesn't come across as confrontational (this is really something I would like to try myself), but I don't think a gpt2 arch will get close to gpt3.5-level intelligence? I feel like there was some boundary around gpt3.5 where the output started to feel slightly magical to me [maybe it was only the RLHF effect]. Do you think models at gpt2 size are reaching that capability now? I know sub-10B models have been getting really smart recently.
I think you'll be surprised by the lift Karpathy demonstrates from `fineweb.edu` vs `webtext` (he went back later and changed the `nanogpt` repository to use `openwebtext`, because the original dataset was different enough that it wasn't a good replication of GPT-2).
But from an architecture point of view, you might be surprised at how little has changed. Rotary and/or ALiBi position embeddings are useful, and there's been a ton of work on the inference-efficiency side (the MHA -> GQA -> MLA progression in attention), but you can fundamentally take a Llama, start it tractably small, and then make it bigger.
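To make the "how little has changed" point concrete, here's a minimal sketch of rotary position embeddings in NumPy. This is the non-interleaved ("split halves") variant; real implementations differ in pairing convention and caching, so treat the details as illustrative rather than a reference implementation.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Channel pairs (i, i + dim/2) are rotated by a position-dependent
    angle, so dot products between rotated queries and keys depend
    only on relative position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)           # per-pair frequency
    angles = np.arange(seq_len)[:, None] * freqs[None]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2D rotation applied to each (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Position 0 is a zero rotation (the vector passes through unchanged), and because the rotation angle grows linearly with position, the attention score between two rotated vectors depends only on their distance — which is the whole trick.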
You can also get checkpoint weights for tons of models that are trivially competitive, and tune heads on them for a fraction of the cost of pretraining.
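The head-tuning idea fits in a few lines: freeze the pretrained backbone and train only a small head on its features. This is a toy NumPy sketch — `W_backbone` is a random stand-in for checkpoint weights (in practice you'd load real weights, e.g. from HuggingFace, and cache the activations), and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a frozen feature map.
# With real checkpoints you'd run data through the loaded model
# and never update these weights.
W_backbone = rng.normal(size=(16, 32))

def features(x):
    return np.tanh(x @ W_backbone)  # frozen: never updated

# Tiny binary task defined on the raw inputs
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(float)

# Train only the linear head (logistic regression via gradient descent)
w = np.zeros(32)
for _ in range(500):
    p = 1 / (1 + np.exp(-(features(X) @ w)))
    w -= 0.5 * features(X).T @ (p - y) / len(X)

acc = ((features(X) @ w > 0) == (y > 0.5)).mean()
```

Since only the 32-dimensional head is trained, the cost is a tiny fraction of touching the backbone — which is why tuning heads on open checkpoints is so cheap relative to pretraining.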
I hope I didn't inadvertently say or imply that you can make GPT-4 in a weekend; that's not true. But you can make models with highly comparable characteristics based on open software, weights, training sets, and other resources that are basically all on HuggingFace: you can know how it works.
GPT-2 is the one you can do completely by yourself, starting from knowing a little Python, in one day.