
Does the chinchilla recipe still hold today? I got the impression that the LLaMA paper proposed a different result where throwing far more tokens at the problem had a very meaningful impact, or did I misunderstand that?


There’s discussion elsewhere in this thread about what Chinchilla actually means. I’ll only compare it to LLaMA.

TL;DR: Chinchilla isn’t wrong, it’s just useful for a different goal than the LLaMA paper.

There are three hyperparameters to tweak here: model size (parameter count), number of tokens pretrained on, and amount of compute available. End performance is, in theory, a function of these three hyperparameters.

You can think of this as a constrained optimization problem.

Chinchilla says: if you have a fixed amount of compute, here’s what model size and token count to train for maximum performance.
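For concreteness, the Chinchilla paper fits final loss as a parametric function of parameter count N and token count D, L(N, D) = E + A/N^alpha + B/D^beta, and then optimizes that fit under a compute constraint. A minimal sketch, using the paper’s published fitted constants (which, to be clear, only apply to their specific architecture and training setup):

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla's fitted scaling law: L(N, D) = E + A/N^alpha + B/D^beta.

    E is the irreducible loss; the other constants are fits specific to
    the architecture and data used in the paper.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss falls as you grow either the model or the token count, but with
# diminishing returns toward the irreducible term E:
print(chinchilla_loss(70e9, 1.4e12))   # roughly Chinchilla-scale inputs
print(chinchilla_loss(7e9, 1.4e12))    # smaller model, same tokens
```

The key point for this thread: given a compute budget, this surface is what Chinchilla minimizes over (N, D) jointly.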

A lot of the time, though, we have a fixed model size instead, because size impacts inference cost and latency. LLaMA operates in this territory: they chose to fix the model size rather than the amount of compute.
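A rough sketch of the two regimes, using the common approximation that training compute C ≈ 6·N·D and Chinchilla’s finding of roughly 20 tokens per parameter at the optimum (the 1e23-FLOP budget and the 7B size are just illustrative numbers, not from either paper):

```python
# C ~= 6 * N * D: training FLOPs as a function of params N and tokens D.
# Chinchilla's compute-optimal ratio is roughly D/N ~= 20.

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Fixed compute: jointly pick params N and tokens D.
    Solving C = 6 * N * (r * N) for N gives N = sqrt(C / (6 * r))."""
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

def fixed_size_tokens(compute_flops, n_params):
    """Fixed model size (the LLaMA regime): spend all compute on tokens."""
    return compute_flops / (6.0 * n_params)

n, d = chinchilla_optimal(1e23)
print(f"Chinchilla-optimal: {n/1e9:.0f}B params, {d/1e9:.0f}B tokens")
print(f"Fixed 7B model:     {fixed_size_tokens(1e23, 7e9)/1e9:.0f}B tokens")
```

Same budget, two answers: Chinchilla grows the model and the data together, while the fixed-size regime pours the whole budget into tokens, training far past 20 tokens per parameter.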

This could explain gaps in performance between Cerebras models of size X and LLaMA models of size X: the LLaMA models have way more compute behind them.


I don’t think it holds for two reasons.

First, it only holds for a given architecture and implementation. Obviously, a different architecture will have a different training slope. This is clear when comparing LSTMs with Transformers, but it is also true between transformers that use pre-norm, SwiGLU, and rotary positional embeddings, and those that follow Vaswani 2017.

In terms of implementation, some algorithms yield the same result with fewer effective operations (IO-aware kernels like FlashAttention and other custom CUDA work, and parallelism schemes like PaLM’s, both of which came after Chinchilla), which unambiguously affects the TFLOPs side of the Chinchilla equation. Faster algorithms and better parallelization will also reach a given loss sooner, while less power-hungry setups will do it more cheaply.
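A toy sketch of that last point: with the theoretical training FLOPs held fixed, wall-clock time and cost scale with the fraction of the hardware’s peak you actually achieve (what the PaLM paper calls model FLOPs utilization, MFU). The function and the numbers below are hypothetical illustrations, not figures from any paper:

```python
def training_days(total_flops, peak_flops_per_chip, n_chips, mfu):
    """Wall-clock days to run `total_flops` of training work.

    mfu (model FLOPs utilization) is the fraction of the hardware's
    theoretical peak that the implementation actually sustains.
    """
    seconds = total_flops / (peak_flops_per_chip * n_chips * mfu)
    return seconds / 86400.0

# Same math, same cluster; only the implementation efficiency differs.
naive = training_days(1e23, 3e14, 1024, mfu=0.25)
tuned = training_days(1e23, 3e14, 1024, mfu=0.50)
print(f"naive kernels: {naive:.1f} days, tuned kernels: {tuned:.1f} days")
```

So two runs with identical "Chinchilla FLOPs" can differ substantially in time and electricity, which is exactly why the recipe doesn’t transfer cleanly across implementations.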

Second, even in the original Chinchilla paper, some curves in Figure 2 stop early, before reaching the Pareto frontier (likely because they ran out of tokens, though LLaMA makes it seem that training for more than one epoch is fine).



