I think you're drawing an artificial distinction here. Synthetic data generation is fundamentally an extension of augmentation. When OpenAI uses expert-generated examples and curriculum-based approaches, that's literally textbook augmentation methodology. The goal of augmentation has always been to improve generalization, and scaling is just one aspect of that.
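To make that concrete: classic augmentation and LLM-driven generation are the same shape of operation, a function from seed examples to more (input, label) pairs. Rough Python sketch; `generate_with_llm` is a hypothetical stand-in for whatever expert-guided pipeline you're imagining, not a real API.

```python
import random

def augment_by_word_dropout(example):
    """Classic augmentation: perturb an existing example (here, trivial word dropout)."""
    words = example["input"].split()
    kept = [w for w in words if random.random() > 0.1] or words
    return {"input": " ".join(kept), "label": example["label"]}

def generate_with_llm(seed):
    """Hypothetical synthetic-data step: ask a generator model for a fresh example
    exercising the same skill as the seed. Stubbed out here, no real model call."""
    new_input = f"[generated problem similar to: {seed['input']}]"
    return {"input": new_input, "label": seed["label"]}

def expand_dataset(seeds, n_per_seed=2):
    """Both operations plug into the same loop: seed data in, more labeled data out."""
    out = list(seeds)
    for ex in seeds:
        for _ in range(n_per_seed):
            out.append(augment_by_word_dropout(ex))
            out.append(generate_with_llm(ex))
    return out
```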
Your concern about extrapolation is interesting, but it misses something key: when we generate synthetic data through expert demonstration or a guided curriculum, we're not trying to magically create capabilities beyond the training distribution. We're trying to better sample the actual distribution of problem-solving approaches humans use. That isn't extrapolation; it's better sampling of an existing, complex distribution!
Put another way: if you take the manifold hypothesis seriously, real data lives on a lower-dimensional manifold, and good synthetic data helps fill the sparsely sampled regions of that manifold. That naturally leads to better generalization; it's pretty well established at this point.
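A toy picture of "filling gaps on the manifold" is SMOTE-style interpolation: new points are convex combinations of a real point and one of its nearest neighbors, so they stay near the data manifold instead of wandering off it. Minimal NumPy sketch (brute-force neighbors, fine for small arrays):

```python
import numpy as np

def fill_manifold_gaps(X, n_new=100, k=5, seed=0):
    """Generate synthetic points by interpolating between each sample and one of
    its k nearest neighbors (SMOTE-style), so new points stay near the data manifold."""
    rng = np.random.default_rng(seed)
    # brute-force pairwise distances; use a KD-tree for anything large
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))            # pick a real point
        j = rng.choice(neighbors[i])        # and one of its neighbors
        t = rng.random()                    # interpolation weight in [0, 1)
        synthetic.append(X[i] + t * (X[j] - X[i]))
    return np.stack(synthetic)

# e.g. X = np.random.randn(200, 2); extra = fill_manifold_gaps(X)
```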
TBH I think you're characterizing this as some kind of blind data-multiplication scheme, but it's much closer to curriculum learning: you start with basic synthetic examples and gradually ramp up the complexity (rough sketch below). So the question isn't whether synthetic data is "real" or not, but whether it effectively helps map the underlying distribution and reasoning patterns.
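The "ramp up complexity" part really is that mechanical. A minimal sketch, assuming you have some per-example difficulty score (token count, number of reasoning steps, whatever) and your own `train_step` update function; both are placeholders, not any particular framework's API:

```python
def curriculum_train(model, examples, difficulty, train_step, n_stages=4):
    """Staged curriculum: each stage unlocks a larger, harder slice of the data.

    `difficulty`: callable mapping an example to a sortable score (easy = low).
    `train_step`: your own update function, (model, example) -> model.
    """
    ordered = sorted(examples, key=difficulty)          # easy -> hard
    for stage in range(1, n_stages + 1):
        cutoff = int(len(ordered) * stage / n_stages)   # widen the window each stage
        for example in ordered[:cutoff]:                # easy examples get revisited
            model = train_step(model, example)
    return model

# e.g. difficulty = lambda ex: len(ex["input"].split())  # crude proxy: longer = harder
```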
Funnily enough, your oil analogy actually supports the case for synthetic data: refined petroleum is more useful than crude for specific purposes, just like well-designed synthetic data can be more effective than raw internet text for certain learning objectives.
Also, what's with the hate for MBAs?
Your comment is at odds with the rules here.