Hacker News

The reason language models require large amounts of data is that they lack grounding. When humans write a sentence about, say, "fire", we can relate that word to visual, auditory, and kinesthetic experiences built from a coherent world model. Without this world model the LM needs a lot of examples: essentially it has to memorize all the different contexts in which the word "fire" appears and figure out when it's appropriate to use the word in a sentence. A perfect language model is literally impossible, because you can always contrive a novel context that the LM has never seen before.
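To make the "memorize all the contexts" point concrete, here's a toy sketch (my own illustration, not anyone's actual model): an ungrounded model can only distinguish uses of "fire" by tabulating the surrounding words, and the table grows with every new context.

```python
from collections import defaultdict

# Toy illustration: count the distinct (previous word, next word)
# contexts in which "fire" appears in a tiny made-up corpus.
corpus = [
    "the fire spread quickly",
    "hold your fire soldier",
    "they decided to fire him",
    "the camp fire burned low",
]

contexts = defaultdict(int)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        if w == "fire":
            prev = words[i - 1] if i > 0 else "<s>"
            nxt = words[i + 1] if i + 1 < len(words) else "</s>"
            contexts[(prev, nxt)] += 1

# Four sentences already produce four distinct contexts; real usage
# is open-ended, so the table can never be complete.
print(len(contexts))  # 4
```

A grounded model could collapse many of these contexts into a few underlying concepts (combustion, shooting, dismissal); a purely distributional one has to see each of them.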

I suspect that the more data modalities we add, the less data would be required, but that's not the whole picture either. For example, text-to-image generators often make weird mistakes that look "unphysical", like objects that appear to flow into each other. The reason is that these models (including DALL-E) use a simple U-Net, which basically only sees textures. What they lack is the human inductive bias that 2D images are typically representations of a 3D world, a world that largely consists of discrete objects and physics. They make these mistakes because they don't know what objects are, and need to brute-force this idea from a ton of observations. Even simple cognitive abilities like object persistence require time perception, which these models lack.
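The "2D images are projections of a 3D world" bias can be made concrete with a pinhole projection (a standard formula, used here purely as an illustration of what the texture-level model never sees):

```python
# Pinhole camera projection: a 3D point (x, y, z) maps to the 2D
# image point (f*x/z, f*y/z). Depth z is divided out, which is
# exactly the structure a texture-only model has to rediscover.
def project(point3d, focal=1.0):
    x, y, z = point3d
    return (focal * x / z, focal * y / z)

# Two points with identical (x, y) but different depths land at
# different image positions: the 2D appearance encodes 3D structure.
near = project((1.0, 1.0, 2.0))
far = project((1.0, 1.0, 4.0))
print(near, far)  # (0.5, 0.5) (0.25, 0.25)
```

A human viewer applies this mapping implicitly; a U-Net trained from scratch has to infer it from pixel statistics alone.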

I think the fact that these models can make up for this deficit with a ton of data is very telling. There is a lot of low hanging fruit in integrating more data modalities.



> Even simple cognitive abilities like object persistence requires time perception, which these models lack.

What do you mean? If we send a robot to explore its environment and train it by having it constantly predict the next video frame, wouldn't it eventually learn the physics and therefore gain "time perception"?
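A minimal version of this idea (my own toy example, not a real robot): predict the next "frame" of a 1D ball moving at constant velocity. Fitting next = a*current + b*previous by gradient descent recovers a ≈ 2, b ≈ -1, which is exactly the constant-velocity rule x(t+1) = 2x(t) - x(t-1), i.e. a crude notion of time learned purely from prediction.

```python
# 1D "video": ball position over time, constant velocity x(t) = t.
positions = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

# Fit next = a * current + b * previous with plain SGD on squared
# prediction error. The data is exactly linear, so SGD converges to
# the unique solution a = 2, b = -1.
a, b = 0.0, 0.0
lr = 0.01
for _ in range(20000):
    for t in range(2, len(positions)):
        pred = a * positions[t - 1] + b * positions[t - 2]
        err = pred - positions[t]
        a -= lr * err * positions[t - 1]
        b -= lr * err * positions[t - 2]

print(round(a, 2), round(b, 2))  # converges to a ≈ 2, b ≈ -1
```

The learned rule needs two past frames, not one: velocity only exists across time, which is the sense in which single-image models have no time perception.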


Yeah, that would work. I'm talking specifically about LLMs, DALL-E, and diffusion image generators.


This is exactly what CLIP already does, and there will be massive improvements in this area in the coming years, I promise.
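For context, CLIP trains image and text encoders so that matching image/caption pairs have high similarity. Here's a rough sketch of that matching step with made-up toy embeddings (not the real model or its API):

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

images = [(1.0, 0.0), (0.0, 1.0)]     # toy image embeddings
captions = [(0.9, 0.1), (0.1, 0.9)]   # toy caption embeddings

# For each image, softmax over its similarity to every caption.
# CLIP-style contrastive training pushes the correct (diagonal)
# pair toward probability 1.
best = []
for img in images:
    sims = [cosine(img, cap) for cap in captions]
    exps = [math.exp(s) for s in sims]
    probs = [e / sum(exps) for e in exps]
    best.append(probs.index(max(probs)))

print(best)  # [0, 1]: each image matches its own caption
```

The point upthread is that this shared embedding space is exactly the kind of cross-modal grounding the parent comment says text-only models are missing.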




