It's because ChatGPT's knowledge doesn't come through interaction with the world. Words mean things to us because they point to world interactions. I saw a demo where ChatGPT could simulate a VM. It could be trained to interact directly with a VM, sending commands to an interpreter. In that case it would understand the responses of the VM, although it still wouldn't understand the design behind the VM, because humans did that based on interaction with the physical world.
Right. Those are tied via evolution to the dynamics of the physical world. We can simulate the physical world and learn from that, but there needs to be a there there. Language assumes the listener already has that understanding.
It did, actually. The model was trained with multiple rounds of reinforcement learning where human judges provided the feedback: first by writing full demonstration answers, and then by ranking the model's candidate answers from best to worst.
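That ranking step can be sketched with a pairwise reward-model loss (a minimal sketch; the function name is mine, but the Bradley-Terry-style loss is the one commonly described for RLHF reward models):

```python
import math

def ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the human-preferred answer
    higher, large when it prefers the rejected one.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The reward model is trained to minimize this over every pair the
# human judges ranked; only the relative order of scores matters.
print(ranking_loss(2.0, 0.0))  # preferred answer scored higher: low loss
print(ranking_loss(0.0, 2.0))  # preferred answer scored lower: high loss
```

The policy model is then fine-tuned against this learned reward, which is the "multiple rounds" part of the training.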
So the model in production is probably frozen, but before that it went through multiple rounds of interaction with the world.
The reinforcement learning was on giving the right answer, not on interacting with the world. But there is movement in the right direction with https://ai.googleblog.com/2022/12/rt-1-robotics-transformer-...
and other RL stuff. (RT-1 isn't RL but there is other related stuff that is)
Oh, you meant interaction as a joint training with images, actions, feedback etc. That would be the next generation I guess.
I am simply thinking of interaction here as similar to learning a language in a classroom. First the teacher provides sample questions and answers, then the teacher asks the students to come up with answers themselves and tells them which ones are better. The end result, I think, is that ChatGPT is quite good at answering questions and can pass as a human, especially if it's augmented with a fact database so obviously wrong answers can be pruned.
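A toy sketch of that pruning idea (the fact store, key names, and string-matching rule here are entirely hypothetical; a real system would query a knowledge base and do proper entailment checking):

```python
# Hypothetical fact store; stands in for an external fact database.
FACTS = {"capital_of_france": "Paris"}

def prune_answers(candidates, fact_key):
    """Drop candidate answers that fail to mention the stored fact."""
    truth = FACTS.get(fact_key)
    if truth is None:
        return list(candidates)  # no fact to check against: keep everything
    return [c for c in candidates if truth.lower() in c.lower()]

print(prune_answers(
    ["The capital of France is Paris.", "The capital of France is Lyon."],
    "capital_of_france",
))  # only the Paris answer survives
```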
I think we will see AGI. But for the AI to be robust, it has to interact with the world, even if it is a simulated one. We need to build an AI that knows what a toddler knows before we can build one that understands Wikipedia.
Human text does interact with the real world, so I don't see the limitation. Adding more modalities (vision, sound, etc.) will probably increase performance, and I think this is where we are heading, but it's silly to say that any one of these modalities is not grounded in reality. It's like saying humans can't understand reality because we can't see infrared. I mean, yeah, but it's not the only way of making sense of reality.
Language is a representation medium for the world; it isn't the world itself. When we talk, we only say what can't be inferred, because we assume the listener has a basic understanding of the dynamics of the world (e.g., if I push a table, the things on it will also move). Having an AI watch YouTube and enabling it to act out what it sees in simulation would give it that grounding. We are heading in that direction. So, I agree ChatGPT is awesome. I don't believe it understands what it is saying, but it could if it trained by acting out what it sees on YouTube.