Consistent, algorithmic performance on basic tasks.
A great example is the simple 'count how many letters' problem. If I prompt it with a word or phrase, and it gets it wrong, me pointing out the error should translate into a consistent course correction for the entire session.
If I ask it to tell me how long President Lincoln will be in power after the 2024 election, it should have a consistent ground truth to correct me (or at least ask for clarification of which country I'm referring to). If facts change, and I can cite credible sources, it should be able to assimilate that knowledge on the fly.