Wow.

(I have no idea what I'm talking about, but) it occurs to me that if a robot did this for each frame of video from a regular camera, then in static environments the output of SLAM would be great.

Also, you could predict what you will see after a move by rendering the CAD scene from the new pose, make the move, compare the actual new image with the predicted one, and dedicate most computing resources to the regions that differ the most. Now you have a robot that pays attention to unrecognized objects!
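To make the predict-and-compare idea a bit more concrete, here is a rough sketch in Python with NumPy. It assumes the predicted image has already been rendered from the CAD scene by something else and that both images are same-sized grayscale arrays. The function name `attention_budget` and the parameters `tile` and `total_budget` are made up for illustration, and splitting the budget in proportion to per-tile absolute difference is just one simple choice, not an established method.

```python
import numpy as np


def attention_budget(predicted, actual, tile=4, total_budget=100.0):
    """Split a compute budget across image tiles in proportion to how much
    the actual image differs from the predicted one (hypothetical helper)."""
    predicted = np.asarray(predicted, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    if predicted.ndim != 2 or predicted.shape != actual.shape:
        raise ValueError("expected two grayscale images of the same shape")
    h, w = predicted.shape
    if h % tile or w % tile:
        raise ValueError("image sides must be multiples of the tile size")

    # Per-pixel prediction error.
    err = np.abs(actual - predicted)

    # Sum the error inside each tile x tile block.
    tile_err = err.reshape(h // tile, tile, w // tile, tile).sum(axis=(1, 3))

    total = tile_err.sum()
    if total == 0:
        # Nothing unexpected anywhere: spread the budget evenly.
        return np.full(tile_err.shape, total_budget / tile_err.size)
    return total_budget * tile_err / total


# Toy example: the scene matches the prediction except in the top-right tile.
predicted = np.zeros((8, 8))
actual = predicted.copy()
actual[0:4, 4:8] = 1.0  # an "unrecognized object" shows up here
budget = attention_budget(predicted, actual)
most_surprising_tile = np.unravel_index(budget.argmax(), budget.shape)
```

In the toy example the whole budget lands on the one tile that changed. A real robot would presumably also need to deal with lighting changes, sensor noise, and pose error, all of which this sketch ignores.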