GPUs are increasingly optimised for machine learning workloads, but they weren't designed for them at the architectural level and still leave a lot of performance and efficiency on the table. Current ML implementations remain kludgy and path-dependent, carrying plenty of technical debt, so there is huge potential for optimisation throughout the stack.
I’d welcome correction, but my understanding is that NPUs are “native” neural-network processors: they trade the flexibility of a general-purpose graphics pipeline for hardware built directly around neural-net operations, without a heavy abstraction layer like CUDA sitting between the network and the silicon.
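To make that concrete: the core primitive most NPUs dedicate silicon to is a low-precision multiply-accumulate over a weight matrix. The sketch below is purely illustrative (it is not any vendor's actual API, and the function name is made up); it just shows the kind of int8-in, int32-accumulate operation a fixed-function MAC array performs.

```python
import numpy as np

def npu_style_matmul(x_int8: np.ndarray, w_int8: np.ndarray) -> np.ndarray:
    """Illustrative int8 x int8 -> int32 matrix multiply-accumulate,
    the kind of fixed-function operation an NPU hard-wires."""
    # Accumulate in a wider integer type to avoid overflow,
    # mirroring what hardware MAC arrays do.
    return x_int8.astype(np.int32) @ w_int8.astype(np.int32)

x = np.array([[1, -2], [3, 4]], dtype=np.int8)   # activations
w = np.array([[5, 6], [7, 8]], dtype=np.int8)    # weights
print(npu_style_matmul(x, w))
```

Because the datapath is fixed (quantised operands, wide accumulators, no general-purpose shader scheduling), an NPU can spend nearly all its transistors and power budget on exactly this operation.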