When a system already has a GPU, adding tensor cores to it is much more efficient than adding a separate NPU, which has to replicate all the data transfer pipelines and storage buffers that the GPU already has. Besides, Nvidia's tensor cores are systolic.
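
For anyone unfamiliar with the term "systolic", here is a rough sketch of the idea: a toy, cycle-by-cycle simulation of a generic output-stationary systolic array doing a matrix multiply. This is a textbook-style illustration only, not a description of Nvidia's hardware; the function name `systolic_matmul` and the register layout are made up for the example.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m systolic array.

    Hypothetical illustration: each processing element (PE) keeps one
    output value and only ever talks to its left and upper neighbours.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # A operands moving left to right
    b_reg = [[0] * m for _ in range(n)]  # B operands moving top to bottom

    # The last useful product reaches PE (n-1, m-1) at cycle n + m + k - 3.
    for t in range(n + m + k - 2):
        # Shift operands one PE to the right / one PE down.
        # Walk backwards so a value is moved before it is overwritten.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]

        # Feed skewed inputs at the edges: row i is delayed by i cycles,
        # column j by j cycles, so matching operands meet at the right PE.
        for i in range(n):
            s = t - i
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        for j in range(m):
            s = t - j
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0

        # Every PE performs one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The point of the layout is that operands are read from memory once at the array's edge and then reused as they pass from PE to PE, which is where the data-movement savings come from.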