Hacker News

Also, the gradient of softmax cross-entropy loss with respect to the logits is the same as the gradient of mean-squared-error loss with a linear output: both come out to (ŷ − y). So, up until the output activation, the network is optimizing the same function.
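A quick numerical sketch of this gradient equivalence (names and values here are illustrative, not from the comment): the analytic gradient of softmax cross-entropy w.r.t. the logits, softmax(z) − y, is checked against a finite-difference estimate, and has the same (ŷ − y) form as the MSE-with-linear-output gradient.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # Cross-entropy of softmax(z) against a one-hot target y
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.2, 2.0])   # logits (arbitrary example values)
y = np.array([0.0, 0.0, 1.0])    # one-hot target

# Analytic gradient w.r.t. the logits: softmax(z) - y,
# i.e. the same (y_hat - y) form as MSE with a linear output.
analytic = softmax(z) - y

# Central finite-difference check of the same gradient
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], y)
     - cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```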


