I learnt about KL Divergence recently and it was pretty cool to know that cross-entropy loss originated from KL Divergence.
But could someone give me the cases where it is preferred to use mean-squared error loss vs cross-entropy loss? Are there any merits or demerits to using either?
This is the NN 101 explanation: mean-square loss is for regression, cross-entropy is for classification.
NN 201 explanation: mean-square is about finite but continuous errors (residuals) of predicted value vs true value of the output, cross-entropy is for distributions over discrete sets of categories.
NN 501 explanation: the task of the NN and the form of its outputs should be defined in terms of the "shape" or nature of the residuals of its predictions. Mean-square corresponds to predicting means with Gaussian residuals, cross-entropy corresponds to predicting over discrete outputs with multinomial (mutually exclusive) structure. Indeed you can derive any loss function you want by first defining the expected form of the residuals and then deriving the negative log-likelihood of the associated distribution.
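As a toy sketch of that last point (my own code, not from the thread): write each loss as the negative log-likelihood of its assumed residual distribution, and the familiar formulas fall out.

```python
import numpy as np

def gaussian_nll(y_true, y_pred, sigma=1.0):
    """-log N(y_true; mean=y_pred, sigma). Up to constants, this is squared error."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y_true - y_pred) ** 2 / (2 * sigma**2)

def categorical_nll(y_true_onehot, probs):
    """-log p(true class) under the predicted multinomial: the cross-entropy loss."""
    return -np.sum(y_true_onehot * np.log(probs))

# Gaussian NLL differs from squared error only by an additive constant:
y, mu = 3.0, 2.5
assert np.isclose(gaussian_nll(y, mu) - gaussian_nll(y, y), 0.5 * (y - mu) ** 2)

# Categorical NLL is exactly the usual cross-entropy:
t = np.array([0.0, 1.0, 0.0])
p = np.array([0.2, 0.7, 0.1])
assert np.isclose(categorical_nll(t, p), -np.log(0.7))
```

Swapping in a different residual distribution (Laplace, Poisson, ...) and taking its NLL gives you a different loss by the same recipe.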
If you look deep enough, they are not exclusive. From the Deep Learning book:

"Many authors use the term “cross-entropy” to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model."
Also, with the matching output activation, the gradients of "softmax loss" and mean-squared-error loss take the same form (output minus target) with respect to the pre-activations, so the network learns to optimize the same function up until the output activation.
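A quick numerical sanity check of that claim (my own sketch, binary case): differentiate binary cross-entropy through a sigmoid, and squared error through an identity output, with respect to the pre-activation z. Both come out as (prediction − target).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy applied to sigmoid(z)."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def mse_from_output(z, y):
    """Squared error applied directly to z (identity output)."""
    return 0.5 * (z - y) ** 2

z, y, eps = 0.8, 1.0, 1e-6

# Finite-difference gradients w.r.t. the pre-activation z:
g_bce = (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)
g_mse = (mse_from_output(z + eps, y) - mse_from_output(z - eps, y)) / (2 * eps)

# Both gradients are (prediction - target):
assert np.isclose(g_bce, sigmoid(z) - y, atol=1e-5)
assert np.isclose(g_mse, z - y, atol=1e-5)
```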
In NN training, minimizing cross entropy is equivalent to minimizing KL divergence. This is because cross entropy is equal to (entropy of the true distribution) + (KL divergence from the true distribution to the model). Obviously by changing the model you can't change the first term, only the second. So when you minimize cross entropy, you are minimizing KL divergence.
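The decomposition is easy to verify numerically (toy distributions of my own choosing):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" (empirical) distribution
q = np.array([0.4, 0.4, 0.2])   # model distribution

entropy = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
kl = np.sum(p * np.log(p / q))

# H(p, q) = H(p) + KL(p || q); only the KL term depends on the model q.
assert np.isclose(cross_entropy, entropy + kl)
```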
Minimizing mean-squared-error loss is equivalent to minimizing KL divergence (and thus cross entropy) under the assumption that your model produces a vector that parameterizes the mean of a multivariate Gaussian distribution which is then used to predict your data. This is the most natural way to set up a model that predicts continuous data.
TL;DR: One is about distance in space, the other is about spread in space.
KL divergence is not a metric: it's not symmetric, and it's biased toward your reference distribution. It does, however, give a probabilistic / information-theoretic view of the difference between two distributions. One consequence of KL is that it will highlight the tails of your distribution.
Euclidean distance, L2, is a metric, so it is suited for when you need a metric. However, it does not give insight into any distributional phenomena except the means of the distributions.
For example, say you are a teacher with two classes, and you want to compare the grades.
L2 can summarize how far apart the grades are. The tails of the two grade distributions won't have any impact if the means are the same. That's good if you want to know the average level of both classes.
KL gives a view of how alike the spreads of the class grades are. Two classes can have a small KL divergence if their distributions have the same shape. If your classes are very different - one very homogeneous and the other very heterogeneous - then your KL will be big, even if the averages are very close.
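To make the classroom example concrete (my own illustrative numbers, modeling each class's grades as a Gaussian): two classes with identical averages but very different spreads have zero distance between means, yet a clearly nonzero KL divergence.

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1) || N(m2, s2)) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

# Same average grade, very different spreads:
homogeneous = (70.0, 2.0)     # (mean, std) - tightly clustered grades
heterogeneous = (70.0, 10.0)  # (mean, std) - widely spread grades

mean_gap = abs(homogeneous[0] - heterogeneous[0])  # the "distance between averages" view
kl = kl_gauss(*homogeneous, *heterogeneous)

print(mean_gap)  # 0.0 -- identical averages
print(kl)        # nonzero -- KL still sees the difference in spread
```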
Great explanation. Now it makes sense!
I vaguely remember napping during my physics lectures while looking over the transformation equations, which did not make sense to me. (I know it's partly my fault, but...)
I have never used Postman, but I am curious how useful or good their product is, given the recent bashing it received for being over-valued and just a wrapper around curl. If people like it and are willing to pay for it, isn't its valuation justified? I think that's true for any startup in general.
Postman was fantastic as a GUI for curl many years ago.
After raising money, they bloated the app with confusing, half-baked features, and whoever was in charge of usability really botched it. It's unusable now, and I came to dread opening it up.
I switched to Firecamp[1] and have been very happy with it.
It's useful, but it's very hard to justify the $19 per user per month to management unless you're using it constantly like an IDE instead of visiting once in a while to check the docs.
Question: does fp16 provide more accuracy than mixed precision? If so, any reason for this to be happening?
Looking at the discussion, everybody agrees that it is already well known that fp32 is overkill and that fp16 (or bf16) is the industry standard (for most cases, at least). But opinions on mixed-precision floating point seem to be missing. Has anybody seen benchmarks (other than the paper) indicating that mixed precision performs worse than fp16 and fp32?