I learnt about KL Divergence recently and it was pretty cool to know that cross-entropy loss originated from KL Divergence.
But could someone give me the cases where it is preferred to use mean-squared error loss vs cross-entropy loss? Are there any merits or demerits to using either?
This is the NN 101 explanation: mean-square loss is for regression, cross-entropy is for classification.
NN 201 explanation: mean-square is about finite but continuous errors (residuals) of predicted value vs true value of the output, cross-entropy is for distributions over discrete sets of categories.
NN 501 explanation: the task of the NN and the form of its outputs should be defined in terms of the "shape" or nature of the residuals of its predictions. Mean-square corresponds to predicting means with Gaussian residuals, cross-entropy corresponds to predicting over discrete outputs with multinomial (mutually exclusive) structure. Indeed you can derive any loss function you want by first defining the expected form of the residuals and then deriving the negative log-likelihood of the associated distribution.
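As a toy sketch of that last point (my own code, not from the thread): write each loss as the negative log-likelihood of its assumed residual distribution, and the familiar formulas fall out.

```python
import numpy as np

def gaussian_nll(y_true, y_pred, sigma=1.0):
    """-log N(y_true; mean=y_pred, sigma). Up to constants, this is squared error."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y_true - y_pred) ** 2 / (2 * sigma**2)

def categorical_nll(y_true_onehot, probs):
    """-log p(true class) under the predicted multinomial: the cross-entropy loss."""
    return -np.sum(y_true_onehot * np.log(probs))

# Gaussian NLL differs from squared error only by an additive constant:
y, mu = 3.0, 2.5
assert np.isclose(gaussian_nll(y, mu) - gaussian_nll(y, y), 0.5 * (y - mu) ** 2)

# Categorical NLL is exactly the usual cross-entropy:
t = np.array([0.0, 1.0, 0.0])
p = np.array([0.2, 0.7, 0.1])
assert np.isclose(categorical_nll(t, p), -np.log(0.7))
```

Swapping in a different residual distribution (Laplace, Poisson, ...) and taking its NLL gives you a different loss by the same recipe.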
If you look deep enough, they are not exclusive. From the Deep Learning book:

"Many authors use the term “cross-entropy” to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the probability distribution defined by the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model."
Also, with the matching output activation, the gradients of "softmax loss" and mean-squared-error loss take the same form (output minus target) with respect to the pre-activations, so the network learns to optimize the same function up until the output activation.
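A quick numerical sanity check of that claim (my own sketch, binary case): differentiate binary cross-entropy through a sigmoid, and squared error through an identity output, with respect to the pre-activation z. Both come out as (prediction − target).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy applied to sigmoid(z)."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def mse_from_output(z, y):
    """Squared error applied directly to z (identity output)."""
    return 0.5 * (z - y) ** 2

z, y, eps = 0.8, 1.0, 1e-6

# Finite-difference gradients w.r.t. the pre-activation z:
g_bce = (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)
g_mse = (mse_from_output(z + eps, y) - mse_from_output(z - eps, y)) / (2 * eps)

# Both gradients are (prediction - target):
assert np.isclose(g_bce, sigmoid(z) - y, atol=1e-5)
assert np.isclose(g_mse, z - y, atol=1e-5)
```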
In NN training, minimizing cross entropy is equivalent to minimizing KL divergence. This is because cross entropy is equal to (entropy of the true distribution) + (KL divergence from the true distribution to the model). Obviously by changing the model you can't change the first term, only the second. So when you minimize cross entropy, you are minimizing KL divergence.
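The decomposition is easy to verify numerically (toy distributions of my own choosing):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # "true" (empirical) distribution
q = np.array([0.4, 0.4, 0.2])   # model distribution

entropy = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
kl = np.sum(p * np.log(p / q))

# H(p, q) = H(p) + KL(p || q); only the KL term depends on the model q.
assert np.isclose(cross_entropy, entropy + kl)
```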
Minimizing mean-squared-error loss is equivalent to minimizing KL divergence (and thus cross entropy) under the assumption that your model produces a vector that parameterizes the mean of a multivariate Gaussian distribution which is then used to predict your data. This is the most natural way to set up a model that predicts continuous data.
TL;DR: One is about distance in space, the other is about spread in space.
KL divergence is not a metric: it's not symmetric, and it's biased toward your reference distribution. It does, however, give a probabilistic / information-theoretic view of the difference between two distributions. One consequence of KL is that it will highlight the tails of your distribution.
Euclidean distance, L2, is a metric, so it is suited for when you need a metric. However, it does not give insight into any distributional phenomena except the means of the distributions.
For example, say you are a teacher with two classes, and you want to compare the grades.
L2 can summarize how far apart the grades are. The tails of the two grade distributions won't have any impact if the means are the same. That's good if you want to know the average level of both classes.
KL gives a view of how alike the spreads of the class grades are. Two classes can have a small KL divergence if their distributions have the same shape. If your classes are very different - one very homogeneous and the other very heterogeneous - then your KL will be big, even if the averages are very close.
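To make the classroom example concrete (my own illustrative numbers, modeling each class's grades as a Gaussian): two classes with identical averages but very different spreads have zero distance between means, yet a clearly nonzero KL divergence.

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1) || N(m2, s2)) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

# Same average grade, very different spreads:
homogeneous = (70.0, 2.0)     # (mean, std) - tightly clustered grades
heterogeneous = (70.0, 10.0)  # (mean, std) - widely spread grades

mean_gap = abs(homogeneous[0] - heterogeneous[0])  # the "distance between averages" view
kl = kl_gauss(*homogeneous, *heterogeneous)

print(mean_gap)  # 0.0 -- identical averages
print(kl)        # nonzero -- KL still sees the difference in spread
```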
Great explanation. Now it makes sense!
I vaguely remember napping during my physics lectures while looking over the transformation equations, which did not make sense to me. (I know it's partly my fault, but...)
I have never used Postman, but I am curious how useful or good their product is, given the recent bashing it received for being over-valued and just a wrapper around curl. If people like it and are willing to pay for it, isn't its valuation justified? I think that's true for any startup in general.
Postman was fantastic as a GUI for curl many years ago.
After raising money, they bloated the app with confusing, half-baked features, and whoever was in charge of usability really botched it. It's unusable now, and I came to dread opening it up.
I switched to Firecamp[1] and have been very happy with it.
It's useful, but it's very hard to justify the $19 per user per month to management unless you're using it constantly like an IDE instead of visiting once in a while to check the docs.
Question: does fp16 provide more accuracy than mixed precision? If so, any reason for this to be happening?
Looking at the discussion, everybody agrees that it is already well known that fp32 is overkill and that fp16 (or bf16) is the industry standard (for most cases, at least). But opinions on mixed-precision floating point seem to be missing. Has anybody seen benchmarks (other than the paper) indicating that mixed precision performs worse than fp16 and fp32?