I agree in the context of non-deterministic algorithms. If there is too much variance between runs, then the runs that do well are probably just overfitting the data and won't actually perform well when deployed.
However, I wouldn't treat differences between implementations as a measure of variance. Those are more likely the result of different parameter choices or subtleties in the algorithms themselves. There are multiple ways to train a NN, for example. So I doubt any of the examples shown in the study are true apples-to-apples comparisons.
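To make the distinction concrete, here's a minimal sketch of what I'd call a fair variance measurement: repeat the *same* implementation with different random seeds and look at the spread. The `train_and_score` function is a hypothetical stand-in for a stochastic training run, not any algorithm from the study.

```python
# Hypothetical sketch: quantifying run-to-run variance of a single
# non-deterministic training procedure by repeating it with different
# random seeds. train_and_score is a stand-in, not a real trainer.
import random
import statistics

def train_and_score(seed):
    """Simulate one stochastic training run: the score depends on the
    seed (e.g. weight initialization, data shuffling order)."""
    rng = random.Random(seed)
    base_accuracy = 0.85                       # the algorithm's true capability
    return base_accuracy + rng.gauss(0, 0.02)  # seed-dependent noise

scores = [train_and_score(seed) for seed in range(30)]
mean = statistics.mean(scores)
stdev = statistics.stdev(scores)
print(f"mean={mean:.3f} stdev={stdev:.3f}")
# If the stdev is large relative to the gap between two algorithms,
# the "winner" on any single run may just be a lucky seed.
```

Comparing two *different* implementations and attributing the gap to variance conflates this seed-level noise with genuine design differences, which is the apples-to-oranges problem above.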