Comparison of machine learning libraries used for classification (github.com/szilard)
193 points by pzs on Aug 12, 2015 | hide | past | favorite | 34 comments


This is some pretty good work.

Vowpal Wabbit does pretty well, which isn't surprising - it has always benchmarked well.

I'm not really familiar with H2O at all, but those are some pretty impressive results. Not only is it competitive in terms of speed with Vowpal Wabbit, but it also looks like it achieved the highest absolute score (AUC = 81.2 on a GBM using H2O-3).


For those interested, MLcomp (http://mlcomp.org/) is a free website for objectively comparing machine learning programs across various datasets and problem domains.


I see some tests, but are there any public comparison tables/charts? (I.e. same data, but different algorithms or different packages.)


My friend Szilard is the author of this benchmark. He was attempting to answer the questions in the comments, but the responses did not show up. I am posting his answers here for him:

~~~~

I was trying to answer each question, but got blocked, so here are some answers in one comment:

1. Yes, it's WIP.

2. Did RF+GBM so far, did not even start DL, but stay tuned...

3. For Spark I used number of partitions = number of cores

4. For why Spark is less accurate for RF see Databricks comments here: http://datascience.la/benchmarking-random-forest-implementat... (essentially current version aggregates votes, while it should aggregate probabilities).

5. The issue of Spark being slow is not only an overhead on small data, but lots of serializing/deserializing, garbage collection etc.

6. Non-linear SVMs scale badly, but one could look at what max size can e.g. svmlight do (results/pull requests are welcome)

7. I think Logistic regression in Python scikit-learn can be much improved (as I mentioned on github) by using a sparse format (someone should do it)

8. The difference between implementations is somewhat surprising, but some of the tools take different approaches (e.g. RF in Python/xgboost works with the raw data, while H2O/Spark build bins from the data). There are also tricks and options available only in one or two of them, etc. I tried my best to match parameter values/setup, but it's not the same.

I also wrote a blog post (a slightly more organized text for RF than the github README): http://datascience.la/benchmarking-random-forest-implementat...

There is also a recording of a talk I gave at a Machine Learning meetup if you want more insights: https://www.youtube.com/watch?v=DK87lCLH_6A
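Answer 7 above (feeding scikit-learn's logistic regression a sparse input format) can be sketched like this. This is a hypothetical illustration on synthetic data, not the benchmark's actual setup:

```python
# Hypothetical sketch of answer 7: scikit-learn's LogisticRegression
# accepts scipy.sparse matrices directly, which can save a lot of
# memory and time on mostly-zero (e.g. one-hot encoded) data.
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_dense = (rng.random((1000, 100)) < 0.05).astype(np.float64)  # ~95% zeros
y = rng.integers(0, 2, size=1000)

X_sparse = sparse.csr_matrix(X_dense)  # CSR stores only the non-zeros
clf = LogisticRegression(max_iter=1000).fit(X_sparse, y)
print(clf.score(X_sparse, y))
```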


Is there any detailed breakdown of the data that led to conclusion 5?


This comparison is silly and unfair. It is tantamount to using anti-aircraft guns to fight mosquitoes and then concluding that heavy weapons are too slow for killing mosquitoes.

In these experiments, p is quite small (about 1K), so the size of the weight vector (or weight gradient vector) is only about 4 KB (for float) or 8 KB (for double). It should take less than 0.1 seconds to transfer these vectors, so there is no need for tree allreduce or a BitTorrent-style broadcast. These big inventions from VW and Spark become useless and burdensome. In such a setting, Spark's accumulator also has no advantage over traditional map-reduce.

Another point: though correctness is more important than performance, it's very difficult to get a correct implementation even for the most basic problems. E.g. most implementations of OWLQN (which is for L1-regularized convex problems) are wrong. More seriously, sometimes the system you rely on is problematic or unstable by design; e.g. this Spark issue https://issues.apache.org/jira/browse/SPARK-5490 may never be fixed.

I would also recommend Google's Sensei (https://github.com/google/sensei) as an alternative to be evaluated. I recommend it just because it comes from Google, but it's not well-known yet.


Since when did Spark invent tree allreduces (which have been standard practice within MPI for decades)?


I don't think he said anything about Spark inventing allreduce. Spark did use torrent broadcast, which I believe is pretty unique.


"So there [sic] are no need to use tree allreduce [...] These big inventions from Spark..."


"tree allreduce or bittorrent"


For logistic regression scikit-learn seems to be the slowest, but also the most accurate (BTW: why is it so memory-consuming?). Is it possible to tweak parameters for the various packages so as to get similar scores? (Or to show score and time ranges?) Even the number of iterations can be a huge factor.
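As a rough sketch of what such tweaking looks like in scikit-learn (hypothetical toy data and parameter values, not the benchmark's): capping the iteration count trades fit quality against training time.

```python
# Hypothetical sketch: varying max_iter in scikit-learn's
# LogisticRegression to trade training time against fit quality.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
for iters in (5, 50, 500):
    t0 = time.perf_counter()
    clf = LogisticRegression(max_iter=iters).fit(X, y)
    elapsed = time.perf_counter() - t0
    print(iters, round(clf.score(X, y), 3), round(elapsed, 4))
```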


This paper is also worth reading, it's excellent and very little known (comparisons start from page 5)

http://www.cs.uic.edu/~tdang/file/CHIRP-KDD.pdf


It's always good to see proper evaluations and I am sure that this will provide food for thought and improvement for the tool owners.

However - a couple of points.

- One data set can't be more than an indicative test.
- Spark isn't designed for single-machine environments.
- Spark isn't designed for "small problems", i.e. where it takes 30 seconds to run the system.
- If the algorithms are right then the results for accuracy should be identical. That they aren't indicates a bug, which would be worth unpicking with the relevant teams?


These algorithms are all fairly complicated with numerous parameters involved in the calculations. Most of the packages (I can't speak for all of them since I'm not familiar with all of them) make it easy to implement these algorithms by providing reasonable default values for most of the necessary parameters. It's very unlikely that they are set to identical values, thus leading to differing results.

In addition, some of the algorithms are non-deterministic. Random forests, as the name implies, involve random choices in building the decision trees. You can run the same algorithm with the same implementation and get different results. Likewise with NNs: it all depends on how the weights are initialized, which is usually a non-deterministic process as well.

To sum it up, I don't think that obtaining different results from different implementations of the same ML algorithm is indicative of a bug. It's actually expected in many situations.
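To illustrate the non-determinism point on toy data (a hypothetical sketch, not the benchmark's setup): the same random-forest code with different seeds produces measurably different held-out predictions.

```python
# Hypothetical sketch: identical random-forest code, different seeds,
# different held-out predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = [RandomForestClassifier(n_estimators=20, random_state=s)
         .fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for s in (1, 2)]
print(np.abs(proba[0] - proba[1]).max())  # typically non-zero
```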


edited to better explain

So...

If the variance of a particular implementation from run to run is significant, then one has to question how real any gain over the bottom result is. In the old days we used to cross-validate 30+ times and give results x vs y vs z with some confidence interval.

I believe that any correct implementation of a particular algorithm should behave almost exactly the same as any other implementation. Therefore, over a number of runs the results should be the same.
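The "cross-validate 30+ times with a confidence interval" practice can be sketched like so (hypothetical toy data, not from the benchmark):

```python
# Hypothetical sketch: repeat 5-fold CV 30 times with different
# shuffles and report mean accuracy with a rough 95% interval.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
scores = []
for rep in range(30):
    cv = KFold(n_splits=5, shuffle=True, random_state=rep)
    scores.extend(cross_val_score(LogisticRegression(max_iter=500), X, y, cv=cv))
scores = np.asarray(scores)
half_width = 1.96 * scores.std() / np.sqrt(len(scores))
print(f"{scores.mean():.3f} +/- {half_width:.3f}")
```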


I think you have a stricter definition of what an algorithm is than what is used in the context of this study. The general concept of the random forest algorithm (for example) is the same in each implementation, but the exact details of how that general algorithm is implemented (the exact algorithm) are most likely not the same. Therefore, you shouldn't expect to get completely identical results regardless of how much data you train them with. They should all be in the same neighborhood, though, which they are, for the most part, in the results from the study.


I agree in the context of the non-deterministic algorithms. If there is too much variance between runs then the ones that do well are probably just overfitting the data and won't actually perform well when deployed.

However, I wouldn't consider differences in each implementation as a measure of variance. This is likely the result of differences in parameter choices or subtleties in the algorithms. There are multiple ways to train a NN, for example. So I doubt any of the examples shown in the study are true apples-to-apples comparisons.


I've implemented a random forest library that achieves similar or better results to some of the ones described [1] and in my experience there is a lot of room for interpretation in the algorithm. In most of these cases there isn't necessarily a right way, one implementation may handle clean data better while another handles dirty/noisy data better.

Some examples of spots where interpretations differ are:

1) Do you search for categorical splits exhaustively or randomly, require one-hot encoding, or use some other heuristic?

2) Do you use 32- or 64-bit floats? Or do you support both, but stop tree growth with a minimum impurity-decrease constant to make sure tests pass on both (scikit does this; it is essentially an unexposed hyperparameter)?

3) How do you handle features that have become constant in the branch of a tree being extended? Scikit insists that at least one non-constant feature be examined for each potential split, resulting in bigger trees; other implementations will examine only m features even if they are all constant. I made this switchable in mine: the scikit behavior works better on many test datasets, the other way on some noisy, overfitting-prone datasets.

4) If default parameters are used, do you default to Gini impurity or entropy? After taking the square root of the number of features, do you round to the nearest integer or round up (so you always get at least one) to set the number of features examined for each split?

All of this just means it is important to try different implementations and do hyperparameter searches etc on the data you actually care about.

[1] https://github.com/ryanbressler/CloudForest
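Point 4 above can be made concrete: reasonable roundings of sqrt(p) give different per-split feature counts (plain Python, not tied to any particular library):

```python
# Rounding sqrt(p) to the nearest integer, up, or down yields
# different numbers of features examined per split.
import math

for p in (10, 50, 1000):
    nearest = round(math.sqrt(p))
    up = math.ceil(math.sqrt(p))           # "just up": always at least 1
    down = max(1, math.floor(math.sqrt(p)))
    print(p, nearest, up, down)
```

For p = 10, for instance, the three conventions already disagree (3 vs 4 vs 3), and that is before any other implementation detail comes into play.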


Not really a bug in most cases, but assumptions made by the library that cannot be changed.

Also, sometimes more data doesn't help much, so this is somewhat relevant even for Spark users.


Very cool, thanks for doing/sharing this. I am curious why he says "non-linear SVMs are the most accurate but can't scale". I've used the good old C svmlight on pretty large datasets; it was not real-time, but...


> I've used the good old C svmlight on pretty large datasets

When the dataset size is in GBs, non-linear kernels might not even finish. Kaggle has good-sized datasets that you could use to see the difference; it is insane. That said, I would love to see it included in this kind of benchmark to see how the AUC would compare.


You have ~n^2 exp operations (assuming rbf kernel) that weren't present before. This becomes extremely slow for big n in my experience.
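A minimal sketch of where that n^2 comes from (my own toy code, not any library's implementation): the RBF kernel matrix needs one exp per pair of points.

```python
# Toy sketch: an RBF kernel matrix needs n*n exponentials, so
# doubling n quadruples both the work and the memory.
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # pairwise squared distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clamp float noise below 0

K = rbf_kernel_matrix(np.random.default_rng(0).random((100, 5)))
print(K.shape)  # (100, 100)
```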


I'm curious to see how TMVA (http://tmva.sourceforge.net/) would stack up


I would advise anyone to stay away from ROOT. It's a really, really badly programmed framework that will steal many hours of your time.


Hey now, it's better than PAW!


Would be curious to see how they compare to DL tools like http://deeplearning4j.org


Didn't set the number of threads for Spark to use. You should oversubscribe. This is probably the biggest misstep.


Where is the table comparing the accuracy of DL tools to others?

Seems like mainly he is just saying "DL takes a shitload of resources" and no numbers.

My understanding is that the other tools have limited relevance now with DL, since most systems are moving to cloud-based services that do have resources for DL, and DL offers much better accuracy. That is not readily concluded from this report.


Actually, resources are the least of your concerns when it comes to DL.

DLs have a shitload of hyper-parameters (effectively, the entire architecture) which is mostly built by trial and error + intuition. There was some really exciting work on reversible SGD by an applied math group at Harvard (https://github.com/HIPS/hypergrad) to obtain derivatives with respect to network architecture, but it's not very mature yet.

Further, DLs need an immense amount of data to be efficient. If your dataset has only a few hundred features and a few thousand samples, it's unlikely DL will help you much.

Finally, DL is not really useful out of the box, at all. scikit-learn's selling point is that it has an amazingly good uniform API, fantastic documentation, and very clean code. Making a working DL implementation with a simple 'fit' sklearn-like API is impossible.


I noticed you also did not give any information on the accuracy of DL versus other techniques.


Deep learning is definitely "the new hotness" for machine learning but it is a big mistake to assume that all other techniques are now irrelevant. Depending on your problem space, DL might produce decidedly inferior results. For example, if you do not have a massive database DL probably isn't going to give you the best results. Or, if you have limited computational resources DL is probably not the best choice.


Is this still WIP? I would really like to see how spark fares in a distributed environment.


Do any of these libraries have support for computation on the GPU using CUDA?


(at the end of the README file)

Conclusions: ...



