From a quick search, it seems there are several general-purpose machine learning packages in Python (Orange, mlpy, scikits.learn) and even more specialized ones. Does anyone have experience with more than one of them? How do they compare to each other in terms of features, maturity, performance and community?

asked Oct 25 '10 at 04:43


George Sakkis

edited Nov 04 '10 at 03:23


Joseph Turian ♦♦

I'd say scikit-learn is starting to look really good, avoiding many of the pitfalls you usually find in this sort of software. I'll wait for @ogrisel to comment, however, as he posts on this site and is one of the developers.

(Oct 25 '10 at 06:44) Alexandre Passos ♦


(Oct 25 '10 at 13:43) Yaroslav Bulatov

7 Answers:

Hi, I am one of the developers of scikit-learn. The project has recently gained steam and is moving fast thanks to new contributors. The design goals are:

  • wide coverage of cutting-edge algorithms with a simple-to-use, unified API

  • a permissive license suitable for embedding (simplified BSD) and a low dependency footprint (numpy + scipy)

  • optimized yet maintainable implementations, using cython where useful

  • scalable algorithms with both dense and sparse representations of the features (useful, for instance, for text classification with tens of thousands of samples and ~100,000 features)

  • tooling to perform cross validation & performance evaluation across all algorithms that respect the API (duck typing)

  • well tested: >= 500 tests that run in under 15 s, with coverage of ~85% and improving (see our buildbot)

  • well documented, with worked examples (though this could still be much improved)

  • readable source code respecting PEP8 conventions
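As a sketch of what that unified API looks like in practice (using a recent scikit-learn release; module paths have changed since this thread was written — `cross_val_score` now lives in `sklearn.model_selection`):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Every estimator exposes the same fit/predict surface, so the
# cross-validation helper works on any of them via duck typing.
for model in (LogisticRegression(max_iter=1000), SVC()):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())
```

Nothing about `cross_val_score` is specific to either model; it only relies on the shared `fit`/`predict`/`score` interface.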

We don't plan to have a complete dataflow programming model as MDP does (though the two projects are collaborating (a bit) to make it easy to use scikit-learn algorithms in MDP nodes). It should also be possible to wrap scikit-learn models into Orange components if your users want the rich user interface of that framework (but AFAIK nobody has tried so far).

Current limitations:

  • the current API requires loading the training data into memory, but this will evolve to handle streaming / large-scale datasets (by integrating the work done by Peter in bolt)

  • no command-line interface: right now the user has to know basic python & numpy (the ipython shell is the most popular UI among scikit-learn devs). A generic CLI might appear in the coming months, though we will probably never offer more than a CLI in terms of interface.

  • very focused on supervised learning and linear models right now. More unsupervised approaches are planned or under development, though.
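For readers arriving later: the streaming plan mentioned above did materialize in scikit-learn as `partial_fit` on the SGD-based estimators. A minimal out-of-core sketch on synthetic data (batch sizes invented for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier()            # linear model trained by stochastic gradient descent
classes = np.array([0, 1])       # partial_fit needs the full label set up front

# Pretend each batch streams in from disk; only one batch is in memory at a time.
for _ in range(20):
    X_batch = rng.randn(50, 10)
    y_batch = (X_batch[:, 0] > 0).astype(int)   # toy target: sign of feature 0
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.randn(200, 10)
y_test = (X_test[:, 0] > 0).astype(int)
acc = clf.score(X_test, y_test)
print(acc)
```

The estimator never sees the full dataset at once, which is the point of the streaming API.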

We also plan to provide standard feature extractors for text classification / clustering (work under way, mostly done), image classification (some basic examples), face recognition (planned) and maybe audio / speech for segmentation / classification / fingerprinting (prospective, nothing done yet). The goal is for the user to have worked examples with sane default parameters to build upon, not just machine learning building blocks that require knowing the inner workings before they can be applied to a concrete use case.
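The text feature extractors did ship; in current scikit-learn they look like this (toy corpus invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "cats and dogs are pets",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # sparse matrix: n_docs x n_terms
print(X.shape)
print(sorted(vec.vocabulary_)[:5])
```

The resulting sparse matrix feeds directly into any estimator that accepts sparse input, which is where the sparse-representation design goal above pays off.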

This answer is marked "community wiki".

answered Oct 25 '10 at 08:29



edited Oct 25 '10 at 18:27

Theano is a CPU and GPU compiler for mathematical expressions in Python. It combines the convenience of NumPy with the speed of optimized native machine language. For gradient-based machine learning algorithms (like training an MLP or convolutional net), Theano is from 1.6x to 7.5x faster than competitive alternatives (including those in C/C++, NumPy, SciPy, and Matlab) when compiled for the CPU and between 6.5x and 44x faster when compiled for the GPU. You can read more about it here.

answered Nov 04 '10 at 03:33


Joseph Turian ♦♦


I'd like to add that theano can also automatically compute derivatives of the functions you define. This is very neat when you are building complex combinations of models and using gradient-based optimization.

(Nov 05 '10 at 07:02) Philemon Brakel

scikit-learn is starting to really shape up, as others have mentioned. Other fairly mature packages include MDP and PyMVPA. If you want kernel methods, Shogun has feature-complete Python bindings and is also quite mature. There's also Elefant, which I don't know much about, but it has been around a long time.

answered Oct 25 '10 at 11:24


David Warde Farley ♦

Let me add my own package to the list: milk. Another worth mentioning is MDP.

I tried mlpy when it was coming out and it did not support much, which is why I ended up developing my own code. In general, scikit-learn has more cutting-edge features than milk, while I focus more on performance (both speed- and memory-wise) and flexibility.

Unfortunately, part of the answer depends on what you mean by "machine learning". Classification? Clustering? Graphical models? Deep learning? No package does it all well at the moment. How large is your data?

answered Oct 25 '10 at 07:36



+1 for the last paragraph; knowing your problem well is far more important than any particular choice of software. If there is no problem yet and the goal is to explore, just do that.

(Nov 04 '10 at 13:39) Radim

I agree that knowing the problem is the most important part. I am a developer on scikit-learn, and I think that for some problems we are starting to have really nice implementations, but the choice of the optimization problem and the strategy (algorithmic implementation) is what makes the difference.

This is also why one of our goals in scikit-learn is to find a reasonable API that enables the expression of many problems in a consistent way (and to gather algorithms behind this API in the scikit). This is of course very challenging, as developers tend to overfit an API design to the problems they know and work on. Criticism is of course welcome on the mailing list :)
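To make the "consistent API" point concrete: any object with the estimator interface plugs into the same tooling. Here is a deliberately dumb majority-class classifier (written for illustration, not part of the library) passed to modern scikit-learn's cross-validation helper:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Always predicts the most frequent training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

X, y = load_iris(return_X_y=True)
scores = cross_val_score(MajorityClassifier(), X, y, cv=5)
print(scores.mean())
```

Because the helper only relies on `fit`/`predict`/`score`, user-written models and library models are interchangeable.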

(Nov 06 '10 at 06:17) Gael Varoquaux

Amund Tveit has a brief summary of Python tools for classification, with recommendations of Orange or PyML.

answered Oct 26 '10 at 18:11


Thomas Brox Røst

I only dabbled in Orange several years ago (before version 1) for text classification. It can handle thousands of samples with thousands of features.
I think it was stable and mature even before version 1. As for the community, Orange has a web-based forum for Orange-related discussion. The authors respond quickly with helpful suggestions.

edit: today I found this: "Python_and_Machine_Learning" slides and video.

answered Oct 25 '10 at 05:58



edited Oct 26 '10 at 06:02

Don't forget MDP, probably the best option for unsupervised learning, with a bias towards image processing. They also have good test coverage and docs (and although the site is/was a bit dated, they did a sprint and improved the looks and the docs a lot).

answered Dec 27 '10 at 12:35


Jose Quesada


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.