|
From a quick search, it seems there are several general-purpose machine learning packages in Python (Orange, mlpy, scikits.learn) and even more specialized ones. Does anyone have experience with more than one of them? How do they compare to each other in terms of features, maturity, performance and community? |
|
Hi, I am one of the developers of scikit-learn. The project has recently gained steam and is moving fast thanks to new contributors. The design goals are:
We don't plan to have a complete dataflow programming model as MDP does (but the two projects are collaborating a bit to make it easy to use scikit-learn algorithms in MDP nodes). It should also be possible to wrap scikit-learn models into Orange components if your users want the rich user interface of that framework (but AFAIK nobody has tried so far). Current limitations:
We also plan to provide standard feature extractors for text classification / clustering (work under way, mostly done), image classification (some basic examples), face recognition (planned) and maybe audio / speech for segmentation / classification / fingerprinting (prospective, nothing done yet). The goal is for the user to have worked examples with sane default parameters to build upon, and not just machine learning building blocks that require knowing the inner workings before they can be applied to a concrete use case.
|
|
Theano is a CPU and GPU compiler for mathematical expressions in Python. It combines the convenience of NumPy with the speed of optimized native machine language. For gradient-based machine learning algorithms (like training an MLP or convolutional net), Theano is from 1.6x to 7.5x faster than competitive alternatives (including those in C/C++, NumPy, SciPy, and Matlab) when compiled for the CPU, and between 6.5x and 44x faster when compiled for the GPU. You can read more about it here.
I'd like to add to this that Theano is also able to automatically compute derivatives of the functions you define. This is very neat when you are building complex combinations of models and using gradient-based optimization.
(Nov 05 '10 at 07:02)
Philemon Brakel
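To make the point about automatic derivatives concrete, here is a minimal sketch of the general idea: build an expression graph, then derive a new graph for the gradient by applying the sum and product rules. This is plain Python and is not Theano's API; the classes `Expr`, `Const`, `Var`, `Add`, `Mul` and the helper `wrap` are hypothetical names made up for illustration only.

```python
# Toy symbolic differentiation on an expression graph.
# Hypothetical classes for illustration; this is NOT Theano's API.

class Expr:
    """Base class: operator overloading builds the expression graph."""
    def __add__(self, other):
        return Add(self, wrap(other))
    __radd__ = __add__          # addition is commutative

    def __mul__(self, other):
        return Mul(self, wrap(other))
    __rmul__ = __mul__          # multiplication is commutative


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def eval(self, env):
        return self.value

    def grad(self, var):
        return Const(0.0)       # derivative of a constant is zero


class Var(Expr):
    def __init__(self, name):
        self.name = name

    def eval(self, env):
        return env[self.name]   # look the value up by variable name

    def grad(self, var):
        return Const(1.0 if var is self else 0.0)


class Add(Expr):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) + self.b.eval(env)

    def grad(self, var):
        return self.a.grad(var) + self.b.grad(var)   # sum rule


class Mul(Expr):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def eval(self, env):
        return self.a.eval(env) * self.b.eval(env)

    def grad(self, var):
        # product rule: (ab)' = a'b + ab'
        return self.a.grad(var) * self.b + self.a * self.b.grad(var)


def wrap(x):
    """Turn plain numbers into Const nodes; leave expressions untouched."""
    return x if isinstance(x, Expr) else Const(float(x))


# Squared error of a one-parameter linear model: (w * x - 1)^2
x, w = Var("x"), Var("w")
residual = w * x + (-1.0)
loss = residual * residual

dloss_dw = loss.grad(w)         # a new expression, derived automatically
point = {"x": 2.0, "w": 3.0}
print(loss.eval(point), dloss_dw.eval(point))
```

The gradient is itself an expression that can be evaluated at any point, which is what makes gradient-based optimization of composed models convenient: you only write the loss, never its derivative.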
|
|
scikits-learn is starting to really shape up, as others have mentioned. Other fairly mature packages include MDP and PyMVPA. If you want kernel methods, Shogun has feature-complete Python bindings and is also quite mature. There's also Elefant, which I don't know much about but which has been around a long time. |
|
Let me add my own package to the list: milk. And another: MDP. I tried mlpy when it was coming out and it did not support much, which is why I ended up developing my own code. In general, scikit.learn has more cutting-edge things than milk, while I focus more on performance (both speed and memory) and flexibility. Unfortunately, part of the answer depends on what you mean by "machine learning". Classification? Clustering? Graphical models? Deep learning? No package does it all well at the moment. How large is your data?
Plussed for the last paragraph; knowing your problem well is way more important than any particular choice of software. If there is no problem yet and the goal is to explore, just do that.
(Nov 04 '10 at 13:39)
Radim
I agree that knowing the problem is the most important part. I am a developer of scikit-learn, and I think that for some problems we are starting to have really nice implementations, but the choice of the optimization problem and the strategy (algorithmic implementation) is what makes the difference. This is also why one of our goals in scikit-learn is to find a reasonable API that enables the expression of many problems in a consistent way (and to gather algorithms with this API in the scikit). This is of course very challenging, as developers tend to overfit the problems they know and work on when designing an API. Criticism is of course welcome on the mailing list :)
(Nov 06 '10 at 06:17)
Gael Varoquaux
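As an illustration of what "a consistent API" buys you, here is a minimal sketch in plain Python. scikit-learn's estimators follow a `fit` / `predict` convention; the two toy classes below (`MajorityClassifier`, `NearestCentroidClassifier`) and the `accuracy` helper are hypothetical stand-ins written for this example and are not scikit-learn code.

```python
# Two very different toy models sharing the same fit/predict interface.
# Hypothetical classes for illustration; not taken from scikit-learn.
from collections import Counter


class MajorityClassifier:
    """Always predicts the most frequent label seen during training."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroidClassifier:
    """Predicts the label whose mean training point is closest."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, value in enumerate(row):
                acc[i] += value
        self.centroids_ = {
            label: [s / counts[label] for s in acc]
            for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        def sqdist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: sqdist(row, self.centroids_[lab]))
            for row in X
        ]


def accuracy(model, X_train, y_train, X_test, y_test):
    """Works with any object exposing fit(X, y) and predict(X)."""
    predictions = model.fit(X_train, y_train).predict(X_test)
    return sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)


X_train = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]
y_train = ["a", "a", "b", "b", "b"]
X_test = [[0.05, 0.1], [1.0, 0.9]]
y_test = ["a", "b"]

for model in (MajorityClassifier(), NearestCentroidClassifier()):
    print(type(model).__name__, accuracy(model, X_train, y_train, X_test, y_test))
```

The evaluation code never needs to know which algorithm it is driving, so swapping models is a one-line change. The hard part, as the comment above notes, is choosing an interface that does not only fit the problems its designers happen to know.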
|
|
Amund Tveit has a brief summary of Python tools for classification, with recommendations of Orange or PyML. |
|
I only dabbled in Orange several years ago (before version 1), for text classification. It could handle thousands of samples with thousands of features. Edit: today I found this: "Python_and_Machine_Learning": slides and video. |
|
Don't forget MDP, probably the best for unsupervised learning, with a bias towards image processing. They also have good test coverage and docs (the site is/was a bit dated, but they did a sprint and improved the looks and docs a lot). |
I'd say scikit.learn is starting to look really good, avoiding many of the pitfalls you usually find in this sort of software. I'll wait for @ogrisel to comment, however, as he posts on this site and is one of the developers.
http://pybrain.org/