
How big are their communities? How well do they scale?

asked Aug 01 '10 at 15:03


yura


10 Answers:

Matlab probably gives the most built-in support for the operations necessary for machine learning. There is also a lot of Matlab code available for specific algorithms. But for scalability, Alexandre is, of course, right: it's probably not your best choice.

I'd say a strong second choice is Java -- using Weka and Mallet as APIs gives you access to a wealth of algorithms, and there are solid optimization and matrix libraries, which are helpful.

With Python you've got NLTK, which is a great NLP resource, and numpy and scipy give solid numerical support. Though Python is probably eclipsed by the Matlab and Java libraries for ready-to-rock algorithms.
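As a minimal sketch of that Python stack (assuming nltk and numpy are installed, and that NLTK's "punkt" tokenizer data has been fetched once with nltk.download('punkt'); the example text is made up):

    import numpy as np
    from nltk import FreqDist, word_tokenize

    text = "Machine learning in Python is pleasant. Python glues the libraries together."
    tokens = [t.lower() for t in word_tokenize(text)]

    # Bag-of-words counts as a numpy vector, ready for numerical work in scipy.
    vocab = sorted(set(tokens))
    freqs = FreqDist(tokens)
    counts = np.array([freqs[w] for w in vocab])
    print(dict(zip(vocab, counts)))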

For C, you've got open-source distributed programs, but you've got to do some wrangling (or wrap them with scripts) to use them as an API. But it's a solid option.

To do research in machine learning, I'd say you should be at least conversant in matlab -- it's great for rapid prototyping and testing of ideas. To try out a bunch of existing algorithms on a task, my sense is that java is your best bet, but that's just my 2 cents.

answered Aug 01 '10 at 20:13


Andrew Rosenberg

edited Aug 01 '10 at 20:58


This is a good list, though I'll add one small caveat (personal opinion). Java just doesn't cut it for most high-performance numerical tasks. I've searched high and low; there isn't a Java library that compares with the likes of ATLAS and other fast BLAS implementations. As long as the Java API you use goes through JNI to reach C libraries, you'd probably be in good shape. The only exception I've seen to my statement about high-performance numerical tasks in Java: JTransforms seems able to compete with, and in some cases surpass, FFTW for Fourier transforms.

(Jun 02 '11 at 12:17) Brian Vandenberg

For Python, I would like to add to the above remarks that scikit-learn is really shaping up, with an extremely active community around it. It certainly does not compare to something like Weka in terms of features, but it is most probably easier to use for a non-expert, comes with detailed documentation, and plugs seamlessly into all the other niceties brought by Python, such as matplotlib for plotting, numpy for numerical computing, and the various text-processing (NLTK, for instance) and web-scraping libraries.
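For illustration, a minimal sketch of that workflow (hedged: this uses the modern "sklearn" import path, though early releases shipped as "scikits.learn", and the dataset and classifier here are just examples):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit, predict, score: the whole API surface for a first experiment.
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))

The arrays going in and the predictions coming out are plain numpy, so they drop straight into matplotlib for plotting.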

answered Nov 12 '10 at 04:30


Gael Varoquaux

edited Oct 20 '11 at 17:04


I'd check out MLOSS.org, which collects most of the open-source libraries, as a fast quantitative way to answer this question. People write in what they're comfortable with: MATLAB from the computer science community, R from statisticians, C++ from masochists :).

But don't pick a language just because you can call libSVM from it. What you want is support for implementing machine learning algorithms, which largely means linear algebra support. For linear algebra, people think MATLAB is the language of choice, but MATLAB is merely a thin wrapper around some open-source libraries, including LAPACK, which in turn calls BLAS. The warts on MATLAB come from this history: for linear algebra the syntax is great, but the minute you want to do something else, you run into a poorly thought-out, incoherent mess of a language.

MATLAB is fast because LAPACK and BLAS are fast. You can call LAPACK from any sensible language: in Python it forms the core of NumPy/SciPy. All the work is done by fast FORTRAN code optimized over decades; there is no reason you should see a speed difference, because the core operations are exactly the same code. You also don't want to write your own linear algebra routines: the whole point of LAPACK is to be architecture- and cache-aware, and it will be orders of magnitude faster than a naive couple-of-for-loops implementation of matrix multiplication.
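To make that concrete, a small sketch in Python (numpy/scipy assumed): the one-line solve goes straight to LAPACK, you can even call the same routine explicitly, and a hand-rolled triple loop is left far behind:

    import numpy as np
    from scipy.linalg import lapack

    n = 100
    A = np.random.rand(n, n)
    b = np.random.rand(n)

    x = np.linalg.solve(A, b)  # LAPACK's dgesv under the hood

    # The same LAPACK routine, called explicitly through scipy:
    lu, piv, x2, info = lapack.dgesv(A, b)
    assert info == 0 and np.allclose(x, x2)

    def naive_matmul(A, B):
        # Textbook triple loop: no cache awareness, interpreted inner loop.
        n, m, p = A.shape[0], A.shape[1], B.shape[1]
        C = np.zeros((n, p))
        for i in range(n):
            for j in range(p):
                for k in range(m):
                    C[i, j] += A[i, k] * B[k, j]
        return C

    B = np.random.rand(n, n)
    assert np.allclose(naive_matmul(A, B), A @ B)  # A @ B dispatches to BLAS dgemm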

Python's syntax for linear algebra is slightly more cumbersome, but outside of it you have all the amenities of a real programming language: it's what I'm using at my industry job now, and we all love it. You could use Java, but when I google Java and LAPACK I find pages that are nearly a decade out of date, which can't possibly be the state of things: what libraries should I search for?

One way to think about scaling is to think about what linear algebra is available. On a GPU, you can get a BLAS from nVidia, and you can call an implementation of some more advanced functions in MAGMA, which is written so that you can replace LAPACK calls with MAGMA calls. On clusters, parallelizing the linear algebra is a simple and sensible approach to scalability, and libraries exist for this. It's a bad match for MapReduce, but makes sense for MPI/real supercomputing.
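As one hedged illustration of the GPU route (CuPy is my example here, not a library named above; it mirrors the numpy API and dispatches matrix products to nVidia's cuBLAS, and it needs a CUDA-capable GPU with the cupy package installed):

    import cupy as cp

    A = cp.random.rand(2000, 2000, dtype=cp.float64)
    B = cp.random.rand(2000, 2000, dtype=cp.float64)

    C = A @ B               # runs as a cuBLAS GEMM on the GPU
    C_host = cp.asnumpy(C)  # copy back to host memory when needed
    print(C_host.shape)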

tl;dr : These aren't the libraries you're looking for. Just make sure you can call LAPACK from your language of choice.

answered Aug 06 '10 at 17:55


Vicente Malave


Here are some up-to-date Java wrappers for BLAS, LAPACK, and ATLAS: http://jblas.org/

(Jun 04 '11 at 23:52) Ben Mabey

What about R? Although it is primarily devoted to statistical analysis and data visualization, it also includes a plethora of machine learning methods.

answered Aug 06 '10 at 13:54


Andrej

Why was this downvoted? I will give a +1 once someone provides a constructive reason.

(Sep 12 '10 at 14:17) Lucian Sasu

@Imsasu: that's a good question. Apparently a user named Grzegorz ( http://metaoptimize.com/qa/users/754/grzegorz-chrupaa/ ) downvoted almost all answers in this thread. I wonder why.

If you feel like checking up on this sort of issue, just click on the user who posted a question/comment, and on his/her user page click on "Karma History"; you can see who downvoted which of their answers.

(Sep 12 '10 at 14:20) Alexandre Passos ♦

I'm curious about this too. R is a little awkward compared to Matlab for matrix manipulation but the extensive statistical and graphical tools seem like they might be very useful.

(Sep 23 '10 at 18:48) Miles Egan

R is a very powerful tool if you know how to use it correctly. I wouldn't say it is worse than Matlab.

(Jun 02 '11 at 10:09) Sergey Dolgopolov

I don't consider this a bad answer, though I will give my reasons for abandoning R for machine learning: 1) its interpreter isn't fast; if you want speed, the scripting language itself should only be used to tie fast C routines together (as a first attempt, I wrote an SVM in R, and it was dog slow compared to what I could do in other languages). 2) To get anything fast, you need to use R's C API to pull it off; that means if you're writing, say, a recurrent neural net, you need to write it in C. 3) Most of the already-available machine learning libraries are horridly out of date.

(Jun 02 '11 at 12:27) Brian Vandenberg

I think the best choice now is Python or Java libraries. But Python, as a dynamic language, is slow (the performance difference between Java and Python code can reach 50x). So I recommend using Java. Take a look at Apache Mahout: a scalable machine learning library written in Java.

answered Aug 02 '10 at 09:47


Sergey Dolgopolov


If Python is slow for numerical calculations, then you aren't doing it right. Using numpy gives optimized C routines for most tasks, using LAPACK etc. When using Python, numpy should be used for nearly every computation, with Python merely being the glue between calls and for saving/loading files.
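A minimal sketch of that glue-vs-computation split (numpy assumed): the same distance computation as an interpreted loop and as one vectorized call:

    import numpy as np

    X = np.random.rand(1000, 50)
    q = np.random.rand(50)

    # Pure-Python loop: the interpreter touches every element.
    d_loop = np.array([sum((x_i - q_i) ** 2 for x_i, q_i in zip(x, q)) ** 0.5
                       for x in X])

    # Vectorized: one pass through optimized C routines.
    d_vec = np.sqrt(((X - q) ** 2).sum(axis=1))

    assert np.allclose(d_loop, d_vec)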

(Aug 07 '10 at 01:08) Robert Layton

Indeed, Python has no reason to be slow, since most numerical computing packages rely on optimized C code for the CPU-intensive loops. In the case of machine learning, premature optimization (for instance the choice of a 'fast' language) is particularly useless, since the algorithmic costs of the different methods vary so much: you are better off using a good algorithm in a slow language than the converse.

For machine learning, Python is starting to have some state-of-the-art algorithms available out of the box that rely on its good scientific computing facilities (excellent general-purpose libraries built on standard Fortran or C++ packages, and good bindings to C). For instance, some of scikit-learn's algorithms are within a factor of two of hand-optimized SIMD C code.

(Nov 12 '10 at 04:25) Gael Varoquaux

For not-scalable research work, Matlab is the standard. If you really want scalability you're better off going after individual libraries, but Java, Python, and C (in the form of standalone programs, mainly) have lots of options.

However, for whatever purpose you might have, do search for specific libraries. I don't think there's any language with very good, high performance, scalable, state of the art libraries for all problems.

answered Aug 01 '10 at 17:23


Alexandre Passos ♦

That last sentence "I don't think there's any language with very good, high performance, scalable, state of the art libraries for all problems." sounds somewhat like the "No free lunch theorem".

(Jul 17 '11 at 02:47) 101010

"How well do they scale?"

Well, Apache Mahout is specifically designed to be very scalable, by relying on Hadoop, the Apache implementation of Map/Reduce.

From the Mahout webpage:

Our core algorithms for clustering, classification and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.

Of course Mahout doesn't tie you exclusively to Hadoop either. Also from their website:

However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms.
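As a toy illustration of that map/reduce style (plain Python, not Mahout's actual code): one k-means step, with the map phase assigning points to centroids and the reduce phase averaging each group:

    from collections import defaultdict
    import numpy as np

    def kmeans_step(points, centroids):
        # Map: emit (nearest-centroid index, point) pairs.
        pairs = [(int(np.argmin([np.linalg.norm(p - c) for c in centroids])), p)
                 for p in points]
        # Shuffle: group points by centroid key.
        groups = defaultdict(list)
        for k, p in pairs:
            groups[k].append(p)
        # Reduce: each key's points are averaged into a new centroid.
        return [np.mean(groups[k], axis=0) if groups[k] else centroids[k]
                for k in range(len(centroids))]

    points = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (9, 9), (8, 9)]]
    print(kmeans_step(points, [np.array([0.0, 0.0]), np.array([9.0, 9.0])]))

In a real Hadoop job, the map and reduce phases run on separate machines over partitions of the data, which is where the scalability comes from.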

Mahout and Hadoop also both have large, active developer communities around them, and are under constant development. If you already know Java, you could do a lot worse than going with Java and Mahout.

And, if Mahout doesn't fit your needs, there are also Weka, Mallet, JOONE, OpenNLP, jBNC, JGAP, and Neuroph, among others.

This answer is marked "community wiki".

answered Nov 15 '10 at 12:49


Phillip Rhodes

edited Nov 17 '10 at 23:41

The question is about the range of libraries already implemented, rather than the suitability of the language for implementing new libraries. For that, R is hard to beat, and almost all the libraries can be found in centralized repositories: CRAN, R-Forge & Bioconductor. The other nice thing is that all libraries work on R's data frame objects, so you do not need to spend your time munging your data between the formats required by different libraries. An interactive console is also a plus.

answered Jul 21 '11 at 11:46


Daniel Mahler

Another option, if you want a high-performance and scalable library, is OpenCV. OpenCV is a library written in C and C++ and heavily optimized. It is developed with computer vision in mind, but it includes data structures for vectors and matrices with operator overloading, so usage becomes friendlier. A list of the machine learning algorithms for version 2.1 can be found here. Guides exist for using it from Python and from Java, I believe, although I have not tried either. Hope this provides an interesting alternative.
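For a hedged taste of the Python route (this uses the modern cv2.ml API, which differs from the version-2.1 interface the answer links to; the two-blob dataset is made up for illustration):

    import cv2
    import numpy as np

    # Two 2-D Gaussian blobs as training data; labels 0 and 1.
    train = np.vstack([np.random.randn(50, 2),
                       np.random.randn(50, 2) + 5]).astype(np.float32)
    labels = np.array([0] * 50 + [1] * 50, dtype=np.int32).reshape(-1, 1)

    knn = cv2.ml.KNearest_create()
    knn.train(train, cv2.ml.ROW_SAMPLE, labels)

    query = np.array([[4.5, 4.5]], dtype=np.float32)
    ret, results, neighbours, dist = knn.findNearest(query, k=3)
    print("predicted label:", int(results[0, 0]))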

answered Feb 07 '11 at 11:57


Carsten Lygteskov Hansen

Taking a broad definition of "machine learning" (in which I'd include many inferential statistics techniques), based on what I've seen, I imagine that MATLAB and the C/C++/Java family likely have the largest bases of support (including free and commercial code, number of users, publication in books and periodicals, etc.). Whether that makes those languages or their existing code bases best for any particular purpose is another matter.

answered Feb 07 '11 at 21:13


Will Dwinnell


