
I am not sure whether this is the right kind of question to ask here, but I am asking to resolve some doubts I have.

  • For machine learning/data mining, we need to work with large amounts of data, which seems to mean learning Hadoop, whose MapReduce implementation is in Java (correct me if I am wrong).
  • Hadoop also provides a streaming API to support other languages (like Python).
  • Most grad students/researchers I know solve ML problems in Python.
  • We see job posts asking for the Hadoop and Java combination very often.

From my observation, Java and Python are the most widely used languages in this domain.

  • My question is: what is the most popular language for working in this domain?
  • What factors are involved in deciding which language/framework one should choose?
  • I know both Java and Python but am always confused about:
  • whether I should start programming in Java (because of the Hadoop implementation), or
  • whether I should start programming in Python (because it is easier and quicker to write).

This is a very open-ended question, but I am sure the advice will help me and other people who have the same doubt.
Thanks a lot in advance.
P.S. I also posted this question on Stack Overflow.

asked Jun 21 '11 at 13:58

daydreamer

edited Jun 21 '11 at 13:58


3 Answers:

I think you have some concepts a bit mixed up:

Hadoop, as far as I know, is a useful tool if you want to work in a distributed environment, that is, if you want to run your algorithms on parallel machines. MapReduce, likewise, is a really good model for working with parallel cores.
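To make the model concrete, here is a toy, in-memory sketch of the MapReduce pattern (word count) in plain Python. This is purely illustrative: real Hadoop distributes the map and reduce phases across a cluster and performs the shuffle for you.

    from collections import defaultdict

    def map_phase(document):
        # Emit one ("word", 1) pair per word, as a Hadoop mapper would.
        for word in document.split():
            yield word.lower(), 1

    def shuffle_phase(pairs):
        # Group values by key, which Hadoop does between map and reduce.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Collapse all values for one key into a single result.
        return key, sum(values)

    documents = ["the cat sat", "the cat ran"]
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
    print(counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}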

That said, if you are choosing Java because of Hadoop, I'd recommend you think twice, since using Hadoop for ML algorithms (if you are starting from zero) is pretty daunting.

In my opinion, Java is not the best tool for ML implementations, since Java has a problem with native support for large floating-point numbers. Python, on the other hand, is pretty good with such calculations.
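For what it's worth, a quick sketch of the kind of out-of-the-box numeric convenience this refers to: Python's integers never overflow, and the standard-library decimal module gives arbitrary-precision decimals (Java does offer BigInteger/BigDecimal, but not as native types).

    from decimal import Decimal, getcontext

    # Integers grow without overflow (a Java long would overflow past 2**63).
    print(2 ** 100)  # 1267650600228229401496703205376

    # Arbitrary-precision decimal arithmetic from the standard library.
    getcontext().prec = 50  # 50 significant digits
    print(Decimal(1) / Decimal(7))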

I would recommend you work in Python, since it is easy to get good implementations up and running quite fast. Matlab is another tool you might use; it also gives good results and has a shallow learning curve.
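As a minimal sketch of how fast prototyping in Python can be, here is a cross-validated classifier in a few lines using scikit-learn (my choice of library for the example; the thread itself doesn't name one):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)            # toy dataset bundled with the library
    clf = DecisionTreeClassifier(random_state=0)
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print("mean accuracy: %.3f" % scores.mean())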

answered Jun 21 '11 at 23:14

Leon Palafox ♦

In my opinion, there is no single right tool for ML/DM. For example, if you know Java you can call the Weka API, and you can also add your own classifier/regression model/clusterer/etc. to Weka. This makes sense if you want to compare various algorithms against your model using the Weka Experimenter (a great tool for statistical comparisons).

Another example would be the R statistical package: see the Data Mining Desktop Survival Guide and Torgo's book. As @Leon says, Python is widely used, and it is my belief that Python is a better approach for rapid prototyping of different ML models.

answered Jun 22 '11 at 14:40

Lucian Sasu

The main problem with Weka is that it gets slow really quickly and it's hard to do real-world implementations with it, but it is really good for small-scale problems.

(Jun 23 '11 at 05:39) Leon Palafox ♦

@Leon: good to know...

(Jun 23 '11 at 06:35) Lucian Sasu

If you're trying to do machine learning on big data, then you will probably stumble on Hadoop. You can program your own distributed algorithms in MapReduce with Python, which has a few advantages over Java that you have probably already grasped, and run them over your cluster with the Hadoop Streaming API (see the sketch below).
But you can also use Mahout (http://mahout.apache.org), a Java framework with several machine learning algorithms ready to use (mainly for recommendation, classification and clustering).
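As a minimal sketch of the streaming approach (the file names mapper.py and reducer.py are my own, and the exact path of the streaming jar varies by Hadoop version), a word count would look like this:

    # mapper.py -- reads raw lines from stdin, emits one "word<TAB>1" per word.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

    # reducer.py -- Hadoop Streaming sorts mapper output by key, so all
    # lines for the same word arrive consecutively on stdin.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

You would then run them over the cluster with something like the following (the scripts need a python shebang and execute permission, or pass them as "python mapper.py"):

    hadoop jar /path/to/hadoop-streaming.jar \
        -input /data/in -output /data/out \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py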

If you want to go with Mahout, read Mahout in Action, which is the only complete material about it.
If you prefer Python instead, you still have to know how to translate your code into MapReduce, and for that Hadoop: The Definitive Guide is a good read.

As a last note, I would say you must make sure your data is really BIG. If you're not at the tera scale, or at least the high giga scale, you can probably find other solutions that will be faster than Hadoop. For example, GraphLab, MLPACK and Shogun are C++ toolboxes for scalable machine learning.

There are also other options if your problem is big computation instead of big data, in which case I would generally advise going with fast parallel C++ algorithms, GPUs, and tools that translate their jobs onto those platforms, such as Theano, for example.

Since you didn't give details about what you want to do, I can't be more precise. But I hope this helps.

answered May 19 '13 at 21:32

edersantana

edited May 19 '13 at 21:51
