
I am not sure whether this question is appropriate here, but I am asking it to resolve some doubts I have.

  • For Machine Learning/Data Mining we need to learn from data, which often means learning Hadoop, whose MapReduce implementation is in Java (correct me if I am wrong).
  • Hadoop also provides a streaming API to support other languages (like Python).
  • Most grad students/researchers I know solve ML problems in Python.
  • We see job posts asking for the Hadoop and Java combination very often.

From what I have observed, Java and Python are the most widely used languages in this domain.

  • My question is: what is the most popular language for working in this domain?
  • What factors are involved in deciding which language/framework one should choose?
  • I know both Java and Python, but I am always unsure:
  • whether I should start programming in Java (because of the Hadoop implementation)
  • whether I should start programming in Python (because it's easier and quicker to write)

This is a very open-ended question, but I am sure the advice might help me and others who have the same doubt.
Thanks a lot in advance.
P.S. I also posted this question on Stack Overflow.

asked Jun 21 '11 at 13:58

daydreamer

edited Jun 21 '11 at 13:58


2 Answers:

I think you have some concepts a bit mixed up:

Hadoop, as far as I know, is a useful tool if you want to work in a distributed environment, that is, if you want to run your algorithm in parallel across several computers. MapReduce is likewise a really good tool for working with parallel cores.
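To make the MapReduce idea a bit more concrete, here is a minimal sketch of a word count written in the style of the streaming API mentioned in the question. It is only an illustration: the function names (`mapper`, `reducer`, `run_local`) are hypothetical, and the whole thing runs on a single machine, with a plain `sorted` call standing in for the shuffle/sort phase that a real cluster would perform between the two steps.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1


def reducer(pairs):
    """Reduce step: sum the counts of each word.

    Assumes the pairs arrive sorted by key, which is the job of the
    shuffle/sort phase between the map and reduce steps.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce on one machine (no Hadoop involved)."""
    mapped = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(mapped))


if __name__ == "__main__":
    # Sample input invented for illustration.
    sample = ["Hadoop runs MapReduce jobs", "Python runs on Hadoop too"]
    for word, count in sorted(run_local(sample).items()):
        print(f"{word}\t{count}")
```

The point of the split is that the map step looks at each line independently and the reduce step looks at each key independently, so both can be spread over many machines. Nothing here is specific to Java.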

That said, if you are choosing Java because of Hadoop, I recommend you think twice, since using Hadoop for ML algorithms (if you are starting from zero) is pretty daunting.

In my opinion, Java is not the best tool for ML implementations, since Java has a problem with native support for large floating-point numbers. Python, on the other hand, is pretty good with calculations.

I would recommend working in Python, since it is easy to get good implementations quite fast. Matlab is another tool you might use; it also offers good results and a gentle learning curve.
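As a rough illustration of how little code a first prototype can take in Python, here is a sketch of a k-nearest-neighbours classifier using only the standard library. The function name `knn_predict` and the toy data are made up for this example, and it assumes Python 3.8 or later for `math.dist`; for real work you would likely reach for an established numerical library instead.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train = [
    ((1.0, 1.1), "a"),
    ((1.2, 0.9), "a"),
    ((0.9, 1.0), "a"),
    ((5.0, 5.2), "b"),
    ((5.1, 4.8), "b"),
    ((4.9, 5.0), "b"),
]

if __name__ == "__main__":
    print(knn_predict(train, (1.1, 1.0)))
    print(knn_predict(train, (5.0, 5.0)))
```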

answered Jun 21 '11 at 23:14

Leon Palafox

In my opinion, there is no single right tool for ML/DM. For example, if you know Java you can call the Weka API, and you can also add your own classifier, regression model, clusterer, etc. to Weka. This makes sense if you want to compare various algorithms against your model using the Weka Experimenter (a great tool for statistical comparisons).

Another example would be the R statistical package: see the Data Mining Desktop Survival Guide and Togo's book. As @Leon says, Python is widely used, and I believe Python is the better choice for rapid prototyping of different ML models.

answered Jun 22 '11 at 14:40

Lucian Sasu

The main problem with Weka is that it gets slow really quickly. It's hard to use for real-world implementations, though it is really good for small-scale problems.

(Jun 23 '11 at 05:39) Leon Palafox

@Leon: good to know...

(Jun 23 '11 at 06:35) Lucian Sasu

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.