I am not sure whether this question is well-posed, but I am asking it to resolve some doubts I have. It is a very open-ended question, and I am sure the advice will help me and other people who have the same doubts.
I think you have some concepts a bit mixed up. Hadoop, as far as I know, is a useful tool if you want to work in a distributed environment, that is, if you want to run your algorithms on parallel machines; MapReduce, likewise, is a really good model for working with parallel cores. That said, if you are choosing Java for Hadoop, I recommend you think twice, since using Hadoop for ML algorithms (if you are starting from zero) is pretty daunting. In my opinion, Java is not the best tool for ML implementations, since it lacks convenient native support for large floating-point numbers. Python, on the other hand, is pretty good with numerical work, and I would recommend it, since it is easy to get good implementations working quite fast. Matlab is another tool you might use; it also gives good results and has a gentle learning curve.
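To illustrate the numeric-support point, here is a tiny sketch (nothing project-specific assumed) of Python's built-in arbitrary-precision integers and its standard decimal module, the kind of thing that in Java would require explicit BigInteger/BigDecimal handling:

    # A sketch of Python's built-in big-number support; the Java
    # equivalent needs explicit BigInteger/BigDecimal objects.
    from decimal import Decimal, getcontext

    print(2 ** 200)          # integers grow without overflow
    getcontext().prec = 50   # work with 50 significant decimal digits
    print(Decimal(1) / Decimal(7))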
In my opinion, there is no single right tool for ML/DM. For example, if you know Java you can call the Weka API, and you can also add your own classifier/regression model/clusterer/etc. to Weka. This makes sense if you want to compare various algorithms against your own model using the Weka Experimenter (a great tool for statistical comparisons). Another example would be the R statistical package: see the Data Mining Desktop Survival Guide and Torgo's book. As @Leon says, Python is widely used, and it is my belief that Python is a better approach for rapid prototyping of different ML models.
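To show what that rapid prototyping looks like, here is a minimal sketch (assuming scikit-learn; the dataset and the three models are just placeholders) that cross-validates a few classifiers side by side, roughly what one would set up in the Weka Experimenter:

    # A sketch of comparing several classifiers by 10-fold cross-validation.
    # Assumes scikit-learn; dataset and model choices are placeholders.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    models = {
        "decision tree": DecisionTreeClassifier(random_state=0),
        "naive Bayes": GaussianNB(),
        "k-NN": KNeighborsClassifier(),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=10)
        print("%s: mean accuracy %.3f (std %.3f)" % (name, scores.mean(), scores.std()))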
The main problem with Weka is that it gets slow really quickly; it's hard to do real-world implementations with it, though it is really good for small-scale problems. (Jun 23 '11 at 05:39) Leon Palafox ♦
@Leon: good to know... (Jun 23 '11 at 06:35) Lucian Sasu
If you're trying to do machine learning on big data, then you will probably stumble on Hadoop. You can program your own distributed algorithms in MapReduce with Python, which has a few advantages over Java that you have probably already grasped, and stream them over your cluster with the Hadoop Streaming API. If you want to go with Mahout, read Mahout in Action, which is the only complete material about it. As a last note, make sure that your data really is BIG: if you're not at the tera scale, or at least in the high gigabytes, you can probably find other solutions that will be faster than Hadoop. For example, GraphLab, MLPACK, and Shogun are C++ toolboxes for scalable machine learning. There are also other options if your problem is big computation rather than big data, in which case I would generally advise fast parallel C++ algorithms, GPUs, and tools such as Theano that compile their jobs down to those targets. Since you didn't give details about what you want to do, I can't be more precise, but I hope this helps.
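To make the Streaming route concrete, here is a minimal word-count sketch: a Python mapper and reducer that read stdin and write stdout, which is all Hadoop Streaming requires. The mapper:

    #!/usr/bin/env python
    # mapper.py -- emits "word<TAB>1" for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

and the matching reducer:

    #!/usr/bin/env python
    # reducer.py -- sums counts per word; Streaming sorts mapper output
    # by key, so all lines for a given word arrive consecutively.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

You would then launch the job along these lines (the streaming jar path varies by Hadoop version, and the /user/me/... HDFS paths are placeholders):

    hadoop jar /path/to/hadoop-streaming.jar \
        -input /user/me/input -output /user/me/output \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py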