To follow up on my earlier question: I'd like to install Hadoop on my office machine and play with it. It is a relatively old machine, and you can read about its specs in the question that I posted over at OR-exchange.

What is the easiest way to get exposed to Hadoop and play with it? Can I install it on an Ubuntu machine (single CPU with 2 cores), or do I need a cluster or EC2 machines?

asked Jul 04 '10 at 21:02

Mark Alen


3 Answers:

The simplest possible way is to run Hadoop in standalone mode, which requires no configuration at all. This is an extended version of the script in the Hadoop quick start tutorial, assuming only that Java is already installed:

$ export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
$ wget http://mirror.its.uidaho.edu/pub/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ tar xzf hadoop-0.20.2.tar.gz
$ cd hadoop-0.20.2
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
$ cat output/*

Set JAVA_HOME to the correct value for your system. This runs a simple grep operation on the files in the input directory, using the supplied example JAR file. Output will be written to the newly created output directory.

To run your own MapReduce jobs, simply substitute the examples JAR with your own MapReduce JAR file. (Note that the output directory will not be overwritten, so old results must be moved or deleted.)
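Since the job will not overwrite an existing output directory, it can help to move the old results aside before each run. A minimal sketch, assuming a POSIX shell; the function name rotate_output and the timestamp suffix are my own choices for illustration, not anything Hadoop provides:

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): rename an existing output
# directory with a timestamp suffix so the next job run can create a
# fresh one. Prints the new name; does nothing if the directory is absent.
rotate_output() {
    dir="$1"
    if [ -d "$dir" ]; then
        backup="$dir.$(date +%Y%m%d-%H%M%S)"
        mv "$dir" "$backup"
        echo "$backup"
    fi
}
```

You would call `rotate_output output` right before the `bin/hadoop jar ...` line. Deleting the directory with `rm -r output` works just as well if you do not need the old results.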

Once this works for you, try running the same job in pseudo-distributed mode and only after that in fully distributed mode.
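For pseudo-distributed mode the main extra step is three small configuration files. Here is a rough sketch that writes them from the shell. The property names and ports are the ones I recall from the Hadoop 0.20 quick start, so treat them as assumptions and check them against the documentation for your release; the function names are made up for this example. Note that it overwrites any existing files with the same names.

```shell
#!/bin/sh
# Sketch: write one minimal Hadoop site file containing a single property.
# $1 = file path, $2 = property name, $3 = property value
write_site_file() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Write the three site files for a single-node, pseudo-distributed setup
# into the given conf directory. Values are assumed from the 0.20 quick
# start: HDFS on localhost:9000, one replica, JobTracker on localhost:9001.
write_pseudo_conf() {
    conf_dir="$1"
    write_site_file "$conf_dir/core-site.xml"   fs.default.name    hdfs://localhost:9000
    write_site_file "$conf_dir/hdfs-site.xml"   dfs.replication    1
    write_site_file "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:9001
}
```

After running `write_pseudo_conf conf` in the Hadoop directory, the quick start has you set up passphraseless ssh to localhost, format the filesystem with `bin/hadoop namenode -format`, start the daemons with `bin/start-all.sh`, and copy the input into HDFS with `bin/hadoop fs -put conf input` before running the same grep job. The results then live in HDFS, so you read them with `bin/hadoop fs -cat 'output/*'` rather than plain `cat`.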

answered Jul 05 '10 at 06:42

Thomas Brox Røst

Yes, you can set it up on a single machine. If you are using Windows, you can use the VM distributed by Cloudera: http://www.cloudera.com/developers/downloads/virtual-machine/
On Ubuntu you can install Hadoop with apt-get from the Cloudera repositories:
https://docs.cloudera.com/display/DOC/Enterprise+Documentation+Home
[I am suggesting Cloudera because their installation process goes through apt-get and is a bit easier to follow.]

You can easily work with Hadoop even on a single-core machine. On a multi-core machine you can use pseudo-distributed mode, where each Hadoop daemon runs in its own Java process, so the cores get used much as if they were different machines. You can develop, test, and execute jobs without any issues.

You require EC2 or a cluster only when you actually want scalability and have a large dataset.

answered Jul 04 '10 at 21:14

DirectedGraph

This is also a good resource: http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html

(Jul 05 '10 at 22:53) DirectedGraph

If you're starting from scratch & using the Apache distro (recommended), the following tutorial should do it: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

answered Jul 05 '10 at 16:54

Delip Rao


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.