
I have read the description of Hadoop on Wikipedia, and I have noticed that many machine learning folks are using Hadoop to overcome CPU-intensive number-crunching burdens.

I am wondering, though: what is Hadoop, and why should I use it? I know how to use EC2 and how to distribute my algorithms across a bunch of different machines. Can you explain to me why Hadoop is of interest to people?

P.S. And what is Pig?

asked Jul 03 '10 at 15:31

Mark Alen

edited Jul 03 '10 at 16:08

Joseph Turian ♦♦

How do you currently distribute your jobs in EC2, and how do you coordinate them? Just curious.

(Jul 03 '10 at 15:52) Joseph Turian ♦♦
I am also interested in its answer. +1

(Jul 03 '10 at 15:52) zengr

2 Answers:

Amazon provides Elastic MapReduce, which makes running Hadoop jobs much easier. You can sign up for the service, upload the data to S3 buckets via the browser interface, and launch jobs using an online wizard (an API is also available). http://aws.amazon.com/elasticmapreduce/

Amazon EMR is based on EC2, with the smallest instance costing around 20 cents per hour. You can also build your own cluster by hacking together several EC2 instances, but I would advise against it since it is quite complicated.
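
If you prefer to drive EMR programmatically instead of through the browser wizard, the AWS SDK exposes the same functionality. Below is a rough sketch using the AWS SDK for Java; the credentials, bucket names, jar path, and instance types are placeholders I made up for illustration, not values from this thread.

// A rough sketch of launching a Hadoop job on Elastic MapReduce through the
// AWS SDK for Java; credentials, bucket names, jar path, and instance types
// below are illustrative placeholders.
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class LaunchEmrJob {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    // One step: run a job jar already uploaded to S3 against input data in S3.
    StepConfig step = new StepConfig()
        .withName("word-count")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://my-bucket/jobs/wordcount.jar")
            .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/"));

    // A small cluster: one master node and two workers.
    JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
        .withInstanceCount(3)
        .withMasterInstanceType("m1.small")
        .withSlaveInstanceType("m1.small");

    RunJobFlowResult result = emr.runJobFlow(new RunJobFlowRequest()
        .withName("example job flow")
        .withLogUri("s3://my-bucket/logs/")
        .withInstances(instances)
        .withSteps(step));

    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}

By default the job flow spins up its own short-lived cluster, runs the step, and shuts down when the step finishes, so you only pay for the hours the job actually uses.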

Regarding Pig, it's a simple data-processing language which is generally used with tabular data. A simple Pig script is:

Data = LOAD 'data.txt';            /* loads a tab-delimited file (other delimiters are possible; tab is the default) */
FData = FILTER Data BY $0 != '1';  /* keeps records whose first field is not '1' */
GFData = GROUP FData BY $0;        /* groups the filtered records by their first field */
STORE GFData INTO 'result';        /* writes the result to the 'result' directory */

The beauty of Pig is that the above script is converted into MapReduce code and scales when run on a cluster. You can also write and test Pig scripts in local mode (e.g. by starting Pig with the -x local flag) for rapid development.

answered Jul 03 '10 at 19:59

DirectedGraph

edited Jul 03 '10 at 20:07

Hadoop is a (re)implementation of the MapReduce framework that was popularized by Google. MapReduce is essentially a formalization of a process you've probably done informally multiple times: you want to "process" each datum of a dataset. You split the data into chunks, and each machine gets a chunk. The machine performs a user-defined "map" operation on each datum in its chunk, which is usually a processing kind of operation (e.g. counting tokens). Then, during the reduce phase, the results from multiple machines are aggregated (or "reduced") in a user-specified way; for instance, adding together word counts. The benefit is primarily that it removes some of the code you've written to do this over and over again, and that it handles fault tolerance for you. Depending on how often you do this sort of thing, though, the time to learn and deploy Hadoop may not be worth it relative to an ad hoc solution.
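
To make the map and reduce phases concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; it is the classic example rather than anything specific to this thread, and the class names and paths are just illustrative. The mapper emits a (token, 1) pair for every token it sees, and the reducer sums those counts per token.

// A minimal word-count sketch using the org.apache.hadoop.mapreduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (token, 1) for every token in the input line.
  public static class TokenMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each token across all mappers.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // local pre-aggregation on each mapper
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You would package this into a jar and submit it with something like hadoop jar wordcount.jar WordCount <input> <output>; the framework handles splitting the input, scheduling the map and reduce tasks, and retrying tasks on failed machines.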

In order to run Hadoop you need to do several things to a cluster of machines, including giving them access to a specialized Hadoop file system (HDFS) that they can all read from and write to. EC2 takes care of these and many other configuration issues, so it removes some barriers to casual use of Hadoop.
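
To give a sense of what access to that shared file system looks like in code, here is a small sketch of writing and reading an HDFS file through the org.apache.hadoop.fs.FileSystem API; the NameNode address and paths are made-up placeholders, and on a real cluster the address would normally come from the cluster's configuration files rather than being hard-coded.

// A small sketch of writing and then reading a file on HDFS via the
// org.apache.hadoop.fs.FileSystem API; the NameNode address and paths
// are illustrative placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed NameNode address; normally picked up from core-site.xml.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
    FileSystem fs = FileSystem.get(conf);

    // Write a small file that every node in the cluster can then read.
    Path out = new Path("/user/demo/hello.txt");
    FSDataOutputStream os = fs.create(out, true);
    os.writeBytes("hello hdfs\n");
    os.close();

    // Read it back.
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(out)));
    System.out.println(reader.readLine());
    reader.close();
    fs.close();
  }
}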

Pig is an Apache project meant to make many "data analysis" Hadoop problems easier to execute. It's essentially a mini-language for writing queries that, behind the scenes, run as Hadoop jobs. The queries can count words and do some simple relational-algebra-like operations on text. It doesn't have the full expressiveness of the jobs you can write against Hadoop directly, but I think it hits a sweet spot for a lot of what is done in document analysis.

If you're interested in more convention-over-configuration frameworks for Hadoop, I recommend crane. It is a Clojure framework which makes it very easy to configure and run Hadoop jobs on EC2.

answered Jul 03 '10 at 16:22

aria42

edited Jul 03 '10 at 16:28

What does "convention-over-configuration" mean?

(Jul 03 '10 at 20:06) Joseph Turian ♦♦

It refers to a software design approach where a process that in principle has many knobs doesn't immediately expose them all to the client; instead, a set of sensible defaults is chosen so as not to overwhelm the client. Ruby on Rails is, I think, a recent canonical example of convention over configuration.

(Jul 03 '10 at 21:03) aria42

I would also recommend Dumbo, which is a Python interface for running jobs on Hadoop.

(Jul 03 '10 at 21:31) DirectedGraph