
What are the best and cheapest options to run MapReduce/Hadoop jobs remotely (or simple grid jobs)? Say, I want to run my large-scale experiments on 500 or 1000 computers? Is Amazon EC2 still the best service? Are there cheaper ones by now?

asked Jul 23 '10 at 11:06


Frank


2 Answers:

Amazon EC2 / EMR is the cheapest (and, as far as I know, the only) service where you can create a Hadoop cluster with ease.

The smallest instance costs around 10 cents per hour. Additionally, you need to fill in a special request form if you plan to run more than 20 nodes at a time.

You can bid for spot instances, which are cheaper. Amazon's spot rate varies hourly according to instance availability. If your Hadoop job is granular enough, you can bid on instances; you just need to make sure the system is completely automated.

I checked spot-instance prices yesterday: for nearly 80% of the past week you could get a small instance for 4 cents per hour (nearly half the default price). You can check the trend in your AWS console.
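The "make sure the system is completely automated" part includes choosing a bid. Not from the thread, but a toy sketch of one policy: bid a margin above the recent median spot price, capped at the on-demand rate (the margin, the helper name, and all prices are hypothetical, loosely echoing the 4-cent / ~8-cent figures above):

```python
def choose_bid(recent_prices, on_demand, margin=1.25):
    """Bid a margin above the median of recent spot prices,
    but never above the on-demand rate (no point paying more)."""
    prices = sorted(recent_prices)
    median = prices[len(prices) // 2]
    return min(round(median * margin, 4), on_demand)

# last week's small-instance spot prices hovering near $0.04,
# on-demand around $0.085 (rough figures from the answer above)
print(choose_bid([0.04, 0.038, 0.041, 0.04, 0.045], 0.085))  # → 0.05
```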

answered Jul 23 '10 at 11:14


DirectedGraph

edited Jul 23 '10 at 11:21

So, based on 4 cents per hour per CPU -- 1000 computers running for 24 hours would be roughly $1,000 (1000 x 24 x $0.04 = $960). Looks like at that scale it's still too expensive ...
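Frank's back-of-the-envelope figure can be checked directly. A tiny sketch (working in integer cents to avoid float rounding; the flat-rate assumption is Frank's, not how spot prices actually behave):

```python
def cluster_cost_cents(nodes, hours, cents_per_node_hour):
    """Total cost, in cents, of a fixed-size cluster at a flat hourly rate."""
    return nodes * hours * cents_per_node_hour

# 1000 nodes, 24 hours, 4 cents per node-hour
print(cluster_cost_cents(1000, 24, 4) / 100)  # → 960.0 dollars
```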

(Jul 23 '10 at 12:41) Frank

I doubt you can get 1000 nodes at that price; you would probably end up driving the price up yourself. Also, I wonder why one would need 1000 nodes unless you are doing some extremely heavy lifting, like image processing on, say, a billion images.

You can also buy extra-large High-Memory / High-CPU instances; they might give you better value for the same price. I prefer the large high-memory instance, since it gives you 17 GB of RAM with 7 compute cores.

(Jul 23 '10 at 13:12) DirectedGraph

1000 nodes would be useful if you want to, say, use an expensive model for machine translation with hidden variables -- summing over all hidden alignments instead of the 1-best GIZA alignment, a high-polynomial decoding algorithm, an included topic model, translating multiple sentences jointly, etc. -- on a billion words of text. :)

(Jul 23 '10 at 19:00) Frank

Actually, make that 100,000 nodes ... ;-)

(Jul 23 '10 at 19:01) Frank

I had a 6-minute movie where each frame took 1 second to render using a Java applet. Sequentially, single-threaded, that would take three hours: 30 frames/sec * 60 sec * 6 min = 10,800 frames = 10,800 seconds = 180 minutes = 3 hours.

So, I used 20 Amazon EC2 instances. I think my rate at the time was $0.06 per cpu-hour. I used the smallest instances.

On 20 machines, you literally just divide the time by 20: 180 minutes / 20 = 9 minutes.

I ran one machine for a little while to build my disk image, then ran the 20 machine copies for a couple of hours. That came to something like 3 hours on 20 CPUs = 60 cpu-hours, plus another 6 hours or so on a single CPU for the initial setup and just playing around with the system. 66 cpu-hours at $0.06 = $3.96.
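The whole budget above fits in a few lines. A sketch that just replays the numbers from this answer (integer cents again, to keep the arithmetic exact):

```python
# numbers from the account above
frames = 30 * 6 * 60                        # 30 fps, 6-minute movie → 10800 frames
sequential_minutes = frames * 1 / 60        # 1 second per frame → 180.0 minutes
parallel_minutes = sequential_minutes / 20  # spread across 20 machines → 9.0

cpu_hours = 3 * 20 + 6        # ~3 wall-clock hours on 20 machines, plus setup time
bill_cents = cpu_hours * 6    # at 6 cents per cpu-hour
print(frames, parallel_minutes, bill_cents / 100)  # → 10800 9.0 3.96
```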

So for those 3 hours I had a system that cut a 3-hour render down to 9 minutes, allowing me 20 renders in the time I would otherwise have had for just one. In my experience with this one simple task, the scaling was extremely linear.

So, the math was really easy. I'm still paying the $0.10 a month for them to keep that disk image, and it takes only about 5 minutes to set up. If you plan out exactly what you want to accomplish, I think it works really well. When I consider whether a project is a good fit for this system, the more embarrassingly parallel the problem, the better. I imagine the cost scales with the amount of work you're doing, too; so if you have a job for 1000 machines, I'm sure it's because you have something that is, or will be, worth the expense.

Here's an (uncommented) Python script that gives me a unified terminal shell for sending basic commands to 20 Linux images over ssh, plus some badly written upload/download/split routines.

http://gist.github.com/281504
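The gist's code isn't reproduced in the thread, but the core idea -- fan one command out to N hosts over ssh and gather the output -- can be sketched in a few lines. This is my own minimal version, not the gist's code; the `use_ssh=False` switch is an assumption I added so it can be tried locally without real machines:

```python
import subprocess

def broadcast(hosts, command, use_ssh=True):
    """Run `command` on each host concurrently and collect its stdout.

    With use_ssh=False the command runs in a local shell instead,
    which is handy for testing without a cluster.
    """
    procs = []
    for host in hosts:
        argv = ["ssh", host, command] if use_ssh else ["sh", "-c", command]
        procs.append((host, subprocess.Popen(
            argv, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)))
    # communicate() waits for each process and drains its pipe
    return {h: p.communicate()[0].decode().strip() for h, p in procs}

# local smoke test (hostnames here are just dictionary keys):
print(broadcast(["node1", "node2"], "echo hello", use_ssh=False))
```

Real usage would pass actual hostnames with `use_ssh=True` and assume passwordless ssh keys are already set up on the images.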

answered Jul 26 '10 at 23:48


th0ma5

edited Jul 26 '10 at 23:53

Hi th0ma5, do you have more Python code to manage EC2? I would love to see it if you are on GitHub.

(Jul 27 '10 at 08:05) Mark Alen

This is a good resource: http://github.com/infochimps/cluster_chef -- though I guess it's in Ruby.

(Jul 27 '10 at 15:58) DirectedGraph

No, I don't ... arguably there is an Amazon-specific library for Python out there with more advanced features, especially more MTurk and S3 stuff, but this generic multiple-host library had the easiest interface for what I needed to get done.

(Jul 28 '10 at 11:57) th0ma5


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.