
Hadoop has a pseudo-distributed configuration, which (I assume) treats different cores as different machines. How useful is it compared with a similar number of distributed machines?

For example, consider an Amazon EC2 small instance, which is a single machine. What if I used a large machine with 8 virtual cores instead? The price is nearly 8 times higher, but will having 8 cores on two machines be more beneficial than having 8 different machines?

Is a considerable speedup achieved by running Hadoop in pseudo-distributed mode?

asked Jul 03 '10 at 21:05


DirectedGraph


One Answer:

In general, there are massive speedup gains to be obtained by using independent disks. The aggregate I/O throughput of 8 independent disks working in parallel is much higher than that of a single disk. So if you have (say) 8 mappers all hitting the same disk at once, overall progress will be very slow.
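
To put rough numbers on the disk argument, here is a back-of-the-envelope sketch. The input size (80 GB) and per-disk throughput (100 MB/s) are made-up illustrative figures, not measurements, and the function name `scan_seconds` is hypothetical. It also assumes the single disk keeps its full sequential throughput when 8 mappers share it, which is optimistic: concurrent readers on a spinning disk tend to cause extra seeking.

```python
def scan_seconds(total_mb, n_disks, mb_per_s_per_disk):
    """Time to read total_mb once when the data is spread evenly over n_disks."""
    # Aggregate throughput is assumed to scale linearly with the number of disks.
    return total_mb / (n_disks * mb_per_s_per_disk)


TOTAL_MB = 80_000      # assumed input size (about 80 GB)
DISK_MB_PER_S = 100    # assumed sequential throughput of one disk

one_disk = scan_seconds(TOTAL_MB, 1, DISK_MB_PER_S)     # 8 mappers sharing one disk
eight_disks = scan_seconds(TOTAL_MB, 8, DISK_MB_PER_S)  # 8 machines, one disk each

print(f"one shared disk: {one_disk:.0f} s")
print(f"eight disks:     {eight_disks:.0f} s")
```

Under these assumptions the 8 mappers on one box are limited by what a single disk can deliver, no matter how many cores are available.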

answered Jul 05 '10 at 17:27


Delip Rao

But is disk read time really the limiting factor?

(Jul 05 '10 at 23:53) DirectedGraph

Good question! It depends on your map-reduce job. If your mappers are CPU-bound for a long time compared to the time taken to read key-value pairs, then no. But most map-reduce jobs spend very little time in the actual mapper (or reducer). Also, it's good practice to write mappers that finish quickly. So yes, disk read/write time can become a limiting factor.
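
One way to check which case a given job falls into is a crude model like the sketch below. Everything in it is an assumption for illustration: the function name `map_phase_bottleneck`, the per-megabyte CPU costs, and the simplification that reading and computing overlap perfectly so the slower of the two sets the wall time.

```python
def map_phase_bottleneck(total_mb, n_cores, n_disks, disk_mb_per_s, cpu_s_per_mb):
    """Return (estimated seconds, 'disk' or 'cpu') for a map phase.

    Assumes one map task per core, data spread evenly over the disks, and
    reads that overlap with computation (the slower side dominates).
    """
    io_seconds = total_mb / (n_disks * disk_mb_per_s)   # all tasks share the disks
    cpu_seconds = total_mb * cpu_s_per_mb / n_cores     # CPU work split across cores
    if io_seconds >= cpu_seconds:
        return io_seconds, "disk"
    return cpu_seconds, "cpu"


# A cheap mapper (assumed 1 ms of CPU per MB) on 8 cores sharing one disk.
light = map_phase_bottleneck(80_000, n_cores=8, n_disks=1,
                             disk_mb_per_s=100, cpu_s_per_mb=0.001)

# An expensive mapper (assumed 0.5 s of CPU per MB) on the same hardware.
heavy = map_phase_bottleneck(80_000, n_cores=8, n_disks=1,
                             disk_mb_per_s=100, cpu_s_per_mb=0.5)

print("light mapper limited by:", light[1])
print("heavy mapper limited by:", heavy[1])
```

With the cheap mapper the disk is the limit, so extra cores on one machine buy little; with the expensive mapper the cores are the limit and a single multi-core box looks much better.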

(Jul 06 '10 at 00:33) Delip Rao

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.