|
Russell Jurney writes about the joy of using Pig, a high-level abstraction for Hadoop (MapReduce). "Pig is the duct tape of Big Data." Under what circumstances will one save time using Pig under Hadoop? In what circumstances is Pig inappropriate? |
|
I've never run into a situation in which Pig wasn't the answer to a Hadoop problem. I often run into situations where the answer is Pig + Python streaming, however. There may be instances in which writing highly optimized Java MapReduce code is called for, but there aren't many people who can write such code better than Pig can. If you prefer to code everything in Java or LISP, Cascading/Cascalog are good options. I am not such a person :)
What do you mean by Pig + Python streaming?
(Jul 02 '10 at 03:28)
Joseph Turian ♦♦
@joseph He means that Python scripts are used as mappers, with Pig passing the data to them.
(Jul 03 '10 at 00:25)
DirectedGraph
I agree with the Pig/Python scripting, though for me it's Ruby :) It's far easier to stream than to use custom UDFs, and though you might lose some performance, it makes no difference to the scalability.
(Jul 03 '10 at 01:06)
mat kelcey
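For readers unfamiliar with the Pig + Python streaming setup discussed in the comments above, here is a minimal sketch of what the Python side might look like. It assumes Pig's default tab-delimited streaming format; the script name `clean_urls.py`, the relation `logs`, and the two-field record layout are hypothetical, chosen only for illustration.

```python
#!/usr/bin/env python
"""Hypothetical streaming script: reduce a URL field to its host name.

Pig side, shown for context only (names are assumptions):
    DEFINE clean `clean_urls.py` SHIP('clean_urls.py');
    cleaned = STREAM logs THROUGH clean AS (user:chararray, host:chararray);
"""
import sys
from urllib.parse import urlparse


def transform(line):
    """Turn one 'user<TAB>url' record into 'user<TAB>host'.

    Returns None for records that should be dropped (wrong field
    count, or a URL with no recognizable host).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        return None
    user, url = fields
    host = urlparse(url).netloc.lower()
    if not host:
        return None
    return "\t".join((user, host))


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Pig feeds one record per line on stdin and reads records from stdout.
    for line in stdin:
        out = transform(line)
        if out is not None:
            stdout.write(out + "\n")


if __name__ == "__main__":
    main()
```

Keeping the per-record logic in a plain function such as `transform` means it can be exercised locally without a cluster, for example by piping a sample file through the script.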
|
|
Pig is also not Turing complete, so for complex flows you need to embed it in external procedural code. It also has a history of being compatible with only a few versions of Hadoop (notably not the versions I was using), which can inhibit Hadoop upgrades. The unit testing issue that Olivier brought up is also very salient, although that same problem hasn't prevented rampant use of SQL. Pig does offer the ILLUSTRATE command, which helps enormously. Pig also doesn't have any decent way of packaging and abstracting computations. That means, for instance, that if you have a fancy computation such as a log parser or co-occurrence counter, you essentially can't reuse it except by assuming that its output will always be in a known file or by copying and pasting the code into your new program. And if you do copy and paste the code, you run the risk of variable collisions because Pig has no scoping rules. For tiny programs this is fine, but it does leave you without good compositionality. All that being said, I had good luck with Pig the one time I got to use it in earnest. I have also heard from many people who have had very good results with Cascading. Much more interesting to me is Google's recent description of their FlumeJava project. See http://www.deepdyve.com/lp/association-for-computing-machinery/flumejava-easy-efficient-data-parallel-pipelines-wwPgFt2hWB for details. What I would |
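To illustrate what "embedding Pig in external procedural code" can look like for an iterative flow, here is a rough sketch of a Python driver that reruns a Pig script until some condition holds. The script name, directory layout, `INPUT`/`OUTPUT` parameter names, and the `converged` check are all hypothetical; the only Pig features assumed are the `-param` and `-f` command-line options and `$NAME` parameter substitution inside the script.

```python
import subprocess


def pig_command(script, params):
    """Build a `pig` command line, passing each parameter with -param."""
    cmd = ["pig"]
    for name, value in sorted(params.items()):
        cmd += ["-param", "%s=%s" % (name, value)]
    cmd += ["-f", script]
    return cmd


def run_iterations(script, base_dir, max_iters, converged,
                   run=subprocess.check_call):
    """Run `script` repeatedly, feeding each iteration's output to the next.

    `converged(output_dir)` is a caller-supplied check (hypothetical).
    `run` is injectable so the loop can be exercised without Hadoop.
    Returns the number of iterations actually run.
    """
    for i in range(max_iters):
        params = {
            "INPUT": "%s/iter_%d" % (base_dir, i),
            "OUTPUT": "%s/iter_%d" % (base_dir, i + 1),
        }
        run(pig_command(script, params))
        if converged(params["OUTPUT"]):
            return i + 1
    return max_iters
```

The loop and the stopping test live in Python because Pig itself has no control flow for them; each pass is still an ordinary Pig job.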
|
Pig is really good for ad-hoc exploratory work on a pile of data. However, it currently lacks a unit/regression testing framework, so the maintainability of a set of big Pig scripts looks frightening to me. Cascalog, a first-class DSL in Clojure, looks more appealing to me: it can blend into a standard Clojure test harness while still being easy to use interactively in a REPL for exploratory work. |
|
A good set of Pig benchmarks is maintained here: http://wiki.apache.org/pig/PigMix As for when to avoid Pig: anything complicated and not related to relational/log data. |