Map-reduce is great. It has made it possible to process insane amounts of data on commodity hardware. However, it is a very low-level programming abstraction, too low for most problems that analysts and “data scientists” encounter.
M-R is the assembly programming of big data. It is vital as the base level of the stack. But just as assembly is unproductive for general programming compared to Python, Ruby or <your-favorite-high-level-language>, M-R is too low-level for doing significant analysis work.
Pig and Cascading (and other tools that build on top of M-R) are built around constructs that match what analysts need to do:
- load complex data
- join multiple data sets
- filter rows
- project out columns
- aggregate based on columns
- apply functions to aggregates
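To make the list concrete, here is a minimal sketch of those six operations in plain Python on a toy data set. This is not Pig or Cascading syntax; the data, the column names and the helpers `load`, `join` and `analyze` are all made up for illustration. The point is only that each step is a single, readable operation when the language gives it to you directly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data, invented for this illustration.
USERS_CSV = """user_id,name,country
1,ann,US
2,bob,DE
3,cat,US
"""

ORDERS_CSV = """order_id,user_id,amount
10,1,20.0
11,1,5.0
12,2,7.5
13,3,40.0
"""


def load(text):
    """Load: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def join(left, right, key):
    """Join: inner join of two row lists on a shared key column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def analyze():
    users = load(USERS_CSV)
    orders = load(ORDERS_CSV)

    # Join multiple data sets.
    joined = join(orders, users, "user_id")

    # Filter rows: keep orders of at least 6.0.
    kept = [r for r in joined if float(r["amount"]) >= 6.0]

    # Project out columns: only country and amount are needed.
    projected = [(r["country"], float(r["amount"])) for r in kept]

    # Aggregate based on a column: group amounts by country.
    groups = defaultdict(list)
    for country, amount in projected:
        groups[country].append(amount)

    # Apply a function to each aggregate: the mean order amount.
    return {c: sum(v) / len(v) for c, v in groups.items()}


if __name__ == "__main__":
    print(analyze())
```

A higher-level data flow language expresses each of these steps as a built-in construct and plans the underlying M-R jobs for you.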
Very few non-trivial analysis problems map effortlessly onto the map-reduce model. Most problems will require many M-R stages, each with its own mapper and reducer and with the output of one stage wired into the next. This can make for brittle code that is hard to maintain. It might seem like you are saving effort by keeping the stack simple and using raw M-R or streaming through Python, but productivity will usually suffer.
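As a rough illustration of how quickly the stages pile up, the sketch below simulates M-R in memory and computes "total order amount per country". It is a toy, not Hadoop: `run_mr`, the tagged tuples and every function name are assumptions made for the example. Even this small question needs two stages, a reduce-side join followed by a group-and-sum.

```python
from collections import defaultdict


def run_mr(records, mapper, reducer):
    """In-memory stand-in for one M-R stage: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


# Hypothetical input: records from two data sets, tagged by origin.
USERS = [("user", 1, "US"), ("user", 2, "DE"), ("user", 3, "US")]
ORDERS = [("order", 1, 20.0), ("order", 1, 5.0),
          ("order", 2, 7.5), ("order", 3, 40.0)]


# Stage 1: reduce-side join on user_id.
def join_map(record):
    tag, user_id, payload = record
    yield user_id, (tag, payload)


def join_reduce(user_id, values):
    countries = [p for t, p in values if t == "user"]
    amounts = [p for t, p in values if t == "order"]
    for country in countries:
        for amount in amounts:
            yield (country, amount)


# Stage 2: group by country and sum the amounts.
def sum_map(record):
    country, amount = record
    yield country, amount


def sum_reduce(country, amounts):
    yield (country, sum(amounts))


def total_by_country():
    joined = run_mr(USERS + ORDERS, join_map, join_reduce)
    return dict(run_mr(joined, sum_map, sum_reduce))


if __name__ == "__main__":
    print(total_by_country())
```

Notice that the intent (join, then sum by country) is buried in tagging tricks and plumbing between stages. Add a filter, a second join or a different aggregate and the hand-wired pipeline has to be reworked stage by stage.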