Map-reduce is great. It has made it possible to process insane amounts of data on commodity hardware. However, it is a very low-level programming abstraction, too low-level for most problems that analysts and “data scientists” encounter.
M-R is the assembly programming of big data. It is vital as the base level of the stack. But just as assembly is unproductive for general programming compared to Python, Ruby or <your-favorite-high-level-language>, M-R is too low-level for significant analysis work.
Pig and Cascading (and other tools that build on top of M-R) are built with constructs that match what analysts need to do:
- load complex data
- join multiple data sets
- filter rows
- project out columns
- aggregate based on columns
- apply functions to aggregates
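To make the list above concrete, here is a minimal sketch of those operations written by hand in plain Python. It is not Pig or Cascading code. The `users` and `orders` data sets, their field names, and the `revenue_by_country` function are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical data sets, invented for this example ("load complex data").
users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
    {"user_id": 3, "country": "US"},
]
orders = [
    {"user_id": 1, "amount": 30.0, "status": "paid"},
    {"user_id": 1, "amount": 12.5, "status": "refunded"},
    {"user_id": 2, "amount": 20.0, "status": "paid"},
    {"user_id": 3, "amount": 50.0, "status": "paid"},
]


def revenue_by_country(users, orders):
    # Join: attach each order to its user's country (inner join on user_id).
    country_of = {u["user_id"]: u["country"] for u in users}
    joined = [
        dict(o, country=country_of[o["user_id"]])
        for o in orders
        if o["user_id"] in country_of
    ]

    # Filter rows: keep only paid orders.
    paid = [row for row in joined if row["status"] == "paid"]

    # Project out columns: keep only country and amount.
    projected = [(row["country"], row["amount"]) for row in paid]

    # Aggregate based on a column: group amounts by country.
    groups = defaultdict(list)
    for country, amount in projected:
        groups[country].append(amount)

    # Apply functions to the aggregates: total and mean per group.
    return {
        country: {"total": sum(amounts), "mean": sum(amounts) / len(amounts)}
        for country, amounts in groups.items()
    }


print(revenue_by_country(users, orders))
```

Each step here is one or two lines because the data fits in memory. Higher-level tools aim to keep it that short when the data does not.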
Very few non-trivial analysis problems map effortlessly onto the map-reduce model. Most problems require a chain of M-R stages, and wiring those stages together by hand makes for brittle code that is hard to maintain. It might seem like you are saving effort by keeping the stack simple and using raw M-R or streaming through Python, but productivity will usually suffer.
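As a rough illustration of the multi-stage point, here is a toy sketch of "total order amount per country" written as raw map-reduce. The `run_stage` helper is a hypothetical single-process stand-in for one M-R job, not Hadoop's API. The record shapes and function names are also assumptions made for the example. Even this small question needs two stages: a reduce-side join, then a grouping.

```python
from collections import defaultdict


def run_stage(records, mapper, reducer):
    """Toy stand-in for one M-R job: map, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


# Stage 1: reduce-side join of users and orders on user_id.
def join_mapper(record):
    source, row = record
    if source == "users":
        yield row["user_id"], ("country", row["country"])
    else:
        yield row["user_id"], ("amount", row["amount"])


def join_reducer(user_id, values):
    # Values from both inputs arrive mixed together, so untangle them by tag.
    countries = [v for tag, v in values if tag == "country"]
    amounts = [v for tag, v in values if tag == "amount"]
    for country in countries:
        for amount in amounts:
            yield (country, amount)


# Stage 2: group the joined rows by country and sum the amounts.
def sum_mapper(record):
    country, amount = record
    yield country, amount


def sum_reducer(country, amounts):
    yield (country, sum(amounts))


def total_by_country(users, orders):
    # Tag each row with its source so stage 1 can tell the inputs apart.
    tagged = [("users", u) for u in users] + [("orders", o) for o in orders]
    joined = run_stage(tagged, join_mapper, join_reducer)
    return run_stage(joined, sum_mapper, sum_reducer)
```

The tagging, the untangling in `join_reducer`, and the hand-off between stages are plumbing that has nothing to do with the analysis itself. That is the kind of code a higher-level layer is meant to take off your hands.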
Have you played with Clojure/Cascalog yet? It’s my new favorite meta-MR language.
I have looked at them (as well as Incanter). The syntax is a bit unfamiliar, but I like the concept. I will probably try it, since it requires basically zero configuration change (just include a jar). Any hints as to why you love it?