The music metadata matching we do as part of Operation Omar on the Popyoular platform includes a couple of stages that involve some heavy data lifting. One of them requires us to import data about a huge number of individual items from more than one source, and to perform a few crucial operations on each item as it is imported. We tried a few different approaches, but ended up frustrated by the sheer time a complete import would take. With the number of items in the millions, fast crunching can only do so much as long as everything passes through a single queue.
While these complete imports don't need to be repeated very often, we still found the time involved unacceptable: any mistake, or any need for adjustments, would set our overall progress back far too much.
This problem was practically begging us to have a closer look at MapReduce and Hadoop. What we had was a massive number of similarly structured items on which we needed to perform the same basic operations. The ability to distribute the work and run it in parallel seemed perfect. And it was: after implementing a solution on Hadoop, a full import went from taking days to taking minutes. Wow!
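To make the fit concrete, here is a minimal single-machine sketch of the MapReduce shape in plain Java (not our actual import code — the records, the `artist;title` format, and the toy counting operation are made up for illustration). A map step emits a key/value pair per item, and a shuffle/reduce step groups the pairs by key and combines them; Hadoop does the same thing, but partitioned across many machines.

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    // "Map" step: emit an (artist, 1) pair for each "artist;title" record.
    static Map.Entry<String, Integer> toPair(String record) {
        return Map.entry(record.split(";")[0], 1);
    }

    // "Shuffle + reduce" step: group the pairs by key and sum the values.
    // On Hadoop the grouping and summing run distributed over the cluster.
    static Map<String, Integer> countByArtist(List<String> records) {
        return records.parallelStream()
            .map(MapReduceSketch::toPair)
            .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> records = List.of(
            "Miles Davis;So What",
            "Miles Davis;Blue in Green",
            "Nina Simone;Feeling Good");
        System.out.println(countByArtist(records).get("Miles Davis")); // 2
    }
}
```

Because each record is mapped independently and the reduce only needs pairs sharing a key, the work parallelizes with no coordination between items — which is exactly why our millions of independent imports were such a good match.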
Raw MapReduce is, admittedly, sometimes hard to wrap your head around, being quite the low-level framework that it is. The data crunching described here was actually implemented using Cascading, a library that sits on top of Hadoop and lets you define your distributed processes at a higher level, without having to "think in MapReduce".
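The shift in mindset is from map/reduce phases to a chain of per-item operations. Cascading expresses this with its own classes (pipes and flows wired to Hadoop); as a rough plain-Java analogy — not Cascading's actual API, and with made-up normalization steps — composing a process out of small operations looks like this:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipeSketch {
    // Each "pipe" here is just a composable per-item operation; in Cascading
    // the analogous building blocks are wired into a flow that Hadoop runs.
    static final Function<String, String> normalize =
        s -> s.trim().toLowerCase();
    static final Function<String, String> stripLeadingThe =
        s -> s.startsWith("the ") ? s.substring(4) : s;

    // Compose the steps into one pipeline and apply it to every item.
    static List<String> run(List<String> artists) {
        return artists.stream()
            .map(normalize.andThen(stripLeadingThe))
            .collect(Collectors.toList());
    }
}
```

The appeal is that you describe *what* happens to each item in sequence, and the framework decides how to translate that into map and reduce jobs behind the scenes.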