MapReduce is a programming framework popularized by Google and used to simplify data processing across massive data sets. As people rapidly increase their online activity and digital footprint, organizations are finding it vital to quickly analyze the huge amounts of data their customers and audiences generate to better understand and serve them. MapReduce is the tool that is helping those organizations.
Most enterprises deal with multiple types of data (text, rich text, rdbms, graph, etc…) and need to process all this data quickly and efficiently to derive meaningful insights that bring business value to the organization.
With MapReduce, computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
There are two fundamental pieces of a MapReduce query:
Map step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
Reduce step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output – the answer to the problem it was originally trying to solve.