Monday, April 11, 2016

MapReduce Basics

MapReduce is a processing technique and a programming model for distributed computing, typically implemented in Java. MapReduce comprises two important tasks: Map and Reduce. The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs). The Reduce task takes the output from a map as its input and combines those tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple nodes. 

Given below is the basic flow of a MapReduce job.

A MapReduce program executes mainly in three stages: the map stage, the shuffle stage, and the reduce stage.

Map Stage : The map or mapper's job is to process the input data. Generally the input is in the form of a file or directory and is stored in HDFS. The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
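For illustration, a minimal mapper sketch in the style of the classic Hadoop word-count example might look like the following (the class and variable names here are illustrative, not part of any standard beyond the Hadoop Mapper class):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Receives one line of the input file per call and emits <word, 1> pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}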

Reduce Stage : This stage is the combination of the shuffle stage and the reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
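Continuing the word-count sketch above, a matching reducer could look roughly like this (again a minimal illustration, not a complete application):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <word, [1, 1, ...]> after the shuffle and emits <word, total count>.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        result.set(sum);
        context.write(word, result);
    }
}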

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data processing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. Most of the computing takes place on nodes that hold the data on local disk, which reduces the network traffic.

After completion of the tasks, the cluster collects and reduces the data to form an appropriate result and sends it back to the Hadoop server.
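To tie the sketches above together, a job would typically be configured and submitted by a small driver class along these lines (the input and output paths and the driver class name are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output live in HDFS; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}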


From a Java perspective, MapReduce sees the input as a set of key-value pairs.
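For the word-count sketch above, the key-value types flow roughly as follows:

// input to map     : <byte offset of the line, line of text>
// output of map    : <word, 1>
// input to reduce  : <word, list of 1s>   (grouped together by the shuffle stage)
// output of reduce : <word, total count>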

Below are a few common terms used with MapReduce.

Payload : Applications implement the Map and Reduce functions, and form the core of the job.
Mapper : Maps the input key-value pairs to a set of intermediate key-value pairs.
NameNode : Node that manages the Hadoop Distributed File System (HDFS).
DataNode : Node where the data is present in advance, before any processing takes place.
MasterNode : Node where the JobTracker runs and which accepts job requests from clients.
SlaveNode : Node where the Map and Reduce programs run.
JobTracker : Schedules jobs and tracks the jobs assigned to the TaskTracker.
TaskTracker : Tracks the tasks and reports status to the JobTracker.
Job : An execution of a Mapper and Reducer across a dataset.
Task : An execution of a Mapper or a Reducer on a slice of data.
Task Attempt : A particular instance of an attempt to execute a task on a SlaveNode.


