Single Node Architecture:
The CPU communicates with the hard disk and the memory.
Data is processed by the CPU.
Multi-Node Architecture:
Each node communicates with shared storage (a database), while its CPU and memory sit elsewhere. Communication between the compute nodes and the storage happens over the network, and this communication is slow.
To avoid this we use distributed computation with a distributed file system such as HDFS. In the Hadoop architecture each node has its own local hard disk, and HDFS resides on top of these disks; thus the data resides in HDFS, spread across the hard disks attached to different nodes. Each node can access the portion of the data sitting on its own local hard disk, and that access is very fast. Accessing data on another node's hard disk, however, takes much longer.
HADOOP:
When a job is submitted to the Hadoop system, it is divided into multiple tasks, and each task is assigned to a node. Task distribution and data distribution go hand in hand: each task is placed on the node where its data already resides, so data access becomes fast. Task 1's data sits on node 1, task 2's data sits on node 2, and so on.
This co-location of tasks and data reduces the need for network communication to transfer data. Each node acts on its data locally, which is faster than a traditional parallel computing architecture.
Note that the data is replicated. For example, if task 1 and task 2 both need the same data, each will have its own copy stored locally. If node 2 is down, and the data for task 2 is also available on node 1's hard disk, task 2 is assigned to node 1 and processed there. All nodes are connected through the network, so the data of one node can be accessed from other nodes.
In big data, when data is distributed over multiple nodes, the failure of individual nodes is a natural phenomenon. A single machine fails roughly once every thousand days, so with 1,000 machines about one machine fails every day, and with 10,000 machines about ten machines fail every day (10,000 machines x 1/1,000 failures per machine-day = 10 failures per day).
Normally, when a query is run on an application server to fetch data, the query travels to the database and the results (the data) travel back over the communication channel. For big data, this communication can get really slow.
Therefore, a MapReduce system is used in big data architectures. The client submits a job to the master node. The master node distributes it to multiple nodes in the form of smaller tasks (components of the larger program). Computation happens on multiple nodes, and the results from each node are sent back to the master node. The first part, distributing the tasks to multiple nodes, is called mapping; combining the results together is called reducing.
The MapReduce process sequentially reads a lot of data. The map stage extracts the things you care about; sort and shuffle happen internally; and at the reduce stage the results from multiple nodes undergo aggregation, summarization, filtering, or transformation.
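As a rough sketch, the whole flow can be simulated in plain Python on one machine; the word-count task and the function names below are only for illustration, not part of Hadoop itself:

    # Minimal single-process simulation of the MapReduce flow.
    from collections import defaultdict

    def map_phase(line):
        # Map: emit a (key, value) pair for each word in the line.
        for word in line.split():
            yield (word.lower(), 1)

    def shuffle_phase(pairs):
        # Sort and shuffle: group all values belonging to the same key.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: aggregate all values for one key.
        return (key, sum(values))

    lines = ["the owl saw the cat", "the cat saw the dog"]
    pairs = [pair for line in lines for pair in map_phase(line)]
    counts = [reduce_phase(k, v) for k, v in shuffle_phase(pairs).items()]
    print(sorted(counts))  # [('cat', 2), ('dog', 1), ('owl', 1), ('saw', 2), ('the', 4)]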
Example: to find the maximum of 1000 integers, the master node gives a smaller task to each of four nodes, A, B, C, and D. Each node computes the maximum of its share and sends it back to the master node. The reduce step then takes the maximum of the four partial maxima (Ma, Mb, Mc, Md) and produces the final maximum. Similarly, if the mean is computed through A, B, C, and D, each node returns its own local mean along with the count of numbers it averaged over. After the sort-and-shuffle stage, the reduce stage computes the overall mean with a weighted-average formula. This example illustrates that the reduce stage must be adapted to the distribution (mapping) that we did; otherwise the results would not match what a single machine would have computed.
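A small sketch of these two reduce steps (the partial results from A, B, C, D are made-up numbers):

    # Reduce for the maximum: just the max of the partial maxima Ma..Md.
    partial_maxima = [911, 874, 998, 765]   # illustrative Ma, Mb, Mc, Md
    print(max(partial_maxima))               # 998

    # Reduce for the mean: each node returns (local_mean, count).
    partials = [(4.0, 250), (6.0, 250), (5.0, 300), (3.0, 200)]  # nodes A-D

    # Naive reduce: averaging the means ignores the unequal counts.
    naive = sum(m for m, _ in partials) / len(partials)
    print(naive)  # 4.5 - wrong in general

    # Correct reduce: weight each local mean by its count.
    total = sum(c for _, c in partials)       # 1000 numbers overall
    mean = sum(m * c for m, c in partials) / total
    print(mean)   # (4*250 + 6*250 + 5*300 + 3*200) / 1000 = 4.6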
Example: work distribution over a document, for applications like text analytics, author identification, sentiment analysis, and keyword search. For more complex operations like these we may need many mappers. For example, three mapper programs can be linked to three different reducer programs: each mapper looks for, say, "dog", "cat", and "owl" respectively, producing three different reduced outputs from the same document. A reducer is responsible for aggregating a key (or a set of keys) to produce an output. The job of the sort-and-shuffle step is to send the results for the same key to the same reducer program.
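Routing the same key to the same reducer is typically a simple hash rule; a sketch is below (the modulo scheme mirrors Hadoop's default HashPartitioner, but the function name is made up):

    # Sort and shuffle routes each key to one reducer. Hashing the key
    # modulo the reducer count guarantees that, within a run, the same
    # key always lands on the same reducer.
    NUM_REDUCERS = 3

    def reducer_for(key):
        return hash(key) % NUM_REDUCERS  # illustrative; Hadoop uses the key's hashCode

    for key in ["dog", "cat", "owl", "dog"]:
        print(key, "-> reducer", reducer_for(key))
    # "dog" maps to the same reducer both times.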
A programmer writes two programs, a mapper and a reducer, that operate on key-value pairs. All the values for the same key are reduced together. The rest of the MapReduce details are taken care of by the Hadoop framework or the Spark infrastructure, whichever you choose. In the case of the Hadoop infrastructure, the initial input and the final output are written to the distributed file system; all the reducers are allowed to write their data into HDFS. The intermediate results at the intermediate stages are written to the local file system of each node. The master node is responsible for coordination and checks the health of each node. When a task is completed, it informs the master of the location of its intermediate file so that a reducer can subsequently pick it up. In the case of Hadoop, this master role is associated with the HDFS NameNode (the master node).
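For instance, with Hadoop Streaming the two programs can be plain scripts that read standard input and print tab-separated key-value lines; a minimal word-count pair is sketched below (file names are illustrative, and the exact job-submission command depends on the installation):

    # --- mapper.py: reads raw lines, emits (key, value) pairs ---
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

    # --- reducer.py: receives lines sorted by key, so all values for
    # one key arrive together and a running sum per key is enough ---
    import sys
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")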
Dealing with Failure:
When a mapper or reducer node fails, its task is re-assigned to another worker or reducer node respectively. However, if the master node fails, the whole job is aborted.
Backups:
If a specific worker node is slow, it will slow down the whole job. Therefore, a backup copy of the task is started on another node and a competition is created: whichever node responds first, its results are grabbed and taken to the next stage.
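A toy model of this race using Python threads (the timings and node names are made up; in Hadoop this mechanism is known as speculative execution):

    # Run two copies of the same task and keep whichever finishes first.
    import time
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def task(node, delay):
        time.sleep(delay)          # simulate a slow vs. fast node
        return f"result from {node}"

    with ThreadPoolExecutor() as pool:
        primary = pool.submit(task, "slow node", 2.0)
        backup = pool.submit(task, "backup node", 0.5)
        done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
        print(done.pop().result())  # "result from backup node"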
It is preferred to have as many mappers as possible, perhaps one mapper per line of input. The number of reducers is limited by the number of unique keys, such as the number of distinct words in the English language. Therefore, the number of mappers can be very large compared to the number of reducers. The counts of mappers and reducers are determined by the system.
Combiner:
Typically there are multiple mapper tasks running on a single machine. Each of these mappers needs to emit output for the reducers to consume, and this adds overhead. A combiner (a local reducer) runs on a single node: it combines the outputs of the mapper tasks running on that same machine before they are shuffled. The combiner is an internal program that the MapReduce framework runs; it acts like a reducer program running on the local machine (the worker node).
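A sketch of what the combiner buys us, reusing the word-count idea (the data is made up):

    # Combiner sketch: each machine pre-aggregates its own mapper output
    # before anything is shuffled over the network.
    from collections import Counter

    # Raw (key, value) pairs produced by the mappers on one machine:
    local_pairs = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]

    # Combiner = local reduce: 4 pairs shrink to 2 before the shuffle.
    combined = Counter()
    for key, value in local_pairs:
        combined[key] += value
    print(dict(combined))  # {'the': 3, 'cat': 1} - less network traffic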
Therefore, the four stages of a MapReduce program are:
map, combine, sort and shuffle, and reduce.
It is easy to join two tables: the system will do it for us using a MapReduce join, with the help of the sort-and-shuffle step. Joining is much more efficient in Hadoop than in a traditional RDBMS.
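A sketch of such a reduce-side join (the table contents are made up): the mappers tag each record with its source table and emit the join key, and the reducer matches the tagged records per key.

    # Reduce-side join sketch. Sort-and-shuffle brings both tables'
    # rows for one join key to the same reducer.
    from collections import defaultdict

    users = [(1, "alice"), (2, "bob")]              # (user_id, name)
    orders = [(1, "book"), (1, "pen"), (2, "ink")]  # (user_id, item)

    shuffled = defaultdict(lambda: {"users": [], "orders": []})
    for uid, name in users:
        shuffled[uid]["users"].append(name)         # map + tag: table "users"
    for uid, item in orders:
        shuffled[uid]["orders"].append(item)        # map + tag: table "orders"

    # Reduce: cross-match the two tagged lists per join key.
    for uid, groups in sorted(shuffled.items()):
        for name in groups["users"]:
            for item in groups["orders"]:
                print(uid, name, item)
    # 1 alice book / 1 alice pen / 2 bob ink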