MapReduce and Its Implementation: A Review

Social media has become very popular in everyday life, and as a result databases have grown enormously. Many enterprises are therefore shifting their analytical databases to the cloud, away from high-end proprietary machines and toward cheaper commodity solutions. In this setting, the MapReduce model has come to prominence, providing better scalability, fault tolerance, and flexibility in handling unstructured analytical data. This paper explores the concept of MapReduce along with the Hadoop framework and various implementations built on top of Hadoop, to provide a better understanding of the current state of the art.
General Terms
Hadoop, MapReduce, Big Data
Keywords
Hadoop, MapReduce, Big Data, Mahout, Hive, MapRedoop
1. Introduction
In today's era of 'Big Data', characterized by an unprecedented volume of data, velocity of data generation, and variety of data structure, support for large-scale data analytics constitutes a particularly challenging task. To address the scalability requirements of today's data analytics, parallel shared-nothing architectures of commodity machines (often consisting of thousands of nodes) have lately been established as the de facto solution. Various systems have been developed, mainly by industry, to support Big Data analysis, including Google's MapReduce [1] [2], Yahoo's PNUTS [3], Microsoft's SCOPE [4], Twitter's Storm [5], LinkedIn's Kafka [6], and WalmartLabs' Muppet [7]. Several companies, including Facebook, both use and have contributed to Apache Hadoop and its ecosystem [8] [10], the most popular open-source implementation of the MapReduce framework proposed by Google. MapReduce is the heart of Hadoop: a programming paradigm that allows massive scalability across hundreds or thousands of servers in a Hadoop cluster [11]. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions. The term 'MapReduce' refers to two procedures: a Map procedure that performs filtering and sorting, and a Reduce procedure that performs a summary operation [12] [13]. The main advantage of MapReduce is that a large variety of problems are easily expressible as MapReduce computations.

Google was at one point running over one hundred thousand MapReduce jobs per day, processing over 20 PB of data [7]. By 2010, Google had created over ten thousand distinct MapReduce programs performing a variety of functions, including large-scale graph processing and text processing. The Hadoop Distributed File System (HDFS) and the Google File System (GFS) share common design goals: both target data-intensive computing applications where massive data files are common [7]; both are optimized in favour of high sustained bandwidth instead of low latency, to better support batch-processing workloads; and both run on clusters built with commodity hardware components where failures are common, motivating the inclusion of built-in fault-tolerance mechanisms through replication.
In both systems, the file system is implemented by user-level processes running on top of a standard operating system (in the case of GFS, Linux). A single GFS master server running on a dedicated node coordinates storage resources and manages metadata. Multiple slave servers (chunk servers, in Google parlance) store data in the cluster in the form of large blocks (chunks), each identified by a 64-bit ID. Files are saved by the chunk servers on local disk as native Linux files and accessed by chunk ID and offset within the chunk [14]. Both HDFS and GFS use the same default chunk size (64 MB) to reduce the amount of metadata needed to describe massive files, and to allow clients to interact less often with the single master. Finally, both use a similar replica placement policy that saves copies of data on the local node, on the same rack, and on a remote rack, to provide fault tolerance and improve performance.
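As a back-of-the-envelope illustration of why a large chunk size shrinks the master's metadata (a sketch; the 64 MB figure comes from the text above, while the file sizes are hypothetical):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB: the default chunk/block size in GFS and HDFS

def num_chunks(file_size_bytes: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks (and thus master metadata entries) needed for a file."""
    return -(-file_size_bytes // chunk_size)  # ceiling division

one_tb = 1024 ** 4
# A 1 TB file needs 16,384 chunk entries at 64 MB per chunk,
# versus 262,144 entries if the chunk size were only 4 MB.
print(num_chunks(one_tb))
print(num_chunks(one_tb, 4 * 1024 * 1024))
```

Fewer chunks per file means less metadata on the single master and fewer client-master interactions, at the cost of wasted space for small files.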
In this paper, we discuss the implementation of MapReduce and its software framework, along with different software extensions of Hadoop. Section 2 describes MapReduce and its components. Various benefits and properties of Hadoop are discussed in Section 3. In Section 4, we discuss extensions built on top of Hadoop. Finally, Section 5 concludes the paper.
2. MapReduce
Data processing in MapReduce is performed in three basic operations: Map, Partition, and Reduce. MapReduce distributes data across the nodes of a cluster on a shared-nothing basis. First, each node in the cluster performs a parallel execution of a set of Map tasks without sharing any data with other nodes. Next, the data is partitioned/repartitioned across all nodes of the cluster. Finally, a parallel execution of a set of Reduce tasks is performed by each node on the partition it receives. This process can be repeated for an arbitrary number of additional Map-Partition-Reduce cycles as necessary. MapReduce does not statically determine a detailed query execution plan that specifies which task executes on which node; instead, this is determined at runtime. This allows MapReduce to cope with node failures and slow nodes by assigning more tasks to faster nodes and reassigning tasks from failed nodes. MapReduce also checkpoints the output of each Map task to local disk, essentially to minimize the amount of work that has to be re-executed upon a failure. As a result, MapReduce meets the requirements of fault tolerance and the ability to operate in heterogeneous environments particularly well. It achieves fault tolerance by detecting Map tasks of failed nodes and reassigning them to other nodes in the cluster, and it achieves the ability to operate in a heterogeneous environment via redundant task execution: tasks that take a long time to complete on slow nodes are redundantly executed on other nodes that have finished their assigned tasks. The time to complete a task then equals the time for the fastest node to complete the redundantly executed task. By breaking work into smaller tasks, the effect of faults can be minimized.
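The Map-Partition-Reduce cycle described above can be sketched as a single-process simulation (word count is used here as a hypothetical example; a real MapReduce system distributes these phases across nodes and adds the fault-tolerance machinery discussed above):

```python
from collections import defaultdict

def map_fn(_key, line):
    # Map: each input record yields intermediate key-value pairs.
    for word in line.split():
        yield word, 1

def partition_fn(key, num_reducers):
    # Partition: route each intermediate key to one reducer.
    return hash(key) % num_reducers

def reduce_fn(key, values):
    # Reduce: summarize all values seen for one key.
    yield key, sum(values)

def run_mapreduce(inputs, num_reducers=2):
    # Map + Partition phases: build one grouped partition per reducer.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in inputs:
        for k, v in map_fn(key, value):
            partitions[partition_fn(k, num_reducers)][k].append(v)
    # Reduce phase: each reducer walks its partition in sorted key order.
    output = {}
    for part in partitions:
        for k in sorted(part):
            for out_k, out_v in reduce_fn(k, part[k]):
                output[out_k] = out_v
    return output

result = run_mapreduce([(1, "big data map reduce"), (2, "map reduce")])
assert result == {"big": 1, "data": 1, "map": 2, "reduce": 2}
```

The shared-nothing property shows up in the structure: each Map call touches only its own record, and each reducer touches only its own partition.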
2.1 Components of MapReduce
2.1.1 Map
The Map function takes a series of key-value pairs as input, operates on each pair, and produces zero or more output key-value pairs. The input and output types of the Map function may differ from each other.
2.1.2 Partition
Each Map function output is assigned to a particular reducer by the application's partition function, for sharding purposes. The partition function is given the key together with the number of reducers and returns the index of the desired reducer.
A conventional default partition function hashes the key and applies the hash value modulo the number of reducers. It is important, however, to choose a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes; otherwise the MapReduce operation can be held up waiting for slow reducers to finish.
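A minimal sketch of such a hash partitioner (a toy character-sum hash is used for illustration; production frameworks use a stable hash function rather than, say, Python's per-run-randomized built-in `hash` for strings):

```python
def default_partition(key: str, num_reducers: int) -> int:
    # Hash the key, then take it modulo the number of reducers,
    # so every occurrence of the same key goes to the same reducer.
    h = sum(ord(c) for c in key)  # toy stable hash, for illustration only
    return h % num_reducers

# The same key always lands on the same reducer index.
assert default_partition("apple", 4) == default_partition("apple", 4)
print(default_partition("apple", 4))
```

A skewed partition function (e.g. one that sends most keys to reducer 0) would recreate exactly the straggler problem described above.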
Between the Map and Reduce phases, the data is shuffled, i.e. moved from the Map node that generated it to the shard in which it will be reduced. This shuffle can sometimes take longer than the computation itself, depending on network bandwidth, CPU speeds, the amount of data generated, and the time consumed by the Map and Reduce computations.
2.1.3 Comparison
The input for each Reduce is pulled from the machine where the corresponding Map ran and is sorted using the application's comparison function.
2.1.4 Reduce
The framework calls the application's Reduce function once for each unique key, in the sorted order. The Reduce function can iterate through the values associated with that key and produce zero or more outputs.
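Putting the Comparison and Reduce steps together, a hypothetical word-count reducer over sorted intermediate pairs might look like:

```python
from itertools import groupby
from operator import itemgetter

def reduce_fn(key, values):
    # Called once per unique key; emits one (word, total) pair.
    yield key, sum(values)

intermediate = [("map", 1), ("reduce", 1), ("map", 1)]
# Comparison step: sort intermediate pairs by key.
intermediate.sort(key=itemgetter(0))
# Reduce step: group by unique key and reduce each group.
results = [out
           for key, group in groupby(intermediate, key=itemgetter(0))
           for out in reduce_fn(key, [v for _, v in group])]
print(results)  # [('map', 2), ('reduce', 1)]
```

The sort is what makes per-key grouping possible with a single sequential pass, which is why the framework sorts before invoking Reduce.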

3. Hadoop
Hadoop [7][15-17] is a software framework that can be installed on a cluster to allow large-scale distributed data analysis. Little or no hardware customization is needed, beyond meeting the minimum recommended requirements per node. Hadoop was created in 2004 by Doug Cutting and named after his son's stuffed elephant. In January 2008, Hadoop became a top-level Apache Software Foundation project. There have been many contributors from both academia and industry (with Yahoo and Facebook among the leading contributors), and Hadoop has a broad and rapidly growing user community [18][19].
3.1 Benefits of Hadoop
Hadoop provides robust, fault-tolerant behavior through the Hadoop Distributed File System (HDFS), inspired by the Google File System [20], together with a Java-based API that enables parallel processing across the nodes of the cluster using the MapReduce paradigm. Hadoop Streaming, a utility that allows users to create and run jobs with any executable as the mapper and/or the reducer, provides integration for code written in other languages, for instance Python and C. Additionally, Hadoop comes with Job and Task Trackers that continuously monitor program execution across the nodes of the cluster.
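A Hadoop Streaming job conventionally exchanges tab-separated key-value lines between mapper and reducer. The following word-count pair is a sketch written as functions over line iterables for clarity; in a real Streaming job each would be a standalone script reading `sys.stdin` and printing to stdout:

```python
def streaming_mapper(lines):
    # Emit one "word<TAB>1" line for every word on every input line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reducer(sorted_lines):
    # Hadoop delivers mapper output to each reducer sorted by key, so
    # equal keys are adjacent; sum consecutive counts per key.
    current, total = None, 0
    for line in sorted_lines:
        key, count = line.rsplit("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

mapped = sorted(streaming_mapper(["map reduce", "map"]))
print(list(streaming_reducer(mapped)))  # ['map\t2', 'reduce\t1']
```

The `sorted(...)` call stands in for the shuffle-and-sort that the Hadoop framework performs between the two executables.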
3.1.1 Data locality
Hadoop attempts to automatically collocate the data with the computing node. That is, Hadoop schedules Map tasks close to the data on which they will work, where 'close' means the same node or, at least, the same rack. In April 2008 a Hadoop program running on a 910-node cluster broke a world record by sorting a terabyte of data in 3.5 minutes, and performance improvements are expected to continue as Hadoop matures [21].
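The locality preference can be sketched as choosing, for each Map task, a free node that already holds a replica of the task's input split (the node names and data structures here are hypothetical, not Hadoop's actual scheduler API):

```python
def pick_node(split_replicas, free_nodes):
    # Prefer a free node that already stores a replica of the input
    # split ("close" data); otherwise fall back to any free node,
    # which models a rack-local or remote read.
    for node in free_nodes:
        if node in split_replicas:
            return node
    return free_nodes[0]

replicas = {"n1", "n4", "n7"}       # nodes holding this split's replicas
print(pick_node(replicas, ["n2", "n4", "n9"]))  # n4: data-local choice
```

Scheduling this way turns the replica placement policy into a performance feature: replicas exist for fault tolerance, but they also multiply the chances of a data-local assignment.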
3.1.2 Fault-tolerant
Tasks must be independent in nature, with the exception of mappers feeding into reducers under Hadoop's control. Hadoop is designed to detect task failure and restart programs on other healthy nodes; node failures are handled automatically, with tasks rescheduled as needed. However, a single point of failure currently remains: the single NameNode of the HDFS file system.
3.1.3 Reliability
Hadoop provides reliability by replicating data across multiple nodes, so RAID storage is not needed.
3.1.4 Programming support
Data flow is implicit and managed automatically; no additional coding is required to manage the execution of a job across the cluster. For tasks that fit the MapReduce model, Hadoop simplifies the development of fault-tolerant, large-scale, distributed applications on a cluster of machines.
3.1.5 MapReduce Paradigm
Hadoop uses a MapReduce execution engine [6][22][23] to implement its fault-tolerant distributed computing system over the large data sets stored in the cluster's distributed file system. The MapReduce method was popularized by its use at Google, which patented it for use on clusters and licensed it to Apache [24]; further development continues within an extensive community of researchers [25].
There are separate Map and Reduce steps, each executed in parallel and each operating on sets of key-value pairs. Program execution is therefore split into a Map and a Reduce stage, separated by data transfer between nodes in the cluster.
In the first stage, each node executes a Map function on a segment of the input data. The Map output is a set of records in the form of key-value pairs, stored on that node. The records for a given key may be spread across many nodes, and are then aggregated at the node running the Reducer for that key; this phase involves data transfer between machines. The second (Reduce) stage is blocked from progressing until all the data from the Map stage has been transferred to the appropriate machines. Finally, the Reduce stage produces another set of key-value pairs as output. This is a fairly simple programming model, restricted to the use of key-value pairs, but a surprising number of tasks and algorithms fit into this framework.
3.1.6 HDFS file system
HDFS also has certain disadvantages. It handles continuous updates (write-many workloads) less well than a classic relational database management system, and it cannot be directly mounted onto an existing operating system, so moving data into and out of HDFS can be inconvenient.

4. Extensions on Hadoop
4.1 Mahout
Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering, and classification. Many of these implementations use the Apache Hadoop platform. Mahout [26][27] also provides Java libraries for common mathematical operations and primitive Java collections.
4.2 Hive
Hive [28] is a data warehouse framework built on top of Hadoop. It was originally developed at Facebook for ad hoc querying with an SQL-like query language, and it is also used for more complex analysis. In Hive, users define tables and columns; data is loaded into and retrieved through these tables. HiveQL, an SQL-like query language, is used to generate summaries, reports, and analyses. Hive queries launch MapReduce jobs, and Hive is designed for batch processing rather than real-time queries.
4.3 Pig
Pig [29] is an execution framework whose compiler produces sequences of MapReduce programs for execution within Hadoop. The Pig framework is based on a high-level data-flow language called Pig Latin, and, like Hive, Pig is designed for batch processing of data. Its infrastructure layer consists of a compiler that turns relatively short Pig Latin programs into sequences of MapReduce programs.
4.4 MapRedoop
The name MapRedoop [32] is used interchangeably to denote both a framework and a Domain-Specific Language (DSL). The DSL provides a higher level of abstraction for specifying a computational or configuration need in the domain, and serves to minimize the accidental complexities of the solution space.
4.5 Cascading
Cascading [31] is a project providing a programming API for defining and executing fault-tolerant data processing workflows on a Hadoop cluster. It is a thin, open-source Java library that sits on top of the Hadoop MapReduce layer and provides a query processing API that allows programmers to operate at a higher level than raw MapReduce. Cascading lets developers assemble complex distributed processes quickly and schedule them based on their dependencies.
4.6 RHIPE
RHIPE [33] is the R and Hadoop Integrated Programming Environment, a merger of R and Hadoop. R is a widely acclaimed interactive language and environment for data analysis, while Hadoop consists of the Hadoop Distributed File System (HDFS) and the MapReduce distributed compute engine. RHIPE allows an analyst to carry out the analysis of large, complex data entirely from within R; it communicates with Hadoop to carry out the large, parallel computations.
4.7 HadoopDB
HadoopDB [8][9] is an open-source database management system built entirely from open-source components, including Hive, which provides a SQL interface to the system. HadoopDB gives Hadoop access to multiple single-node DBMS servers deployed across the cluster, and pushes as much data processing as possible into the database engines by issuing SQL queries. Applying techniques from the database world leads to a performance boost, especially for more complex data analysis, while HadoopDB's reliance on the MapReduce framework gives it scalability and fault/heterogeneity tolerance similar to Hadoop's.

5. Conclusion
In this paper, we briefly discussed MapReduce and its implementation framework, Hadoop, together with Hadoop's benefits, viz. data locality, fault tolerance, and reliability. We also discussed various tools that extend the capability of Hadoop into different domains and languages, in order to give developers in the field of big data a better understanding of these frameworks. It is clear that MapReduce provides good scalability and fault tolerance for massive data processing; however, its efficiency, and especially its I/O cost, still needs to be addressed for successful applications.
References
[1] Cooper, Brian F., et al. "PNUTS: Yahoo!'s hosted data serving platform." Proceedings of the VLDB Endowment 1.2 (2008): 1277-1288.
[2] Zhou, J., N. Bruno, M.-C. Wu, P.-Å. Larson, R. Chaiken, and D. Shakib. "SCOPE: parallel databases meet MapReduce." VLDB Journal 21.5 (2012): 611-636.
[3] J. Leibiusky, G. Eisbruch, and D. Simonassi. Getting Started with Storm. O'Reilly, 2012.
[4] Goodhope, Ken, et al. "Building LinkedIn's Real-time Activity Data Pipeline." IEEE Data Eng. Bull. 35.2 (2012): 33-45.
[5] Lam, Wang, et al. "Muppet: MapReduce-style processing of fast data." Proceedings of the VLDB Endowment 5.12 (2012): 1814-1825.
[6] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
[7] Lam, Chuck. Hadoop in action. Manning Publications Co., 2010.
[8] Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933.
[9] HadoopDB-project home page. [].
[10] Agrawal, Divyakant, Sudipto Das, and Amr El Abbadi. "Big data and cloud computing: current state and future opportunities." Proceedings of the 14th International Conference on Extending Database Technology. ACM, 2011.
[11] Babu, Shivnath. "Towards automatic optimization of MapReduce programs." Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 2010.
[12] Chu, Cheng-Tao, et al. "Map-reduce for machine learning on multicore." NIPS. Vol. 6. 2006.
[13] Condie, Tyson, et al. "MapReduce Online." NSDI. Vol. 10. No. 4. 2010.
[14] Review of Distributed File Systems: Concepts and Case Studies. ECE 677 Distributed Computing Systems.
[15] Hadoop - Apache Software Foundation project home page. [].
[16] Venner, Jason, and Steve Cyrus. Pro Hadoop. Vol. 1. New York, NY: Apress, 2009.
[17] White, Tom. Hadoop: The Definitive Guide: The Definitive Guide. O'Reilly Media, 2009.
[18] Hadoop user listing. [].
[19] Henschen D: Emerging Options: MapReduce, Hadoop: Young, But Impressive. Information Week 2010, 24.
[20] Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003.
[21] Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds (using Jim Gray's sort benchmark, on Yahoo's Hammer cluster of ~3800 nodes). [].
[22] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: a flexible data processing tool." Communications of the ACM 53.1 (2010): 72-77.
[23] Can Your Programming Language Do This? (MapReduce concept explained in easy-to-understand way). [ items/2006/08/01.html].
[24] Google blesses Hadoop with MapReduce patent license. [].
[25] The First International Workshop on MapReduce and its Applications (MAPREDUCE'10) - June 22nd, 2010 HPDC'2010, Chicago, IL, USA. [].
[26] Anil, Robin, Ted Dunning, and Ellen Friedman. Mahout in action. Manning, 2011.
[27] Ingersoll, Grant. "Introducing apache mahout." (2009).
[28] Hive - Apache Software Foundation project home page. [].
[29] Pig - Apache Software Foundation project home page. [].
[30] Mahout - Apache Software Foundation project home page. [].
[31] Cascading - project home page. [].
[32] Jacob, Ferosh, et al. "Simplifying the Development and Deployment of MapReduce Algorithms." International Journal of Next-Generation Computing 2.2 (2011).
[33] Guha, Saptarshi, et al. "Large complex data: divide and recombine (D&R) with RHIPE." Stat 1.1 (2012): 53-67.
