IHadoop: Asynchronous iterations for MapReduce

Handle URI:
http://hdl.handle.net/10754/564445
Title:
IHadoop: Asynchronous iterations for MapReduce
Authors:
Elnikety, Eslam Mohamed Ibrahim; El Sayed, Tamer S.; Ramadan, Hany E.
Abstract:
MapReduce is a distributed programming framework designed to ease the development of scalable data-intensive applications for large clusters of commodity machines. Most machine learning and data mining applications involve iterative computations over large datasets, such as the Web hyperlink structures and social network graphs. Yet, the MapReduce model does not efficiently support this important class of applications. The architecture of MapReduce, most critically its dataflow techniques and task scheduling, is completely unaware of the nature of iterative applications; tasks are scheduled according to a policy that optimizes the execution for a single iteration, which wastes bandwidth, I/O, and CPU cycles when compared with an optimal execution for a consecutive set of iterations. This work presents iHadoop, a modified MapReduce model, and an associated implementation, optimized for iterative computations. The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the next, allowing both to process their data concurrently. iHadoop's task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing a fast local data transfer. For those iterative applications that require satisfying certain criteria before termination, iHadoop runs the check concurrently during the execution of the subsequent iteration to further reduce the application's latency. This paper also describes our implementation of the iHadoop model, and evaluates its performance against Hadoop, the widely used open-source implementation of MapReduce. Experiments using different data analysis applications over real-world and synthetic datasets show that iHadoop performs better than Hadoop for iterative algorithms, reducing execution time of iterative applications by 25% on average. Furthermore, integrating iHadoop with HaLoop, a variant Hadoop implementation that caches invariant data between iterations, reduces execution time by 38% on average. © 2011 IEEE.
KAUST Department:
Computer Science Program
Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Journal:
2011 IEEE Third International Conference on Cloud Computing Technology and Science
Conference/Event name:
2011 3rd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2011
Issue Date:
Nov-2011
DOI:
10.1109/CloudCom.2011.21
Type:
Conference Paper
ISBN:
9780769546223
Appears in Collections:
Conference Papers; Computer Science Program
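
Illustrative sketch (not part of the record):
The abstract describes connecting the output of one iteration to the input of the next so that both process data concurrently. The Python sketch below is only a toy model of that idea, not iHadoop's implementation: each iteration is a thread, consecutive iterations are linked by in-memory queues, and the per-record function step is a hypothetical stand-in for one map and reduce pass. It ignores the shuffle barrier, task scheduling, data locality, and the concurrent termination check that the abstract mentions.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of one iteration's output


def step(record):
    """Hypothetical per-record iteration body (stands in for map + reduce)."""
    key, value = record
    return key, value * 0.5 + 1.0


def run_synchronous(records, iterations):
    """Baseline: iteration k+1 starts only after iteration k has finished."""
    for _ in range(iterations):
        records = [step(r) for r in records]
    return records


def _iteration_worker(inbox, outbox):
    """Consume records as they arrive and forward each result immediately."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # tell the next iteration that input is complete
            return
        outbox.put(step(item))


def run_asynchronous(records, iterations):
    """Pipelined: all iterations run concurrently, linked by queues."""
    queues = [queue.Queue() for _ in range(iterations + 1)]
    workers = [
        threading.Thread(target=_iteration_worker, args=(queues[i], queues[i + 1]))
        for i in range(iterations)
    ]
    for worker in workers:
        worker.start()

    # Feed the first iteration; later iterations start work as soon as
    # their predecessor emits its first record.
    for record in records:
        queues[0].put(record)
    queues[0].put(_DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)

    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    data = [("a", 4.0), ("b", 10.0), ("c", 0.0)]
    print(run_synchronous(data, 3))
    print(run_asynchronous(data, 3))
```

Both functions return the same records in the same order; the asynchronous version differs only in that a later iteration begins consuming before the earlier one has finished producing.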

Full metadata record

DC Field | Value | Language
dc.contributor.author | Elnikety, Eslam Mohamed Ibrahim | en
dc.contributor.author | El Sayed, Tamer S. | en
dc.contributor.author | Ramadan, Hany E. | en
dc.date.accessioned | 2015-08-04T07:01:10Z | en
dc.date.available | 2015-08-04T07:01:10Z | en
dc.date.issued | 2011-11 | en
dc.identifier.isbn | 9780769546223 | en
dc.identifier.doi | 10.1109/CloudCom.2011.21 | en
dc.identifier.uri | http://hdl.handle.net/10754/564445 | en
dc.description.abstract | MapReduce is a distributed programming framework designed to ease the development of scalable data-intensive applications for large clusters of commodity machines. Most machine learning and data mining applications involve iterative computations over large datasets, such as the Web hyperlink structures and social network graphs. Yet, the MapReduce model does not efficiently support this important class of applications. The architecture of MapReduce, most critically its dataflow techniques and task scheduling, is completely unaware of the nature of iterative applications; tasks are scheduled according to a policy that optimizes the execution for a single iteration, which wastes bandwidth, I/O, and CPU cycles when compared with an optimal execution for a consecutive set of iterations. This work presents iHadoop, a modified MapReduce model, and an associated implementation, optimized for iterative computations. The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the next, allowing both to process their data concurrently. iHadoop's task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing a fast local data transfer. For those iterative applications that require satisfying certain criteria before termination, iHadoop runs the check concurrently during the execution of the subsequent iteration to further reduce the application's latency. This paper also describes our implementation of the iHadoop model, and evaluates its performance against Hadoop, the widely used open-source implementation of MapReduce. Experiments using different data analysis applications over real-world and synthetic datasets show that iHadoop performs better than Hadoop for iterative algorithms, reducing execution time of iterative applications by 25% on average. Furthermore, integrating iHadoop with HaLoop, a variant Hadoop implementation that caches invariant data between iterations, reduces execution time by 38% on average. © 2011 IEEE. | en
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) | en
dc.title | IHadoop: Asynchronous iterations for MapReduce | en
dc.type | Conference Paper | en
dc.contributor.department | Computer Science Program | en
dc.identifier.journal | 2011 IEEE Third International Conference on Cloud Computing Technology and Science | en
dc.conference.date | 29 November 2011 through 1 December 2011 | en
dc.conference.name | 2011 3rd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2011 | en
dc.conference.location | Athens | en
dc.contributor.institution | Max Planck Institute for Software Systems (MPI-SWS), Saarbruecken, Germany | en
dc.contributor.institution | Cairo Microsoft Innovation Lab., Cairo, Egypt | en
kaust.author | Elnikety, Eslam Mohamed Ibrahim | en
kaust.author | El Sayed, Tamer S. | en