Scalable parallel computing on clouds using Twister4Azure iterative MapReduce
Highlights
► Twister4Azure is an iterative MapReduce framework optimized for Azure Cloud.
► Efficient, easy-to-use, scalable parallel computation can be performed using Twister4Azure.
► Twister4Azure features a lightweight, distributed, decentralized architecture with fault tolerance.
► Four scientific applications (Multi-Dimensional Scaling, KMeans Clustering, BLAST+ and all-pairs sequence alignment) are implemented using Twister4Azure.
► Applications perform comparably to or better than traditional MR frameworks (e.g. Hadoop).
Introduction
The current parallel computing landscape is increasingly populated by data-intensive computations that require enormous amounts of computational and storage resources, as well as novel distributed computing frameworks. The pay-as-you-go cloud computing model provides an option for the computational and storage needs of such computations. The new generation of distributed computing frameworks, such as MapReduce, focuses on catering to the needs of such data-intensive computations.
Iterative computations are at the core of the vast majority of large-scale data-intensive computations. Many important data-intensive iterative scientific computations can be implemented as iterative computation and communication steps, in which computations inside an iteration are independent and are synchronized at the end of each iteration through reduce and communication steps, making it possible for individual iterations to be parallelized using technologies such as MapReduce. Examples of such applications include dimensional scaling, many clustering algorithms, many machine learning algorithms, and expectation maximization applications, among others. The growth of such data-intensive iterative computations in number as well as importance is driven partly by the need to process massive amounts of data, and partly by the emergence of data-intensive computational fields, such as bioinformatics, chemical informatics and web mining.
Twister4Azure is a distributed decentralized iterative MapReduce runtime for Windows Azure Cloud that has been developed utilizing Azure cloud infrastructure services. Twister4Azure extends the familiar, easy-to-use MapReduce programming model with iterative extensions, enabling a wide array of large-scale iterative data analysis and scientific applications to utilize the Azure platform easily and efficiently in a fault-tolerant manner. Twister4Azure effectively utilizes the eventually consistent, high-latency Azure cloud services to deliver performance that is comparable to traditional MapReduce runtimes for non-iterative MapReduce, while outperforming traditional MapReduce runtimes for iterative MapReduce computations. Twister4Azure has minimal management and maintenance overheads and provides users with the capability to dynamically scale up or down the amount of computing resources. Twister4Azure takes care of almost all the Azure infrastructure (service failures, load balancing, etc.) and coordination challenges, and frees users from having to deal with the complexity of the cloud services. Windows Azure claims to allow users to “focus on your applications, not the infrastructure”. Twister4Azure takes that claim one step further and lets users focus only on the application logic without worrying about the application architecture.
Applications of Twister4Azure can be categorized into three classes of application patterns. The first class consists of Map-only applications, which are also called pleasingly (or embarrassingly) parallel applications. Examples of this type of application include Monte Carlo simulations, BLAST+ sequence searches, parametric studies and most data cleansing and pre-processing applications. Section 4.5 analyzes the BLAST+ [1] Twister4Azure application.
The second type of applications includes the traditional MapReduce type applications, which utilize the reduction phase and other features of MapReduce. Twister4Azure contains sample implementations of the SmithWaterman-GOTOH (SWG) [2] pairwise sequence alignment and Word Count as traditional MapReduce type applications. Section 4.4 analyzes the SWG Twister4Azure application.
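The traditional MapReduce pattern described above can be illustrated with Word Count. The following is a minimal single-process sketch in Python, not the Twister4Azure API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the in-memory `shuffle` step only stands in for the grouping that a real runtime performs between the map and reduce phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key (done by the runtime in a real system)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map over every split, group by key, then reduce each group."""
    intermediate = []
    for document in documents:
        # Each map call is independent of the others, so a runtime could
        # execute them in parallel; here they run sequentially.
        intermediate.extend(map_phase(document))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["a b a", "b c"])` counts each word across both input splits. A Map-only application would skip the shuffle and reduce steps and simply collect the map outputs.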
The third and most important type of application that Twister4Azure supports is the iterative MapReduce type. As mentioned above, there exist many data-intensive scientific computation algorithms that rely on iterative computations, wherein each iterative step can be easily specified as a MapReduce computation. Sections 4.2 and 4.3 present detailed analyses of the Multi-Dimensional Scaling and KMeans Clustering iterative MapReduce implementations. Twister4Azure also contains an iterative MapReduce implementation of PageRank, and we are actively working on implementing more iterative scientific applications using Twister4Azure.
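To make the iterative pattern concrete, the sketch below expresses KMeans Clustering as a loop of MapReduce steps in plain Python. It is an illustration under stated assumptions rather than Twister4Azure code: all function names are hypothetical, everything runs in one process, and the convergence test (largest squared centroid shift at or below `tolerance`) is an assumed stopping rule. The data partitions stay fixed across iterations while only the small set of centroids changes.

```python
def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(partition, centroids):
    """Map: for one data partition, emit (cluster index, (partial sum, count))."""
    partial = {}
    for point in partition:
        idx = closest(point, centroids)
        total, count = partial.get(idx, ([0.0] * len(point), 0))
        partial[idx] = ([t + p for t, p in zip(total, point)], count + 1)
    return list(partial.items())


def kmeans_reduce(partials):
    """Reduce: combine the partial sums of one cluster into its new centroid."""
    sums = [sum(column) for column in zip(*(s for s, _ in partials))]
    count = sum(n for _, n in partials)
    return [s / count for s in sums]


def iterative_kmeans(partitions, centroids, max_iterations=100, tolerance=1e-9):
    """Repeat map and reduce until the centroids stop moving."""
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        grouped = {}
        for partition in partitions:  # independent map tasks within an iteration
            for idx, partial in kmeans_map(partition, centroids):
                grouped.setdefault(idx, []).append(partial)
        # Synchronization point: every map output is needed before reducing.
        new_centroids = [
            kmeans_reduce(grouped[i]) if i in grouped else list(centroids[i])
            for i in range(len(centroids))
        ]
        shift = max(
            sum((a - b) ** 2 for a, b in zip(new, old))
            for new, old in zip(new_centroids, centroids)
        )
        centroids = new_centroids
        if shift <= tolerance:
            break
    return centroids, iteration
```

The map tasks inside one iteration are independent of each other, and the reduce step is the point at which each iteration synchronizes before the updated centroids are passed to the next one.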
Developing Twister4Azure was an incremental process, which began with the development of pleasingly parallel cloud programming frameworks [3] for bioinformatics applications utilizing cloud infrastructure services. The MRRoles4Azure [4] MapReduce framework for Azure cloud was developed based on the success of pleasingly parallel cloud frameworks and was released in late 2010. We started working on Twister4Azure to fill the void of distributed parallel programming frameworks in the Azure environment (as of June 2010) and the first public beta release of Twister4Azure (http://salsahpc.indiana.edu/twister4azure/) was made available in mid-2011.
Section snippets
MapReduce
The MapReduce [5] data-intensive distributed computing paradigm was introduced by Google as a solution for processing massive amounts of data using commodity clusters. MapReduce provides an easy-to-use programming model that features fault tolerance, automatic parallelization, scalability and data locality-based optimizations.
Apache Hadoop
Apache Hadoop [6] MapReduce is a widely used open-source implementation of the Google MapReduce [5] distributed data processing framework. Apache Hadoop MapReduce uses the
Twister4Azure-Iterative MapReduce
Twister4Azure is an iterative MapReduce framework for the Azure cloud that extends the MapReduce programming model to support data-intensive iterative computations. Twister4Azure enables a wide array of large-scale iterative data analysis and data mining applications to utilize the Azure cloud platform in an easy, efficient and fault-tolerant manner. Twister4Azure extends the MRRoles4Azure architecture by utilizing the scalable, distributed and highly available Azure cloud services as the
Methodology
In this section, we present and analyze four real-world data-intensive scientific applications that were implemented using Twister4Azure. Two of these applications, Multi-Dimensional Scaling and KMeans Clustering, are iterative MapReduce applications, while the other two applications, sequence alignment and sequence search, are pleasingly parallel MapReduce applications.
We compare the performance of the Twister4Azure implementations of these applications with the Twister [8] and Hadoop [6]
Performance considerations for data caching on Azure
In this section, we present a performance analysis of several data caching strategies that affect the performance of large-scale parallel iterative MapReduce applications on Azure, in the context of a Multi-Dimensional Scaling application presented in Section 4.2. These applications typically perform tens to hundreds of iterations. Hence, we focus mainly on optimizing the performance of the majority of iterations, while assigning a lower priority to optimizing the initial iteration.
In this
Related work
CloudMapReduce [24] for Amazon Web Services (AWS) and Google AppEngine MapReduce [25] follow an architecture similar to that of MRRoles4Azure, as they utilize cloud services as their building blocks. Amazon ElasticMapReduce [26] offers Apache Hadoop as a hosted service on the Amazon AWS cloud environment. However, none of them supports iterative MapReduce. Spark [27] is a framework implemented using Scala to support interactive MapReduce-like operations to query and process read-only data
Conclusions and future work
We presented Twister4Azure, a novel iterative MapReduce distributed computing runtime for Windows Azure Cloud. Twister4Azure enables users to perform large-scale data-intensive parallel computations efficiently on Windows Azure Cloud, by hiding the complexity of scalability and fault tolerance when using clouds. The key features of Twister4Azure presented in this paper include the novel programming model for iterative MapReduce computations, the multi-level data caching mechanisms to overcome
Acknowledgments
This work is funded in part by the Microsoft Azure Grant. We would also like to thank Geoffrey Fox and Seung-Hee Bae for many discussions. TG was supported by a fellowship sponsored by Persistent Systems.
References (29)
- et al., Identification of common molecular subsequences, Journal of Molecular Biology (1981)
- An improved algorithm for matching biological sequences, Journal of Molecular Biology (1982)
- et al., BLAST+: architecture and applications, BMC Bioinformatics (2009)
- et al., Cloud technologies for bioinformatics applications, IEEE Transactions on Parallel and Distributed Systems (2011)
- et al., Cloud computing paradigms for pleasingly parallel biomedical applications, Concurrency and Computation: Practice and Experience (2011)
- T. Gunarathne, W. Tak-Lon, J. Qiu, G. Fox, MapReduce in the clouds for science, IEEE Second International Conference on...
- et al., MapReduce: simplified data processing on large clusters, Communications of the ACM (2008)
- Apache Hadoop, Retrieved Mar. 20, 2012,...
- Hadoop Distributed File System HDFS, Retrieved Mar. 20, 2012,...
- J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, G. Fox, Twister: a runtime for iterative MapReduce, in:...
- Data-intensive text processing with MapReduce, Synthesis Lectures on Human Language Technologies
Thilina Gunarathne is a Ph.D. candidate at the School of Informatics and Computing at Indiana University. Thilina has engaged in research in the fields of distributed & parallel computing, cloud computing, many/multicore systems and SOA. His current research focuses on exploring architectures and programming models for scalable parallel computing on cloud environments. He has contributed to several open source projects in the Apache Software Foundation as a committer and a PMC member since 2004. He received his B.Sc. (Computer Science and Engineering) from the University of Moratuwa, Sri Lanka in 2006 and his M.Sc. (Computer Science) from Indiana University in 2009.
Bingjing Zhang is a Ph.D. candidate at the School of Informatics and Computing at Indiana University. His current research focuses on the design and optimization of iterative MapReduce frameworks. He is one of the main contributors to the Twister iterative MapReduce framework. He received his M.Sc. (Computer Science) from Indiana University in 2011, M.E. (Software Engineering) from Nanjing University in 2009, and B.E. (Software Engineering) from Nanjing University in 2007.
Tak-Lon (Stephen) Wu is a graduate student pursuing a Ph.D. degree at the School of Informatics and Computing at Indiana University, Bloomington. In addition to his studies, he works as a Graduate Research Assistant on the SalsaHPC Team, Community Grids Lab at Indiana University Bloomington. He has also been an associate instructor for Dr. Judy Qiu's courses, Distributed System and Cloud Computing, for four semesters. He received his B.Sc. (Computer Science and Information Engineering) from National Central University, Taiwan in 2008 and his M.Sc. (Computer Science) from Indiana University in 2010.
Dr. Judy Qiu is an Assistant Professor of Computer Science at Indiana University. Her areas of study include parallel and distributed systems, Cloud/Grid computing and high performance computing. She has extensive research experience in multicore computing and the use of cloud platforms and systems for scientific data analysis. Her research thrust is Data-Enabled Discovery Environments for Science and Engineering with novel technologies driven by applications. She has pioneered the use of Iterative MapReduce on both HPC and commercial cloud environments. She collaborates in several applications to motivate and validate her computer science systems work. Current work is in Genomics, Proteomics and Network Science. Her work has been funded by NSF, NIH, Microsoft, and Indiana University Faculty Research Support Program.