Scalable parallel computing on clouds using Twister4Azure iterative MapReduce

https://doi.org/10.1016/j.future.2012.05.027

Abstract

Recent advances in data-intensive computing for science discovery are fueling a dramatic growth in the use of data-intensive iterative computations. The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure and storage services, offers a very attractive environment in which scientists can perform data analytics. The challenges to large-scale distributed computations on cloud environments demand innovative computational frameworks that are specifically tailored for cloud characteristics to easily and effectively harness the power of clouds. Twister4Azure is a distributed decentralized iterative MapReduce runtime for Windows Azure Cloud. Twister4Azure extends the familiar, easy-to-use MapReduce programming model with iterative extensions, enabling fault-tolerant execution of a wide array of data mining and data analysis applications on the Azure cloud. Twister4Azure utilizes the scalable, distributed and highly available Azure cloud services as the underlying building blocks, and employs a decentralized control architecture that avoids single points of failure. Twister4Azure optimizes the iterative computations using multi-level caching of data, cache-aware decentralized task scheduling, hybrid tree-based data broadcasting and hybrid intermediate data communication. This paper presents the Twister4Azure iterative MapReduce runtime and a study of four real-world data-intensive scientific applications implemented using Twister4Azure: two iterative applications, Multi-Dimensional Scaling and KMeans Clustering, and two pleasingly parallel applications, BLAST+ sequence searching and SmithWaterman sequence alignment. Performance measurements show results that are comparable to, or a factor of 2 to 4 better than, those of traditional MapReduce runtimes when deployed on up to 256 instances and for jobs with tens of thousands of tasks.
We also study and present solutions to several factors that affect the performance of iterative MapReduce applications on Windows Azure Cloud.

Highlights

► Twister4Azure is an iterative MapReduce framework optimized for Azure Cloud.
► Efficient, easy-to-use, scalable parallel computation can be performed using Twister4Azure.
► Twister4Azure features a lightweight, distributed, decentralized architecture with fault tolerance.
► Four scientific applications, Multi-Dimensional Scaling, KMeans Clustering, BLAST+ and all-pairs sequence alignment, are implemented using Twister4Azure.
► Applications perform comparably to or better than traditional MR frameworks (e.g. Hadoop).

Introduction

The current parallel computing landscape is increasingly populated by a growing set of data-intensive computations that require enormous amounts of computational and storage resources, as well as novel distributed computing frameworks. The pay-as-you-go cloud computing model provides an option for the computational and storage needs of such computations. The new generation of distributed computing frameworks, such as MapReduce, focuses on catering to the needs of such data-intensive computations.

Iterative computations are at the core of the vast majority of large-scale data-intensive computations. Many important data-intensive iterative scientific computations can be implemented as iterative computation and communication steps, in which computations inside an iteration are independent and are synchronized at the end of each iteration through reduce and communication steps, making it possible for individual iterations to be parallelized using technologies such as MapReduce. Examples of such applications include dimensional scaling, many clustering algorithms, many machine learning algorithms, and expectation maximization applications, among others. The growth of such data-intensive iterative computations in number as well as importance is driven partly by the need to process massive amounts of data, and partly by the emergence of data-intensive computational fields, such as bioinformatics, chemical informatics and web mining.
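
The iteration structure described above can be sketched with KMeans Clustering, one of the applications studied in this paper. The following is a minimal single-process illustration, not the Twister4Azure API: the function names (kmeans_map, kmeans_reduce, iterative_kmeans) and the convergence tolerance are assumptions made for this example. Each map call is independent of the others, and the reduce step is the synchronization point at the end of every iteration.

```python
from collections import defaultdict


def kmeans_map(point, centroids):
    """Map: assign one point to its nearest centroid (independent per point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances)), point


def kmeans_reduce(points):
    """Reduce: average all points that were assigned to the same centroid."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def iterative_kmeans(points, centroids, max_iterations=100, tolerance=1e-9):
    """Driver loop (hypothetical): one MapReduce pass per iteration."""
    iterations_run = 0
    for _ in range(max_iterations):
        iterations_run += 1

        # Map phase: every point is processed independently.
        groups = defaultdict(list)
        for point in points:
            key, value = kmeans_map(point, centroids)
            groups[key].append(value)

        # Reduce phase: synchronization point at the end of the iteration.
        new_centroids = list(centroids)
        for key, assigned in groups.items():
            new_centroids[key] = kmeans_reduce(assigned)

        # Stop once no centroid moved by more than the tolerance.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift <= tolerance:
            break
    return centroids, iterations_run
```

In a distributed runtime the map calls would run as parallel tasks and the new centroids would be broadcast to the workers before the next iteration; here both steps are ordinary loops so that the control flow stays visible.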

Twister4Azure is a distributed decentralized iterative MapReduce runtime for Windows Azure Cloud that has been developed utilizing Azure cloud infrastructure services. Twister4Azure extends the familiar, easy-to-use MapReduce programming model with iterative extensions, enabling a wide array of large-scale iterative data analysis and scientific applications to utilize the Azure platform easily and efficiently in a fault-tolerant manner. Twister4Azure effectively utilizes the eventually consistent, high-latency Azure cloud services to deliver performance that is comparable to traditional MapReduce runtimes for non-iterative MapReduce, while outperforming traditional MapReduce runtimes for iterative MapReduce computations. Twister4Azure has minimal management and maintenance overheads and provides users with the capability to dynamically scale up or down the amount of computing resources. Twister4Azure takes care of almost all the Azure infrastructure (service failures, load balancing, etc.) and coordination challenges, and frees users from having to deal with the complexity of the cloud services. Windows Azure claims to allow users to “focus on your applications, not the infrastructure”. Twister4Azure takes that claim one step further and lets users focus only on the application logic without worrying about the application architecture.

Applications of Twister4Azure can be categorized according to three classes of application patterns. The first of these are the Map only applications, which are also called pleasingly (or embarrassingly) parallel applications. Examples of this type of applications include Monte Carlo simulations, BLAST+ sequence searches, parametric studies and most of the data cleansing and pre-processing applications. Section 4.5 analyzes the BLAST+ [1] Twister4Azure application.

The second type of applications includes the traditional MapReduce type applications, which utilize the reduction phase and other features of MapReduce. Twister4Azure contains sample implementations of the SmithWaterman-GOTOH (SWG) [2] pairwise sequence alignment and Word Count as traditional MapReduce type applications. Section 4.4 analyzes the SWG Twister4Azure application.

The third and most important type of applications Twister4Azure supports is the iterative MapReduce type applications. As mentioned above, there exist many data-intensive scientific computation algorithms that rely on iterative computations, wherein each iterative step can be easily specified as a MapReduce computation. Sections 4.2 and 4.3 present detailed analyses of the Multi-Dimensional Scaling and KMeans Clustering iterative MapReduce implementations. Twister4Azure also contains an iterative MapReduce implementation of PageRank, and we are actively working on implementing more iterative scientific applications using Twister4Azure.

Developing Twister4Azure was an incremental process, which began with the development of pleasingly parallel cloud programming frameworks [3] for bioinformatics applications utilizing cloud infrastructure services. The MRRoles4Azure [4] MapReduce framework for Azure cloud was developed based on the success of pleasingly parallel cloud frameworks and was released in late 2010. We started working on Twister4Azure to fill the void of distributed parallel programming frameworks in the Azure environment (as of June 2010) and the first public beta release of Twister4Azure (http://salsahpc.indiana.edu/twister4azure/) was made available in mid-2011.

MapReduce

The MapReduce [5] data-intensive distributed computing paradigm was introduced by Google as a solution for processing massive amounts of data using commodity clusters. MapReduce provides an easy-to-use programming model that features fault tolerance, automatic parallelization, scalability and data locality-based optimizations.
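
As a minimal illustration of the programming model, the sketch below runs Word Count (mentioned earlier as a traditional MapReduce sample application) in a single process. The helper run_mapreduce and the function names are hypothetical and exist only for this example; a real runtime would distribute the map and reduce tasks across machines and handle failures, which this sketch does not attempt.

```python
from collections import defaultdict


def word_count_map(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            grouped[key].append(value)  # shuffle: collect values per key
    return dict(reduce_fn(key, values) for key, values in grouped.items())
```

The user supplies only the two small functions; grouping by key, scheduling and (in a real framework) fault tolerance belong to the runtime.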

Apache Hadoop

Apache Hadoop [6] MapReduce is a widely used open-source implementation of the Google MapReduce [5] distributed data processing framework. Apache Hadoop MapReduce uses the

Twister4Azure-Iterative MapReduce

Twister4Azure is an iterative MapReduce framework for the Azure cloud that extends the MapReduce programming model to support data-intensive iterative computations. Twister4Azure enables a wide array of large-scale iterative data analysis and data mining applications to utilize the Azure cloud platform in an easy, efficient and fault-tolerant manner. Twister4Azure extends the MRRoles4Azure architecture by utilizing the scalable, distributed and highly available Azure cloud services as the

Methodology

In this section, we present and analyze four real-world data-intensive scientific applications that were implemented using Twister4Azure. Two of these applications, Multi-Dimensional Scaling and KMeans Clustering, are iterative MapReduce applications, while the other two applications, sequence alignment and sequence search, are pleasingly parallel MapReduce applications.

We compare the performance of the Twister4Azure implementations of these applications with the Twister [8] and Hadoop [6]

Performance considerations for data caching on Azure

In this section, we present a performance analysis of several data caching strategies that affect the performance of large-scale parallel iterative MapReduce applications on Azure, in the context of a Multi-Dimensional Scaling application presented in Section 4.2. These applications typically perform tens to hundreds of iterations. Hence, we focus mainly on optimizing the performance of the majority of iterations, while assigning a lower priority to optimizing the initial iteration.
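
The reason for prioritizing the later iterations can be seen in a small sketch of in-memory caching of loop-invariant input data. This is an illustration under stated assumptions, not the Twister4Azure caching implementation: the class InMemoryDataCache, the fetch callback that stands in for a download from cloud storage, and run_iterations are all hypothetical names.

```python
class InMemoryDataCache:
    """Hypothetical cache for input data that does not change across iterations."""

    def __init__(self, fetch):
        self._fetch = fetch  # assumed stand-in for a cloud storage download
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
        else:
            # Only the first access to a block pays the download cost.
            self.misses += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


def run_iterations(block_ids, cache, iterations):
    """Read every input block once per iteration and fold it into a total."""
    total = 0
    for _ in range(iterations):
        for block_id in block_ids:
            total += sum(cache.get(block_id))
    return total
```

With two input blocks and ten iterations, only the two accesses in the first iteration miss the cache and the remaining eighteen are served from memory, so the cost of the initial iteration is amortized over the rest of the run.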

In this

Related work

CloudMapReduce [24] for Amazon Web Services (AWS) and Google AppEngine MapReduce [25] follow an architecture similar to that of MRRoles4Azure, as they utilize cloud services as their building blocks. Amazon ElasticMapReduce [26] offers Apache Hadoop as a hosted service on the Amazon AWS cloud environment. However, none of them supports iterative MapReduce. Spark [27] is a framework implemented using Scala to support interactive MapReduce-like operations to query and process read-only data

Conclusions and future work

We presented Twister4Azure, a novel iterative MapReduce distributed computing runtime for Windows Azure Cloud. Twister4Azure enables users to perform large-scale data-intensive parallel computations efficiently on Windows Azure Cloud, by hiding the complexity of scalability and fault tolerance when using clouds. The key features of Twister4Azure presented in this paper include the novel programming model for iterative MapReduce computations, the multi-level data caching mechanisms to overcome

Acknowledgments

This work is funded in part by the Microsoft Azure Grant. We would also like to thank Geoffrey Fox and Seung-Hee Bae for many discussions. TG was supported by a fellowship sponsored by Persistent Systems.

References (29)

  • T.F. Smith et al., Identification of common molecular subsequences, Journal of Molecular Biology (1981)
  • O. Gotoh, An improved algorithm for matching biological sequences, Journal of Molecular Biology (1982)
  • C. Camacho et al., BLAST+: architecture and applications, BMC Bioinformatics (2009)
  • J. Ekanayake et al., Cloud technologies for bioinformatics applications, IEEE Transactions on Parallel and Distributed Systems (2011)
  • T. Gunarathne et al., Cloud computing paradigms for pleasingly parallel biomedical applications, Concurrency and Computation: Practice and Experience (2011)
  • T. Gunarathne, W. Tak-Lon, J. Qiu, G. Fox, MapReduce in the clouds for science, IEEE Second International Conference on...
  • J. Dean et al., MapReduce: simplified data processing on large clusters, Communications of the ACM (2008)
  • Apache Hadoop, Retrieved Mar. 20, 2012,...
  • Hadoop Distributed File System HDFS, Retrieved Mar. 20, 2012,...
  • J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, G. Fox, Twister: a runtime for iterative MapReduce, in:...
  • Z. Bingjing, R. Yang, W. Tak-Lon, J. Qiu, A. Hughes, G. Fox, Applying twister to scientific applications, IEEE Second...
  • Apache ActiveMQ open source messaging system; Retrieved Mar. 20, 2012....
  • Microsoft Daytona, Retrieved Mar. 20, 2012....
  • J. Lin et al., Data-intensive text processing with MapReduce, Synthesis Lectures on Human Language Technologies (2010)

Thilina Gunarathne is a Ph.D. candidate at the School of Informatics and Computing at Indiana University. Thilina has engaged in research in the fields of distributed & parallel computing, cloud computing, many/multicore systems and SOA. His current research focuses on exploring architectures and programming models for scalable parallel computing on cloud environments. He has contributed to several open source projects at the Apache Software Foundation as a committer and a PMC member starting from 2004. He received his B.Sc. (Computer Science and Engineering) from the University of Moratuwa, Sri Lanka in 2006 and M.Sc. (Computer Science) from Indiana University in 2009.

Bingjing Zhang is a Ph.D. candidate at the School of Informatics and Computing at Indiana University. His current research focuses on the design and optimization of iterative MapReduce frameworks. He is one of the main contributors to the Twister iterative MapReduce framework. He received his M.Sc. (Computer Science) from Indiana University in 2011, M.E. (Software Engineering) from Nanjing University in 2009, and B.E. (Software Engineering) from Nanjing University in 2007.

Tak-Lon (Stephen) Wu is a graduate student pursuing a Ph.D. degree at the School of Informatics and Computing at Indiana University, Bloomington. In addition to his studies, he works as a Graduate Research Assistant with the SalsaHPC Team, Community Grids Lab at Indiana University Bloomington. He has also been an associate instructor for Dr. Judy Qiu's courses, Distributed System and Cloud Computing, for four semesters. He received his B.Sc. (Computer Science and Information Engineering) from National Central University, Taiwan in 2008 and M.Sc. (Computer Science) from Indiana University in 2010.

Dr. Judy Qiu is an Assistant Professor of Computer Science at Indiana University. Her areas of study include parallel and distributed systems, Cloud/Grid computing and high performance computing. She has extensive research experience in multicore computing and the use of cloud platforms and systems for scientific data analysis. Her research thrust is Data-Enabled Discovery Environments for Science and Engineering with novel technologies driven by applications. She has pioneered the use of Iterative MapReduce on both HPC and commercial cloud environments. She collaborates in several applications to motivate and validate her computer science systems work. Current work is in Genomics, Proteomics and Network Science. Her work has been funded by NSF, NIH, Microsoft, and Indiana University Faculty Research Support Program.
