Abstract
The MapReduce framework is widely regarded as an effective solution for large-scale parallel data processing. This paper models a massive data processing workflow as a directed acyclic graph (DAG) of MapReduce jobs. In a heterogeneous computing environment, the computation speed of a given slot can differ from job to job. To address this problem, the paper proposes an optimized MapReduce workflow scheduling algorithm comprising a job prioritizing phase and a task assignment phase. First, jobs are classified as I/O-intensive or computing-intensive, and the priority of each job is computed according to its type. Then, suitable slots are allocated for each data block, and the MapReduce tasks in the workflow are scheduled with respect to data locality. Experimental results show that the algorithm improves task scheduling performance and the rationality of resource allocation in heterogeneous computing environments.
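The two phases described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives no formulas, so the priority rule (an upward rank, as used in common list-scheduling heuristics), the per-slot `io_speed` and `cpu_speed` fields, the `REMOTE_PENALTY` constant, and all class and function names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field

REMOTE_PENALTY = 1.0  # assumed fixed cost of reading a block from another node


@dataclass
class Job:
    name: str
    kind: str      # "io" or "cpu" (hypothetical labels for the two job types)
    work: float    # abstract amount of work per data block, must be > 0
    blocks: list   # (block_id, node that stores the block)
    children: list = field(default_factory=list)  # names of dependent jobs


@dataclass
class Slot:
    name: str
    node: str
    io_speed: float   # assumed speed of this slot on I/O-intensive jobs
    cpu_speed: float  # assumed speed of this slot on computing-intensive jobs
    free_at: float = 0.0


def run_time(job, slot):
    """Run time of one task; the speed used depends on the job type."""
    speed = slot.io_speed if job.kind == "io" else slot.cpu_speed
    return job.work / speed


def priorities(jobs, slots):
    """Phase 1 (assumed rule): mean run time plus the largest child rank."""
    rank = {}

    def visit(name):
        if name not in rank:
            job = jobs[name]
            mean = sum(run_time(job, s) for s in slots) / len(slots)
            rank[name] = mean + max((visit(c) for c in job.children), default=0.0)
        return rank[name]

    for name in jobs:
        visit(name)
    return rank


def schedule(jobs, slots):
    """Phase 2: give each block the slot that finishes it earliest."""
    rank = priorities(jobs, slots)
    ready_at = {name: 0.0 for name in jobs}  # earliest start allowed by parents
    plan = []
    # Decreasing rank is a valid topological order when every run time is positive.
    for name in sorted(jobs, key=lambda n: -rank[n]):
        job = jobs[name]
        job_end = ready_at[name]
        for block_id, node in job.blocks:

            def finish(slot):
                start = max(slot.free_at, ready_at[name])
                penalty = 0.0 if slot.node == node else REMOTE_PENALTY
                return start + penalty + run_time(job, slot)

            best = min(slots, key=finish)  # local slots win unless much slower
            best.free_at = finish(best)
            plan.append((name, block_id, best.name, best.free_at))
            job_end = max(job_end, best.free_at)
        for child in job.children:
            ready_at[child] = max(ready_at[child], job_end)
    return plan
```

Calling `schedule(jobs, slots)` with a dictionary of `Job` objects keyed by name and a list of `Slot` objects returns one `(job, block, slot, finish_time)` tuple per task. Folding the locality penalty into the finish-time estimate, rather than enforcing locality as a hard constraint, is a design choice of this sketch: it lets a remote slot win when it is much faster for that job type.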








Acknowledgments
The authors are grateful to the three anonymous reviewers for their criticism and comments, which have helped to improve the presentation and quality of the paper. This work is supported by the Key Program of the National Natural Science Foundation of China (Grant Nos. 61133005, 61432005) and the National Natural Science Foundation of China (Grant Nos. 61103047, 61370095).
Cite this article
Tang, Z., Liu, M., Ammar, A. et al. An optimized MapReduce workflow scheduling algorithm for heterogeneous computing. J Supercomput 72, 2059–2079 (2016). https://doi.org/10.1007/s11227-014-1335-2
DOI: https://doi.org/10.1007/s11227-014-1335-2