An optimized MapReduce workflow scheduling algorithm for heterogeneous computing

The Journal of Supercomputing

Abstract

The MapReduce framework is considered an effective solution for large-scale parallel data processing. This paper treats a massive data processing workflow as a directed acyclic graph (DAG) of MapReduce jobs. In a heterogeneous computing environment, computation speed can differ from job to job even on the same slot. To address this problem, this paper proposes an optimized MapReduce workflow scheduling algorithm. The algorithm comprises a job prioritizing phase and a task assignment phase. First, jobs are classified as I/O-intensive or computing-intensive, and the priority of each job is computed according to its type. Then, suitable slots are allocated for each block, and the MapReduce tasks in the workflow are scheduled with respect to data locality. The experimental results show that the optimized MapReduce workflow scheduling algorithm improves task scheduling performance and yields a more rational resource allocation in heterogeneous computing environments.
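The two phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, whose formulas are not given on this page. It assumes a HEFT-style upward rank for the prioritizing phase and a greedy earliest-finish rule with a fixed remote-read penalty for the assignment phase. The names Job, Slot, REMOTE_PENALTY, priorities, schedule, and example, and all numbers, are hypothetical.

```python
from dataclasses import dataclass, field

REMOTE_PENALTY = 1.0  # assumed fixed cost of reading a non-local block


@dataclass
class Job:
    name: str
    kind: str        # "io" or "cpu" (hypothetical type labels)
    work: float      # abstract amount of work, must be > 0
    block_node: str  # node that stores the job's input block
    deps: list = field(default_factory=list)  # names of predecessor jobs


@dataclass
class Slot:
    name: str
    node: str
    speed: dict      # per-type speed, e.g. {"io": 2.0, "cpu": 1.0}
    free_at: float = 0.0


def priorities(jobs, slots):
    """Phase 1 (assumed): upward rank = mean cost + longest successor chain."""
    by_name = {j.name: j for j in jobs}
    succ = {j.name: [] for j in jobs}
    for j in jobs:
        for d in j.deps:
            succ[d].append(j.name)
    rank = {}

    def visit(name):
        if name not in rank:
            j = by_name[name]
            # The cost depends on the job type because slot speed is per type.
            mean = sum(j.work / s.speed[j.kind] for s in slots) / len(slots)
            rank[name] = mean + max((visit(c) for c in succ[name]), default=0.0)
        return rank[name]

    for j in jobs:
        visit(j.name)
    return rank


def schedule(jobs, slots):
    """Phase 2 (assumed): place each job on the slot that finishes it first."""
    rank = priorities(jobs, slots)
    finish, plan = {}, []
    # Descending rank is a valid topological order when every work > 0.
    for j in sorted(jobs, key=lambda job: -rank[job.name]):
        ready = max((finish[d] for d in j.deps), default=0.0)

        def finish_time(s):
            start = max(ready, s.free_at)
            penalty = 0.0 if s.node == j.block_node else REMOTE_PENALTY
            return start + penalty + j.work / s.speed[j.kind]

        best = min(slots, key=finish_time)
        end = finish_time(best)
        best.free_at = end
        finish[j.name] = end
        plan.append((j.name, best.name, end))
    return plan


def example():
    """Four-job diamond workflow on two slots with opposite strengths."""
    jobs = [
        Job("A", "io", 4.0, "n1"),
        Job("B", "cpu", 6.0, "n2", ["A"]),
        Job("C", "io", 2.0, "n1", ["A"]),
        Job("D", "cpu", 4.0, "n2", ["B", "C"]),
    ]
    slots = [
        Slot("s1", "n1", {"io": 2.0, "cpu": 1.0}),
        Slot("s2", "n2", {"io": 1.0, "cpu": 2.0}),
    ]
    return schedule(jobs, slots)
```

In this toy setup, example() places the I/O-intensive jobs A and C on s1 and the computing-intensive jobs B and D on s2, with the workflow finishing at time 7.0. Each job lands on the slot that is both faster for its type and local to its input block, which is the kind of type-aware, locality-aware placement the abstract describes.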


Acknowledgments

The authors are grateful to the three anonymous reviewers for their criticism and comments, which have helped to improve the presentation and quality of the paper. This work is supported by the Key Program of National Natural Science Foundation of China (Grant Nos. 61133005, 61432005) and the National Natural Science Foundation of China (Grant Nos. 61103047, 61370095).

Author information

Corresponding author

Correspondence to Zhuo Tang.

About this article

Cite this article

Tang, Z., Liu, M., Ammar, A. et al. An optimized MapReduce workflow scheduling algorithm for heterogeneous computing. J Supercomput 72, 2059–2079 (2016). https://doi.org/10.1007/s11227-014-1335-2

  • DOI: https://doi.org/10.1007/s11227-014-1335-2
