GCSS: a global collaborative scheduling strategy for wide-area high-performance computing

  • Research Article
  • Published in: Frontiers of Computer Science

Abstract

Wide-area high-performance computing is widely used for large-scale parallel computing applications owing to its abundant computing and storage resources. However, the geographical distribution of these resources makes efficient task distribution and data placement challenging. To improve system performance, this study proposes a two-level global collaborative scheduling strategy for wide-area high-performance computing environments. The strategy integrates lightweight solution selection, redundant data placement, and task stealing mechanisms, optimizing task distribution and data placement to achieve efficient computing in wide-area environments. The experimental results indicate that, compared with the state-of-the-art collaborative scheduling algorithm HPS+, the proposed strategy reduces the makespan by 23.24%, improves computing and storage resource utilization by 8.28% and 21.73%, respectively, and incurs comparable global data migration costs.
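
The abstract only outlines the strategy at a high level. As a rough illustration of the general idea of two-level scheduling with data-aware placement and task stealing, the following Python sketch may help; it is not the paper's GCSS algorithm, and the cluster/task model, the WAN transfer cost, and the greedy policies used here are illustrative assumptions only.

    # A minimal, hypothetical sketch of two-level scheduling with data-aware
    # task placement and task stealing across geographically distributed
    # clusters. NOT the paper's GCSS algorithm: the model, the transfer cost,
    # and the greedy policies below are illustrative assumptions.
    from dataclasses import dataclass, field
    from typing import Dict, List, Set

    @dataclass
    class Task:
        name: str
        work: float               # compute demand, arbitrary units
        data: Dict[str, float]    # dataset name -> size in GB

    @dataclass
    class Cluster:
        name: str
        speed: float                                     # relative compute speed
        datasets: Set[str] = field(default_factory=set)  # datasets stored locally
        queue: List[Task] = field(default_factory=list)

    def transfer_cost(task: Task, cluster: Cluster, cost_per_gb: float = 1.0) -> float:
        # Only data that is not already stored on the cluster crosses the WAN.
        missing = sum(size for name, size in task.data.items()
                      if name not in cluster.datasets)
        return missing * cost_per_gb

    def global_dispatch(tasks: List[Task], clusters: List[Cluster]) -> None:
        # Level 1 (global): place each task on the cluster with the lowest
        # estimated cost = queued work / speed + WAN data transfer cost.
        for task in tasks:
            def estimate(c: Cluster) -> float:
                queued = sum(t.work for t in c.queue)
                return (queued + task.work) / c.speed + transfer_cost(task, c)
            best = min(clusters, key=estimate)
            best.queue.append(task)
            best.datasets.update(task.data)  # assume inputs are replicated on arrival

    def steal(idle: Cluster, clusters: List[Cluster]) -> None:
        # Level 2 (runtime): an idle cluster steals the cheapest-to-move task
        # from the most loaded cluster to rebalance work.
        victim = max(clusters, key=lambda c: sum(t.work for t in c.queue))
        if victim is idle or not victim.queue:
            return
        task = min(victim.queue, key=lambda t: transfer_cost(t, idle))
        victim.queue.remove(task)
        idle.queue.append(task)

    if __name__ == "__main__":
        clusters = [Cluster("site-A", speed=2.0, datasets={"wrf-input"}),
                    Cluster("site-B", speed=1.0)]
        tasks = [Task("t1", 10, {"wrf-input": 50}),
                 Task("t2", 4, {"obs": 5}),
                 Task("t3", 6, {"wrf-input": 50})]
        global_dispatch(tasks, clusters)
        steal(clusters[1], clusters)      # site-B runs dry and steals a task
        for c in clusters:
            print(c.name, [t.name for t in c.queue])

In a real wide-area system the cost estimate would also need to account for queue-wait prediction, replica selection, and inter-site bandwidth, which this toy model collapses into a single per-gigabyte transfer cost.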

Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFB0203901), the National Natural Science Foundation of China (Grant No. 61772053), and the fund of the State Key Laboratory of Software Development Environment (SKLSDE-2020ZX-15).

Author information


Corresponding authors

Correspondence to Limin Xiao or Guangjun Qin.

Additional information

Yao Song received his BS degree in spacecraft design and engineering from Beihang University, China in 2016. He is currently pursuing his PhD degree in cyberspace security at Beihang University. His main research interests include cyberspace security, parallel and distributed file systems, high performance computing, and scheduling systems.

Limin Xiao received his BS degree in computer science from Tsinghua University, China in 1993, and his MS and PhD degrees in computer science from the Institute of Computing, Chinese Academy of Sciences, China in 1996 and 1998, respectively. He is a professor in the School of Computer Science and Engineering, Beihang University, China, and a senior member of the China Computer Federation. His main research areas are computer architecture, computer system software, high performance computing, virtualization, and cloud computing.

Liang Wang received his BS and MS degrees in electronics engineering from Harbin Institute of Technology, China in 2011 and 2013, respectively, and his PhD degree in computer science and engineering from the University of Hong Kong, China in 2017. He is currently an assistant professor with the School of Computer Science and Engineering, Beihang University, China. He was a postdoctoral research fellow at the Institute of Microelectronics, Tsinghua University, China from 2017 to 2020. His research interests include power-efficient and reliability-aware design for network-on-chip and many-core systems.

Guangjun Qin received his MS degree in computer application technology from Zhengzhou University, China in 2006 and his PhD degree in computer architecture from Beihang University, China in 2015. From 2015 to 2017, he was a postdoctoral fellow at Beihang University, China. Since 2017, he has been a lecturer at the Smart City College, Beijing Union University, China. He is the author of one book and more than 20 articles, and a member of the China Computer Federation. His main research areas are computer architecture, information security, and big data analysis.

Bing Wei received his BS degree in electrical engineering and his MS degree in computer science from Capital Normal University, China in 2012 and 2015, respectively. He is currently pursuing his PhD degree in computer science at Beihang University, China. His main research interests include file systems, high performance computing, software engineering, and clusters.

Baicheng Yan received his BS degree in computer science and technology from Harbin Engineering University, China in 2016. He is currently pursuing his PhD degree in computer science at Beihang University, China. His research interests include high performance computing, parallel and distributed computing.

Chenhao Zhang received his BS degree in Internet of Things engineering from China University of Petroleum (East China), China in 2019. He is currently pursuing his PhD degree in computer science at Beihang University, China. His main research interests include distributed file systems, storage systems, high performance computing, big data analysis, and computer system software.


About this article


Cite this article

Song, Y., Xiao, L., Wang, L. et al. GCSS: a global collaborative scheduling strategy for wide-area high-performance computing. Front. Comput. Sci. 16, 165105 (2022). https://doi.org/10.1007/s11704-021-0353-5
