GCSS: a global collaborative scheduling strategy for wide-area high-performance computing

  • Research Article
  • Published in: Frontiers of Computer Science

Abstract

Wide-area high-performance computing is widely used for large-scale parallel computing applications owing to its abundant computing and storage resources. However, the geographical distribution of these resources makes efficient task distribution and data placement challenging. To improve system performance, this study proposes a two-level global collaborative scheduling strategy for wide-area high-performance computing environments. The strategy integrates lightweight solution selection, redundant data placement, and task stealing mechanisms, optimizing task distribution and data placement to achieve efficient computing in wide-area environments. The experimental results indicate that, compared with the state-of-the-art collaborative scheduling algorithm HPS+, the proposed strategy reduces the makespan by 23.24%, improves computing and storage resource utilization by 8.28% and 21.73%, respectively, and incurs comparable global data migration costs.
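
The abstract only outlines the strategy at a high level. As a rough illustration of the general idea of two-level scheduling with data-aware placement and task stealing, the following Python sketch may help; it is not the paper's GCSS algorithm, and the cluster/task model, the WAN transfer cost, and the greedy policies used here are illustrative assumptions only.

    # A minimal, hypothetical sketch of two-level scheduling with data-aware
    # task placement and task stealing across geographically distributed
    # clusters. NOT the paper's GCSS algorithm: the model, the transfer cost,
    # and the greedy policies below are illustrative assumptions.
    from dataclasses import dataclass, field
    from typing import Dict, List, Set

    @dataclass
    class Task:
        name: str
        work: float               # compute demand, arbitrary units
        data: Dict[str, float]    # dataset name -> size in GB

    @dataclass
    class Cluster:
        name: str
        speed: float                                     # relative compute speed
        datasets: Set[str] = field(default_factory=set)  # datasets stored locally
        queue: List[Task] = field(default_factory=list)

    def transfer_cost(task: Task, cluster: Cluster, cost_per_gb: float = 1.0) -> float:
        # Only data that is not already stored on the cluster crosses the WAN.
        missing = sum(size for name, size in task.data.items()
                      if name not in cluster.datasets)
        return missing * cost_per_gb

    def global_dispatch(tasks: List[Task], clusters: List[Cluster]) -> None:
        # Level 1 (global): place each task on the cluster with the lowest
        # estimated cost = queued work / speed + WAN data transfer cost.
        for task in tasks:
            def estimate(c: Cluster) -> float:
                queued = sum(t.work for t in c.queue)
                return (queued + task.work) / c.speed + transfer_cost(task, c)
            best = min(clusters, key=estimate)
            best.queue.append(task)
            best.datasets.update(task.data)  # assume inputs are replicated on arrival

    def steal(idle: Cluster, clusters: List[Cluster]) -> None:
        # Level 2 (runtime): an idle cluster steals the cheapest-to-move task
        # from the most loaded cluster to rebalance work.
        victim = max(clusters, key=lambda c: sum(t.work for t in c.queue))
        if victim is idle or not victim.queue:
            return
        task = min(victim.queue, key=lambda t: transfer_cost(t, idle))
        victim.queue.remove(task)
        idle.queue.append(task)

    if __name__ == "__main__":
        clusters = [Cluster("site-A", speed=2.0, datasets={"wrf-input"}),
                    Cluster("site-B", speed=1.0)]
        tasks = [Task("t1", 10, {"wrf-input": 50}),
                 Task("t2", 4, {"obs": 5}),
                 Task("t3", 6, {"wrf-input": 50})]
        global_dispatch(tasks, clusters)
        steal(clusters[1], clusters)      # site-B runs dry and steals a task
        for c in clusters:
            print(c.name, [t.name for t in c.queue])

In a real wide-area system the cost estimate would also need to account for queue-wait prediction, replica selection, and inter-site bandwidth, which this toy model collapses into a single per-gigabyte transfer cost.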

Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFB0203901), the National Natural Science Foundation of China (Grant No. 61772053), and the fund of the State Key Laboratory of Software Development Environment (SKLSDE-2020ZX-15).

Author information


Corresponding authors

Correspondence to Limin Xiao or Guangjun Qin.

Additional information

Yao Song received his BS degree in spacecraft design and engineering from Beihang University, China in 2016. He is currently pursuing his PhD degree in cyberspace security at Beihang University. His main research interests include cyberspace security, parallel and distributed file systems, high performance computing, and scheduling systems.

Limin Xiao received his BS degree in computer science from Tsinghua University, China in 1993, and his MS and PhD degrees in computer science from the Institute of Computing, Chinese Academy of Sciences, China in 1996 and 1998, respectively. He is a professor in the School of Computer Science and Engineering, Beihang University, China, and a senior member of the China Computer Federation. His main research areas are computer architecture, computer system software, high performance computing, virtualization, and cloud computing.

Liang Wang received his BS and MS degrees in electronics engineering from Harbin Institute of Technology, China in 2011 and 2013, respectively, and his PhD degree in computer science and engineering from the University of Hong Kong, China in 2017. He is currently an assistant professor with the School of Computer Science and Engineering, Beihang University, China. He was a postdoctoral research fellow at the Institute of Microelectronics, Tsinghua University, China from 2017 to 2020. His research interests include power-efficient and reliability-aware design for network-on-chip and many-core systems.

Guangjun Qin received his MS degree in computer application technology from Zhengzhou University, China in 2006 and his PhD degree in computer architecture from Beihang University, China in 2015. From 2015 to 2017, he was a postdoctoral fellow at Beihang University, China. Since 2017, he has been a lecturer at the Smart City College, Beijing Union University, China. He is the author of one book and more than 20 articles, and a member of the China Computer Federation. His main research areas are computer architecture, information security, and big data analysis.

Bing Wei received his BS degree in electrical engineering and his MS degree in computer science from Capital Normal University, China in 2012 and 2015, respectively. He is currently pursuing his PhD degree in computer science at Beihang University, China. His main research interests include file systems, high performance computing, software engineering, and clusters.

Baicheng Yan received his BS degree in computer science and technology from Harbin Engineering University, China in 2016. He is currently pursuing his PhD degree in computer science at Beihang University, China. His research interests include high performance computing, parallel and distributed computing.

Chenhao Zhang received his BS degree in Internet of Things engineering from China University of Petroleum (East China), China in 2019. He is currently pursuing his PhD degree in computer science at Beihang University, China. His main research interests include distributed file systems, storage systems, high performance computing, big data analysis, and computer system software.


About this article


Cite this article

Song, Y., Xiao, L., Wang, L. et al. GCSS: a global collaborative scheduling strategy for wide-area high-performance computing. Front. Comput. Sci. 16, 165105 (2022). https://doi.org/10.1007/s11704-021-0353-5
