Abstract
Recently, interactive jobs, in addition to traditional batch jobs, have been attracting attention in supercomputer systems. We expect overcommitting scheduling, in which multiple HPC jobs share computational resources, to accommodate such jobs while keeping resource utilization high and response times low. Realizing overcommitting scheduling requires two approaches: 1) understanding the impact on performance when various applications share resources, and 2) predicting that performance before overcommitting. With this knowledge, we can optimize overcommitting scheduling so that it avoids performance degradation of jobs. In this paper, we describe the overall picture of overcommitting scheduling and take the two approaches above. We confirm that overcommitting improves overall system performance, and we build a performance model that can be applied to job scheduling.
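As a loose illustration of the kind of performance prediction the abstract refers to, the sketch below estimates the slowdown of a job when it shares CPU cores with another job, based on each job's CPU utilization measured in isolation. The function name and the simple linear contention model are hypothetical assumptions for illustration only, not the performance model developed in the paper.

```python
def predicted_slowdown(util_a: float, util_b: float) -> float:
    """Hypothetical sketch (not the authors' model): estimate the slowdown
    factor of job A when co-scheduled with job B on the same cores.

    util_a, util_b: average CPU utilization of each job measured alone,
    in [0, 1]. If the combined demand fits within the cores, no contention
    is assumed; otherwise execution time is assumed to stretch in
    proportion to the oversubscription of CPU time.
    """
    demand = util_a + util_b
    if demand <= 1.0:
        return 1.0  # combined demand fits: no predicted degradation
    return demand   # e.g., two fully CPU-bound jobs -> 2x slowdown each


# A CPU-bound batch job sharing cores with a mostly idle interactive job:
assert predicted_slowdown(1.0, 0.1) == 1.1
# Two fully CPU-bound jobs contending for the same cores:
assert predicted_slowdown(1.0, 1.0) == 2.0
```

A scheduler could use such an estimate to decide whether pairing two jobs on the same node keeps the predicted degradation below an acceptable threshold; the paper's actual model is built from measured application behavior rather than this simplification.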
Notes
- 1. Some batch jobs can exhibit fluctuating CPU utilization due to file I/O or network activity. For simplicity, this paper omits discussion of such jobs; however, overcommitting scheduling is expected to work efficiently for them as well.
- 2. If the co-scheduled job is a batch job, it is the same as the execution time.
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number JP19H04121.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Minami, S., Endo, T., Nomura, A. (2021). Measurement and Modeling of Performance of HPC Applications Towards Overcommitting Scheduling Systems. In: Klusáček, D., Cirne, W., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2021. Lecture Notes in Computer Science, vol 12985. Springer, Cham. https://doi.org/10.1007/978-3-030-88224-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88223-5
Online ISBN: 978-3-030-88224-2
eBook Packages: Computer Science (R0)