Abstract
Regression testing of HPC systems is crucial for ensuring the quality of service offered to end users. At the same time, it poses a great challenge to systems and application engineers, who must continuously maintain regression tests that cover as many aspects of the user experience as possible. In this paper, we briefly present ReFrame, a framework for writing regression tests for HPC systems, and describe how it is used by CSCS, NERSC and OSC to continuously test their systems. ReFrame is designed to abstract away the complexity of the interactions with the system and to separate the logic of a regression test from the low-level details, which pertain to the system configuration and setup. Regression tests in ReFrame are simple Python classes that specify the basic parameters of the test plus any additional logic. The framework loads the test and sends it down a well-defined pipeline, which takes care of its execution. ReFrame can be easily set up on any cluster, and its straightforward invocation allows it to be easily integrated with common continuous integration/deployment (CI/CD) tools in order to perform continuous testing of an HPC system. Finally, its ability to feed the collected performance data to well-known log channels, such as Syslog, Graylog or simply parsable log files, also makes it a powerful tool for continuously monitoring the health of the system from the user's perspective.
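To make the idea of a test as a simple class that is sent down a fixed pipeline concrete, the following is a minimal sketch in plain Python. It does not use the ReFrame API: the names `RegressionTest`, `HelloTest`, `check_sanity` and `run_pipeline`, as well as the three stage names, are hypothetical and chosen only for illustration. The test class declares its parameters, and a separate function takes care of execution and emits a one-line, parsable record.

```python
import json
import subprocess
import sys


class RegressionTest:
    """Hypothetical base class (not ReFrame's API): a test only declares parameters."""
    name = 'unnamed'
    executable = None          # program to run
    executable_opts = ()       # command-line arguments for the program
    expected_output = ''       # text that must appear in stdout

    def check_sanity(self, stdout):
        # Additional test logic: decide whether the run produced a sane result.
        return self.expected_output in stdout


class HelloTest(RegressionTest):
    # The test author specifies what to run and what to expect, nothing else.
    name = 'hello'
    executable = sys.executable
    executable_opts = ('-c', 'print("Hello, World!")')
    expected_output = 'Hello, World!'


def run_pipeline(test):
    """Send a test through fixed stages (assumed here: setup, run, sanity)."""
    record = {'test': test.name, 'stages': []}

    # Stage 1 (setup): build the command from the test's declared parameters.
    cmd = [test.executable, *test.executable_opts]
    record['stages'].append('setup')

    # Stage 2 (run): execute the command and capture its output.
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    record['stages'].append('run')

    # Stage 3 (sanity): the test passes if it exited cleanly and looks sane.
    record['passed'] = proc.returncode == 0 and test.check_sanity(proc.stdout)
    record['stages'].append('sanity')
    return record


if __name__ == '__main__':
    # One JSON object per line is easy to parse or forward to a log channel.
    print(json.dumps(run_pipeline(HelloTest())))
```

In this sketch the test class contains no execution logic at all, which mirrors the separation between test logic and system details described above; how the actual framework structures its pipeline is covered in the paper itself.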
Acknowledgements
CSCS would like to thank the members of the User Engagement and Support and the HPC Operations units for their valuable feedback regarding the framework and their contributions in writing regression tests for the system.
This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Karakasis, V. et al. (2020). Enabling Continuous Testing of HPC Systems Using ReFrame. In: Juckeland, G., Chandrasekaran, S. (eds) Tools and Techniques for High Performance Computing. HUST 2019, SE-HER 2019, WIHPC 2019. Communications in Computer and Information Science, vol 1190. Springer, Cham. https://doi.org/10.1007/978-3-030-44728-1_3
DOI: https://doi.org/10.1007/978-3-030-44728-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-44727-4
Online ISBN: 978-3-030-44728-1
eBook Packages: Computer Science (R0)