Enabling Continuous Testing of HPC Systems Using ReFrame

  • Conference paper
Tools and Techniques for High Performance Computing (HUST 2019, SE-HER 2019, WIHPC 2019)

Abstract

Regression testing of HPC systems is of crucial importance when it comes to ensuring the quality of service offered to end users. At the same time, it poses a great challenge to systems and application engineers, who must continuously maintain regression tests that cover as many aspects of the user experience as possible. In this paper, we briefly present ReFrame, a framework for writing regression tests for HPC systems, and show how it is used by CSCS, NERSC and OSC to continuously test their systems. ReFrame is designed to abstract away the complexity of interacting with the system and to separate the logic of a regression test from the low-level details that pertain to the system configuration and setup. Regression tests in ReFrame are simple Python classes that specify the basic parameters of the test plus any additional logic. The framework loads each test and sends it down a well-defined pipeline that takes care of its execution. ReFrame can be easily set up on any cluster, and its straightforward invocation allows it to be integrated with common continuous integration/deployment (CI/CD) tools in order to perform continuous testing of an HPC system. Finally, its ability to feed the collected performance data to well-known log channels, such as Syslog, Graylog or simply parsable log files, also makes it a powerful tool for continuously monitoring the health of the system from the user's perspective.
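To make the abstract's claim concrete, the sketch below shows what a regression test written as a "simple Python class" can look like, in the style of ReFrame's 2019-era API. It is not taken from the paper: the source file, the output patterns and the reference figures are hypothetical placeholders, and attribute conventions may differ across ReFrame versions.

    import reframe as rfm
    import reframe.utility.sanity as sn


    @rfm.simple_test
    class StreamBandwidthTest(rfm.RegressionTest):
        """Hypothetical check: build and run a STREAM-like kernel,
        then validate its output and extract a bandwidth figure."""

        def __init__(self):
            self.valid_systems = ['*']        # placeholder: any configured system
            self.valid_prog_environs = ['*']  # placeholder: any programming environment
            self.sourcepath = 'stream.c'      # hypothetical source, compiled by the framework
            # Sanity check: the run must have printed its validation line.
            self.sanity_patterns = sn.assert_found(r'Solution Validates', self.stdout)
            # Performance check: extract the Triad bandwidth from stdout...
            self.perf_patterns = {
                'triad': sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)
            }
            # ...and compare it against per-system references (values made up here).
            self.reference = {
                '*': {'triad': (50000, -0.05, None, 'MB/s')}
            }

Invoking the framework with something like "reframe -c stream_test.py -r" sends this test down the execution pipeline mentioned above (setup, compilation, job submission, sanity and performance checking), and the extracted "triad" value is the kind of datum that ReFrame's logging facility can forward to Syslog, Graylog or a parsable log file.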


Acknowledgements

CSCS would like to thank the members of the User Engagement and Support and the HPC Operations units for their valuable feedback regarding the framework and their contributions in writing regression tests for the system.

This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility operated under Contract No. DE-AC02-05CH11231.

Author information


Corresponding author

Correspondence to Vasileios Karakasis.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Karakasis, V., et al. (2020). Enabling Continuous Testing of HPC Systems Using ReFrame. In: Juckeland, G., Chandrasekaran, S. (eds) Tools and Techniques for High Performance Computing. HUST, SE-HER, WIHPC 2019. Communications in Computer and Information Science, vol 1190. Springer, Cham. https://doi.org/10.1007/978-3-030-44728-1_3


  • DOI: https://doi.org/10.1007/978-3-030-44728-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-44727-4

  • Online ISBN: 978-3-030-44728-1

  • eBook Packages: Computer Science, Computer Science (R0)
