Functional Testing with STLs: A Step Towards Reliable RISC-V-based HPC Commodity Clusters

  • Conference paper
High Performance Computing (ISC High Performance 2023)

Abstract

The reliability of High-Performance Computing (HPC) systems is an essential concern because of their massive size and operational complexity. Functional tests have therefore been used extensively to monitor HPC systems: software routines verify the operation of the software stack, focusing mainly on high-level features. However, the miniaturization of transistor technologies and the increase in computational resources (needed to meet the performance and computation demands of exascale HPC systems) impose new reliability challenges, which call for testing strategies that take the underlying hardware characteristics into account. Adopting open-hardware architectures (such as RISC-V-based platforms) in the HPC domain offers a unique opportunity to combine traditional HPC functional testing techniques with effective fine-grain hardware testing solutions, such as those based on the Software-Based Self-Test (SBST) strategy.

This work proposes the SBST strategy as an enhanced and complementary technique for the functional testing of RISC-V platforms for HPC systems. The method provides fine-grain evaluations of the CPU cores, including quantitative information on their state and on the presence of faults. For the experiments, we use two RISC-V cores (RI5CY and ibex) to develop the SBST strategy and verify its effectiveness. In total, we developed 11 STLs (SBST routines), showing that a considerable percentage of hardware faults (from about 82% up to 90%) can be detected with minimal overhead. This allows the STLs to be used during idle time intervals or in combination with other in-field functional testing approaches for HPC clusters.
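As a rough illustration of the SBST idea summarized above, the following sketch applies test patterns to a unit, compacts the results into a signature, and compares that signature against a fault-free ("golden") one. It is a toy model under stated assumptions, not the authors' STLs: the 8-bit adder, the stuck-at faults on its output bits, the rotate-and-XOR compaction, and all names (`adder`, `run_stl`, `fault_coverage`) are hypothetical and chosen only for illustration. Real STLs run as instruction sequences on the core itself and are graded against faults in the hardware design.

```python
from typing import List, Optional, Tuple

WIDTH = 8
MASK = (1 << WIDTH) - 1

# Hypothetical fault model: (output bit index, stuck-at value 0 or 1).
Fault = Tuple[int, int]
Pattern = Tuple[int, int]


def adder(a: int, b: int, fault: Optional[Fault] = None) -> int:
    """Toy 8-bit adder; an optional stuck-at fault forces one output bit."""
    result = (a + b) & MASK
    if fault is not None:
        bit, value = fault
        if value:
            result |= 1 << bit            # stuck-at-1
        else:
            result &= ~(1 << bit) & MASK  # stuck-at-0
    return result


def run_stl(patterns: List[Pattern], fault: Optional[Fault] = None) -> int:
    """Apply every pattern and compact all results into one signature."""
    signature = 0
    for a, b in patterns:
        # Rotate the signature left by one bit, then XOR in the new result.
        signature = ((signature << 1) | (signature >> (WIDTH - 1))) & MASK
        signature ^= adder(a, b, fault)
    return signature


def fault_coverage(patterns: List[Pattern]) -> float:
    """Fraction of modeled faults whose signature differs from the golden one."""
    golden = run_stl(patterns)
    faults = [(bit, value) for bit in range(WIDTH) for value in (0, 1)]
    detected = sum(1 for f in faults if run_stl(patterns, f) != golden)
    return detected / len(faults)


if __name__ == "__main__":
    print(fault_coverage([(0x00, 0x00)]))                # stuck-at-1 faults only
    print(fault_coverage([(0x00, 0x00), (0xFF, 0x00)]))  # both polarities
```

In this toy model, a single all-zero pattern exposes only the stuck-at-1 faults, and adding an all-ones result exposes the stuck-at-0 faults as well. The same trade-off between routine length and the share of faults detected is what a fault-coverage figure quantifies.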

This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data and Quantum Computing.

Author information

Corresponding author

Correspondence to Josie E. Rodriguez Condia.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Condia, J.E.R., Deligiannis, N.I., Sini, J., Cantoro, R., Reorda, M.S. (2023). Functional Testing with STLs: A Step Towards Reliable RISC-V-based HPC Commodity Clusters. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_33

  • DOI: https://doi.org/10.1007/978-3-031-40843-4_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40842-7

  • Online ISBN: 978-3-031-40843-4

  • eBook Packages: Computer Science, Computer Science (R0)
