Functional Testing with STLs: A Step Towards Reliable RISC-V-based HPC Commodity Clusters

Condia, Josie E. Rodriguez; Deligiannis, Nikolaos I.; Sini, Jacopo; Cantoro, Riccardo; Reorda, Matteo Sonza

doi:10.1007/978-3-031-40843-4_33

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13999)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

The reliability of High-Performance Computing (HPC) systems is an essential concern due to their massive size and the complexity of their operation. Functional tests have therefore been used extensively to monitor HPC systems; they rely on software routines to verify the operation of the software stack and focus mainly on high-level abstraction features. However, the miniaturization of transistor technologies and the increase in computational resources (needed to meet the performance and computation demands of exascale HPC systems) impose new reliability challenges that call for testing strategies that take the underlying hardware characteristics into account. Resorting to open-hardware architectures (such as RISC-V-based platforms) in the HPC domain offers a unique opportunity to combine traditional HPC functional testing techniques with effective fine-grain hardware testing solutions, such as those based on the Software-Based Self-Test (SBST) strategy.

This work proposes the SBST strategy as an enhanced and complementary technique for functional testing of RISC-V platforms for HPC systems. The method provides fine-grain evaluations of the CPU cores, including quantitative information on their state and on the presence of faults. The experiments use two RISC-V cores (RI5CY and ibex) to develop the SBST routines and verify the effectiveness of the strategy. In total, we developed 11 STLs (SBST routines), showing that a considerable percentage of hardware faults (from about 82% up to 90%) can be detected with minimal overhead. This allows the STLs to be used during idle time intervals or in combination with other in-field functional testing approaches for HPC clusters.
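The general idea of an SBST routine can be illustrated with a minimal sketch: apply test patterns to a unit, compact the results into a signature, and compare it with a fault-free (golden) signature. The abstract does not describe the authors' routines in this detail, so everything below is an assumption made for illustration: the names (`alu_add`, `fold_signature`, `fault_coverage`), the rotate-and-XOR compaction, and the simplified fault model (one output bit stuck at 0 or 1) are hypothetical, and a Python model stands in for the hardware unit. Actual STLs would run as software on the core itself, and the reported coverage figures would come from fault simulation of the real design.

```python
from typing import Callable, List, Tuple

MASK32 = 0xFFFFFFFF
Unit = Callable[[int, int], int]
Pattern = Tuple[int, int]


def alu_add(a: int, b: int) -> int:
    """Software model of a 32-bit adder (hypothetical stand-in for a hardware unit)."""
    return (a + b) & MASK32


def fold_signature(signature: int, value: int) -> int:
    """Compact one result into a running 32-bit signature: rotate left by 1, then XOR."""
    rotated = ((signature << 1) | (signature >> 31)) & MASK32
    return rotated ^ (value & MASK32)


def run_test_routine(unit: Unit, patterns: List[Pattern]) -> int:
    """Apply every test pattern to the unit and return the final signature."""
    signature = 0
    for a, b in patterns:
        signature = fold_signature(signature, unit(a, b))
    return signature


def make_stuck_at_unit(unit: Unit, bit: int, value: int) -> Unit:
    """Wrap a unit so that one output bit is stuck at 0 or 1 (simplified fault model)."""
    def faulty(a: int, b: int) -> int:
        out = unit(a, b)
        if value:
            return out | (1 << bit)
        return out & ~(1 << bit) & MASK32
    return faulty


def fault_coverage(unit: Unit, patterns: List[Pattern]) -> float:
    """Fraction of modeled faults whose signature differs from the golden one."""
    golden = run_test_routine(unit, patterns)
    faults = [(bit, value) for bit in range(32) for value in (0, 1)]
    detected = sum(
        1
        for bit, value in faults
        if run_test_routine(make_stuck_at_unit(unit, bit, value), patterns) != golden
    )
    return detected / len(faults)


if __name__ == "__main__":
    weak = [(0x00000000, 0x00000000)]
    better = weak + [(0xFFFFFFFF, 0x00000000)]
    print(f"coverage with 1 pattern:  {fault_coverage(alu_add, weak):.2f}")
    print(f"coverage with 2 patterns: {fault_coverage(alu_add, better):.2f}")
```

In this toy model a single all-zero pattern can only expose outputs stuck at 1, so adding a pattern that drives every output bit to 1 raises the coverage. The same reasoning, applied to a real core and a real fault list, is what makes pattern selection the central design choice in an SBST routine.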

This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data and Quantum Computing.

Author information

Authors and Affiliations

Department of Control and Computer Engineering (DAUIN), Politecnico di Torino, Turin, Italy
Josie E. Rodriguez Condia, Nikolaos I. Deligiannis, Jacopo Sini, Riccardo Cantoro & Matteo Sonza Reorda

Corresponding author

Correspondence to Josie E. Rodriguez Condia.

Editor information

Editors and Affiliations

University of New Mexico, Albuquerque, NM, USA
Amanda Bienz
University of Edinburgh, Edinburgh, UK
Michèle Weiland
Université Paris-Saclay, Gif sur Yvette, France
Marc Baboulin
CERFACS, Toulouse, France
Carola Kruse

Cite this paper

Condia, J.E.R., Deligiannis, N.I., Sini, J., Cantoro, R., Reorda, M.S. (2023). Functional Testing with STLs: A Step Towards Reliable RISC-V-based HPC Commodity Clusters. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_33

DOI: https://doi.org/10.1007/978-3-031-40843-4_33
Published: 25 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science, Computer Science (R0)
