Abstract
The reliability of High-Performance Computing (HPC) systems is an essential concern because of their massive size and operational complexity. Functional tests have therefore been used extensively to monitor HPC systems; they rely on software routines to verify the operation of the software stack, focusing mainly on high-level abstraction features. However, the miniaturization of transistor technologies and the increase in computational resources (needed to meet the performance and computation requirements of exascale-generation HPC systems) pose new reliability challenges that call for testing strategies that take the underlying hardware characteristics into account. Interestingly, the adoption of open-hardware architectures (such as RISC-V-based platforms) in the HPC domain offers a unique opportunity to combine traditional HPC functional testing techniques with effective fine-grain hardware testing solutions, such as those based on the Software-Based Self-Test (SBST) strategy.
This work proposes the SBST strategy as an enhanced and complementary technique for the functional testing of RISC-V platforms for HPC systems. The method provides fine-grain evaluations of the CPU cores, including quantitative information on their state and on the presence of faults. For the experiments, we use two RISC-V cores (RI5CY and ibex) to develop the SBST routines and verify the effectiveness of the strategy. In total, we developed 11 STLs (SBST routines), showing that a considerable percentage of hardware faults (from about 82% up to 90%) can be detected with minimal overhead, which allows their use during idle time intervals or in combination with other in-field functional testing approaches for HPC clusters.
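As a rough illustration of the idea behind an SBST routine (this is not one of the 11 STLs described above), the following Python sketch applies a fixed set of operand patterns to a modelled 32-bit adder, compacts the results into a signature, and compares it with a fault-free "golden" signature. Everything here is an assumption made for illustration: the `adder` model, the `PATTERNS` list, the rotate-and-XOR compaction, and the output-level stuck-at fault model are hypothetical and far simpler than the gate-level fault simulation used to obtain real fault-coverage figures.

```python
MASK = 0xFFFFFFFF  # model a 32-bit datapath


def adder(a, b, stuck_at=None):
    """Toy 32-bit adder. stuck_at=(bit, value) forces one output bit (hypothetical fault model)."""
    result = (a + b) & MASK
    if stuck_at is not None:
        bit, value = stuck_at
        if value:
            result |= 1 << bit            # output bit stuck at 1
        else:
            result &= ~(1 << bit) & MASK  # output bit stuck at 0
    return result


# Illustrative operand pairs chosen so that every output bit takes both 0 and 1.
PATTERNS = [
    (0x00000000, 0x00000000),
    (0xFFFFFFFF, 0x00000000),
    (0xAAAAAAAA, 0x55555555),
    (0x55555555, 0x55555555),
    (0xFFFFFFFF, 0x00000001),
]


def run_test(stuck_at=None):
    """Apply all patterns and compact the results into one 32-bit signature."""
    signature = 0
    for a, b in PATTERNS:
        r = adder(a, b, stuck_at)
        rotated = ((signature << 1) | (signature >> 31)) & MASK  # rotate left by 1
        signature = rotated ^ r
    return signature


# Signature of the fault-free model, computed once and stored with the routine.
GOLDEN = run_test()


def fault_coverage():
    """Fraction of single stuck-at output faults whose signature differs from GOLDEN."""
    faults = [(bit, value) for bit in range(32) for value in (0, 1)]
    detected = sum(1 for fault in faults if run_test(fault) != GOLDEN)
    return detected / len(faults)


if __name__ == "__main__":
    print(f"golden signature: {GOLDEN:#010x}")
    print(f"fault coverage:   {fault_coverage():.0%}")
```

On real hardware the comparison against the golden signature would run on the core under test itself, typically as a short assembly routine, so a mismatch flags that core as suspect without external test equipment.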
This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data and Quantum Computing.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Condia, J.E.R., Deligiannis, N.I., Sini, J., Cantoro, R., Reorda, M.S. (2023). Functional Testing with STLs: A Step Towards Reliable RISC-V-based HPC Commodity Clusters. In: Bienz, A., Weiland, M., Baboulin, M., Kruse, C. (eds) High Performance Computing. ISC High Performance 2023. Lecture Notes in Computer Science, vol 13999. Springer, Cham. https://doi.org/10.1007/978-3-031-40843-4_33
DOI: https://doi.org/10.1007/978-3-031-40843-4_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40842-7
Online ISBN: 978-3-031-40843-4
eBook Packages: Computer Science, Computer Science (R0)