Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats

Limas Sierra, Robert; Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo

doi:10.1007/978-3-031-70947-0_8

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 680)

Included in the following conference series:

IFIP/IEEE International Conference on Very Large Scale Integration - System on a Chip

Abstract

Modern Graphics Processing Units (GPUs) include on-chip hardware accelerators (Tensor Core Units, or TCUs) to increase the performance of machine learning applications. Unfortunately, cutting-edge semiconductor technologies are increasingly prone to faults that affect devices during their operation. Moreover, the execution of safety-critical and High-Performance Computing (HPC) applications on GPUs strongly stresses crucial resources, such as TCUs, which increases the likelihood of different kinds of failures. Thus, the resilience analysis of GPUs and their critical units (TCUs) is vital in safety-critical domains, e.g., automotive, space, and autonomous robotics, to develop effective countermeasures or improve designs. Recently, new arithmetic formats have been proposed that are particularly suited to neural network processing. However, an effective reliability characterization of TCUs supporting different arithmetic formats is still missing.

In this work, we propose a hierarchical multi-level strategy to assess the impact of permanent faults arising in TCUs inside GPUs when using two number formats, i.e., Floating Point (FP) and Posit. The proposed strategy combines a fine-grain micro-architectural characterization of hardware faults in TCUs with a higher-level structural evaluation to observe the interactions with other GPU structures and the error propagation effects. The micro-architectural characterization resorts to two representative descriptions of the main components in TCUs (Dot-Product Units), one for each format (FP and Posit). Then, the fine-grain findings feed a structural TCU model (PyOpenTCU) to propagate and observe the principal error effects. The experimental results show the advantages in performance and accuracy of using such a multi-level approach for the reliability assessment of large hardware accelerators, such as TCUs, and identify a relation between the corrupted spatial areas in the output matrices and the TCU’s scheduling policies. Finally, the results demonstrate that Posit formats are less affected by faults than Floating Point formats by several orders of magnitude.
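As a rough illustration of what a permanent fault in a Dot-Product Unit means at the value level, the following minimal Python sketch forces one bit of an IEEE 754 binary32 accumulator to a fixed value (a stuck-at fault) on every accumulation step. It is not the authors' method and does not use PyOpenTCU; the names `stuck_at`, `dot_product`, and the `fault` parameter are hypothetical, only the FP format is modeled, and the fault location (the accumulator output) is an assumption made for simplicity.

```python
import struct


def float_to_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b: int) -> float:
    """Interpret a 32-bit unsigned integer as an IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def stuck_at(x: float, bit: int, value: int) -> float:
    """Force one bit (0 = LSB, 31 = sign) of x's binary32 encoding to value."""
    bits = float_to_bits(x)
    mask = 1 << bit
    bits = (bits | mask) if value else (bits & ~mask)
    return bits_to_float(bits)


def dot_product(a, b, fault=None):
    """Accumulate a.b in binary32; fault is a hypothetical (bit, value) pair.

    Assumption: the fault sits on the accumulator output, so it is applied
    after every multiply-accumulate step, as a permanent fault would be.
    """
    acc = 0.0
    for x, y in zip(a, b):
        # Round the running sum to binary32 to mimic a 32-bit data path.
        acc = bits_to_float(float_to_bits(acc + x * y))
        if fault is not None:
            acc = stuck_at(acc, *fault)
    return acc


if __name__ == "__main__":
    a, b = [1.0, 2.0], [3.0, 4.0]
    print("fault-free:       ", dot_product(a, b))
    print("sign stuck-at-1:  ", dot_product(a, b, fault=(31, 1)))
    print("exp MSB stuck-at-1:", dot_product(a, b, fault=(30, 1)))
```

Comparing the faulty results with the fault-free one shows why the bit position matters: a fault in the sign or exponent field changes the result far more than one in the low mantissa bits. A Posit counterpart would need its own encoder and decoder, which this sketch deliberately leaves out.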

Acknowledgements

This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data, and Quantum Computing.

Author information

Authors and Affiliations

Department of Control and Computer Engineering (DAUIN), Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129, Turin, TO, Italy
Robert Limas Sierra, Juan-David Guerrero-Balaguera, Josie E. Rodriguez Condia & Matteo Sonza Reorda

Corresponding author

Correspondence to Robert Limas Sierra.

Editor information

Editors and Affiliations

Khalifa University, Abu Dhabi, United Arab Emirates
Ibrahim (Abe) M. Elfadel
American University of Sharjah, Sharjah, United Arab Emirates
Lutfi Albasha

Cite this paper

Limas Sierra, R., Guerrero-Balaguera, J.-D., Rodriguez Condia, J.E., Sonza Reorda, M. (2024). Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats. In: Elfadel, I.M., Albasha, L. (eds) VLSI-SoC 2023: Innovations for Trustworthy Artificial Intelligence. VLSI-SoC 2023. IFIP Advances in Information and Communication Technology, vol 680. Springer, Cham. https://doi.org/10.1007/978-3-031-70947-0_8

DOI: https://doi.org/10.1007/978-3-031-70947-0_8
Published: 29 December 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70946-3
Online ISBN: 978-3-031-70947-0
eBook Packages: Computer Science, Computer Science (R0)
