Abstract
Modern Graphics Processing Units (GPUs) include in-chip hardware accelerators (Tensor Core Units, or TCUs) to increase the performance of machine learning applications. Unfortunately, cutting-edge semiconductor technologies are increasingly prone to faults that affect devices during their operation. Moreover, the execution of safety-critical and High-Performance Computing (HPC) applications on GPUs strongly stresses crucial resources, such as TCUs, which increases the likelihood of different kinds of failures. Thus, the resilience analysis of GPUs and their critical units (TCUs) is vital in safety-critical domains, e.g., automotive, space, and autonomous robotics, to develop effective countermeasures or improve designs. Recently, new arithmetic formats have been proposed that are particularly suited to neural network processing. However, an effective reliability characterization of TCUs supporting different arithmetic formats is still missing.
In this work, we propose a hierarchical multi-level strategy to assess the reliability impact of permanent faults arising in TCUs inside GPUs when using two number formats, i.e., Floating Point (FP) and Posit. The proposed strategy combines a fine-grain micro-architectural characterization of hardware faults in TCUs with a higher-level structural evaluation to observe the interactions with other GPU structures and the error propagation effects. The micro-architectural characterization relies on two representative descriptions of the main components in TCUs (Dot-Product Units) for both formats (FP and Posit). The fine-grain findings then feed a structural TCU model (PyOpenTCU) to propagate and observe the principal error effects. The experimental results show the advantages in performance and accuracy of using multi-level methods for the reliability assessment of large hardware accelerators, such as TCUs, and identify a relation between the corrupted spatial areas in the output matrices and the TCU’s scheduling policies. Finally, the results demonstrate that Posit formats are less affected by faults than Floating Point formats by several orders of magnitude.
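The link between a permanent fault in one Dot-Product Unit (DPU) and the corrupted region of the output matrix can be illustrated with a small sketch. This is not PyOpenTCU and not the authors' fault model: the function names, the round-robin assignment of output elements to DPUs, the single stuck-at bit applied to the FP16 result, and the square input matrices are all assumptions made for illustration only.

```python
import struct


def fp16_bits(x: float) -> int:
    """Round x to IEEE 754 half precision and return its 16-bit encoding."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def fp16_value(bits: int) -> float:
    """Decode a 16-bit half-precision encoding back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def stuck_at(x: float, bit: int, stuck: int) -> float:
    """Force one bit of the FP16 encoding of x to 0 or 1 (stuck-at fault)."""
    bits = fp16_bits(x)
    bits = bits | (1 << bit) if stuck else bits & ~(1 << bit)
    return fp16_value(bits)


def matmul_with_fault(a, b, n_dpus=4, faulty_dpu=None, bit=13, stuck=1):
    """Multiply square matrices a and b, corrupting results from one DPU.

    Hypothetical schedule: output element (i, j) is assigned to
    DPU (i * n + j) % n_dpus. Products are accumulated in double
    precision and only the final result is rounded to FP16.
    """
    n = len(a)
    cols = list(zip(*b))
    out = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = sum(x * y for x, y in zip(a[i], cols[j]))
            dpu = (i * n + j) % n_dpus
            if dpu == faulty_dpu:
                row.append(stuck_at(acc, bit, stuck))
            else:
                row.append(fp16_value(fp16_bits(acc)))
        out.append(row)
    return out


def corrupted_cells(golden, faulty):
    """Return the (row, column) positions where the two outputs differ."""
    return [(i, j)
            for i, row in enumerate(golden)
            for j, g in enumerate(row)
            if faulty[i][j] != g]


a = [[1.0] * 4 for _ in range(4)]
b = [[0.5] * 4 for _ in range(4)]
golden = matmul_with_fault(a, b)                      # fault-free reference
faulty = matmul_with_fault(a, b, faulty_dpu=2, bit=13, stuck=1)
print(corrupted_cells(golden, faulty))  # [(0, 2), (1, 2), (2, 2), (3, 2)]
print(faulty[0][2])                     # 512.0 instead of 2.0
```

With four DPUs and a 4x4 output, the assumed schedule maps each DPU to one column, so the fault in DPU 2 corrupts exactly column 2. A different assignment policy would move the corrupted cells elsewhere, which is the kind of relation between scheduling and spatial error patterns the abstract refers to. The example also shows that the effect depends on the bit: a stuck-at-1 on exponent bit 13 turns 2.0 into 512.0, while the same fault on bit 14 is masked because that bit is already 1 in the encoding of 2.0.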
Acknowledgements
This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data, and Quantum Computing.
Copyright information
© 2024 IFIP International Federation for Information Processing
About this paper
Cite this paper
Limas Sierra, R., Guerrero-Balaguera, JD., Rodriguez Condia, J.E., Sonza Reorda, M. (2024). Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats. In: Elfadel, I.(.M., Albasha, L. (eds) VLSI-SoC 2023: Innovations for Trustworthy Artificial Intelligence. VLSI-SoC 2023. IFIP Advances in Information and Communication Technology, vol 680. Springer, Cham. https://doi.org/10.1007/978-3-031-70947-0_8
DOI: https://doi.org/10.1007/978-3-031-70947-0_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70946-3
Online ISBN: 978-3-031-70947-0
eBook Packages: Computer Science, Computer Science (R0)