Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats

  • Conference paper
VLSI-SoC 2023: Innovations for Trustworthy Artificial Intelligence (VLSI-SoC 2023)

Abstract

Modern Graphics Processing Units (GPUs) include on-chip hardware accelerators (Tensor Core Units, or TCUs) to increase the performance of machine learning applications. Unfortunately, cutting-edge semiconductor technologies are increasingly prone to faults that affect devices during their operation. Moreover, the execution of safety-critical and High-Performance Computing (HPC) applications on GPUs heavily stresses crucial resources, such as TCUs, which increases the likelihood of different kinds of failures. Thus, the resilience analysis of GPUs and their critical units (TCUs) is vital in safety-critical domains, e.g., automotive, space, and autonomous robotics, to develop effective countermeasures or improve designs. Recently, new arithmetic formats have been proposed that are particularly suited to neural network processing. However, an effective reliability characterization of TCUs supporting different arithmetic formats is still missing.

In this work, we propose a hierarchical multi-level strategy to assess the reliability impact of permanent faults arising in TCUs inside GPUs when using two number formats, i.e., Floating Point (FP) and Posit. The proposed strategy combines a fine-grain micro-architectural characterization of hardware faults in TCUs with a higher-level structural evaluation to observe the interactions with other GPU structures and the error propagation effects. The micro-architectural characterization relies on two representative descriptions of the main components in TCUs (Dot-Product Units), one for each format (FP and Posit). Then, the fine-grain findings feed a structural TCU model (PyOpenTCU) to propagate and observe the principal error effects. The experimental results show the performance and accuracy advantages of hierarchical methods for the reliability assessment of large hardware accelerators, such as TCUs, and identify a relation between the corrupted spatial areas in the output matrices and the TCU’s scheduling policies. Finally, the results demonstrate that Posit formats are less affected by faults than Floating Point formats by several orders of magnitude.
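
As a rough illustration of the link between scheduling and the corrupted output area, the following Python sketch forces one bit of the FP32 result produced by a single Dot-Product Unit and lists the output elements that differ from the fault-free result. It is a toy model under stated assumptions and not the authors' flow: the round-robin mapping of output elements to units, the function names, and the choice of applying the fault to the unit's output word (rather than to an internal gate) are all hypothetical, and the Posit format is not modeled.

```python
import struct

def f32_to_bits(x):
    """Return the 32-bit IEEE 754 single-precision encoding of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b):
    """Decode a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def stuck_at(value, bit, stuck_value):
    """Force one bit of the FP32 encoding of value to 0 or 1."""
    bits = f32_to_bits(value)
    if stuck_value:
        bits |= 1 << bit
    else:
        bits &= ~(1 << bit) & 0xFFFFFFFF
    return bits_to_f32(bits)

def dot(row, col):
    """Plain dot product, standing in for one Dot-Product Unit."""
    return sum(x * y for x, y in zip(row, col))

def matmul(A, B, num_dpus=4, faulty_dpu=None, bit=0, stuck_value=1):
    """Multiply A by B; outputs computed on faulty_dpu get a stuck bit.

    Assumption: output element (i, j) is assigned to a unit in
    round-robin order. Real TCU scheduling policies may differ.
    """
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            dpu = (i * cols + j) % num_dpus  # hypothetical scheduling
            val = dot(A[i], [B[k][j] for k in range(inner)])
            if dpu == faulty_dpu:
                val = stuck_at(val, bit, stuck_value)
            C[i][j] = val
    return C

def corrupted_positions(golden, faulty):
    """List the (row, column) positions where the two matrices differ."""
    return [(i, j)
            for i, row in enumerate(golden)
            for j, g in enumerate(row)
            if g != faulty[i][j]]

# 4x4 inputs chosen so that every fault-free output equals 0.5.
A = [[0.25] * 4 for _ in range(4)]
B = [[0.5] * 4 for _ in range(4)]
golden = matmul(A, B)
# Stuck-at-1 on bit 30 (most significant exponent bit) of unit 1.
faulty = matmul(A, B, faulty_dpu=1, bit=30, stuck_value=1)
print(corrupted_positions(golden, faulty))
```

With this particular mapping, unit 1 computes every element of column 1, so the corrupted area is a full column; a different assignment of outputs to units would move or reshape that area.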

Acknowledgements

This work has been supported by the National Resilience and Recovery Plan (PNRR) through the National Center for HPC, Big Data, and Quantum Computing.

Corresponding author

Correspondence to Robert Limas Sierra.

Copyright information

© 2024 IFIP International Federation for Information Processing

About this paper

Cite this paper

Limas Sierra, R., Guerrero-Balaguera, J.-D., Rodriguez Condia, J.E., Sonza Reorda, M. (2024). Analyzing the Reliability of TCUs Through Micro-architecture and Structural Evaluations for Two Real Number Formats. In: Elfadel, I.M., Albasha, L. (eds) VLSI-SoC 2023: Innovations for Trustworthy Artificial Intelligence. VLSI-SoC 2023. IFIP Advances in Information and Communication Technology, vol 680. Springer, Cham. https://doi.org/10.1007/978-3-031-70947-0_8

  • DOI: https://doi.org/10.1007/978-3-031-70947-0_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70946-3

  • Online ISBN: 978-3-031-70947-0

  • eBook Packages: Computer Science (R0)
