
A Multi-level Parallel Integer/Floating-Point Arithmetic Architecture for Deep Learning Instructions

  • Conference paper
Euro-Par 2023: Parallel Processing (Euro-Par 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14100)


Abstract

The extensive instruction-set support for deep learning (DL) significantly enhances the performance of general-purpose architectures by exploiting data-level parallelism. However, it is challenging to design arithmetic units capable of performing parallel operations on a wide range of formats, as required to execute DL instructions (DLIs) efficiently. This paper presents a multi-level parallel arithmetic architecture that supports intra- and inter-operation parallelism for integer and a wide range of floating-point (FP) formats. For intra-operation parallelism, the proposed architecture supports multi-term dot products for integer, half-precision, and BrainFloat16 formats using mixed-precision methods. For inter-operation parallelism, dual-path execution allows an integer dot product and a single-precision (SP) addition to be performed in parallel. Moreover, the architecture supports the fused multiply-add (FMA) operations commonly used in general-purpose architectures. The proposed architecture adheres strictly to the computing requirements of DLIs and implements them efficiently. On benchmarked DNN inference applications that require both integer and FP formats, the proposed architecture improves performance by up to 15.7% compared with a single-path implementation. Furthermore, compared with state-of-the-art designs, it achieves higher energy efficiency and implements DLIs more efficiently.
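To make the two parallelism levels concrete, the sketch below is a minimal behavioral model in C of the kinds of operations the abstract describes: a mixed-precision BF16 dot product accumulated into FP32 (intra-operation parallelism) and an INT8 dot product with INT32 accumulation that, in the proposed unit, could issue alongside an independent FP32 addition (inter-operation, dual-path). All names (bf16_dot4_acc, int8_dot4_acc), the four-term width, and the scalar-loop formulation are illustrative assumptions rather than the paper's interface; a fused hardware dot-product unit would typically keep the products exact and round once at accumulation, which this software model only approximates.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint16_t bf16_t;  /* BrainFloat16: the upper 16 bits of an IEEE 754 binary32 */

/* Widen BF16 to FP32 by placing its bits in the high half of a binary32 pattern. */
static float bf16_to_f32(bf16_t x) {
    uint32_t bits = (uint32_t)x << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Truncating FP32 -> BF16 conversion, used here only to build test inputs. */
static bf16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bf16_t)(bits >> 16);
}

/* Intra-operation parallelism: four-term BF16 dot product with FP32 accumulation
   (mixed precision: narrow multiplicands, wide accumulator). */
static float bf16_dot4_acc(const bf16_t a[4], const bf16_t b[4], float acc) {
    for (int i = 0; i < 4; ++i)
        acc += bf16_to_f32(a[i]) * bf16_to_f32(b[i]);
    return acc;
}

/* The integer side of the dual path: four-term INT8 dot product accumulated into INT32.
   In the proposed unit this could execute in parallel with an independent FP32 addition. */
static int32_t int8_dot4_acc(const int8_t a[4], const int8_t b[4], int32_t acc) {
    for (int i = 0; i < 4; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}

int main(void) {
    bf16_t a[4], b[4];
    for (int i = 0; i < 4; ++i) {
        a[i] = f32_to_bf16(1.0f + (float)i);  /* 1, 2, 3, 4 are exact in BF16 */
        b[i] = f32_to_bf16(0.5f);
    }
    int8_t x[4] = {1, -2, 3, -4}, y[4] = {5, 6, 7, 8};

    printf("bf16 dot4 + acc: %f\n", bf16_dot4_acc(a, b, 10.0f));  /* 15.000000 */
    printf("int8 dot4 + acc: %d\n", int8_dot4_acc(x, y, 100));    /* 82 */
    return 0;
}
```

The split into two routines mirrors the dual-path idea only at the software level; in the hardware described by the abstract, the parallelism would come from issuing both operations to separate datapaths in the same cycle.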



Acknowledgment

This work is supported in part by NSFC (No. 62272475, 62090023) and NSFHN (No. 2022JJ10064).

Author information


Corresponding author

Correspondence to Libo Huang.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tan, H. et al. (2023). A Multi-level Parallel Integer/Floating-Point Arithmetic Architecture for Deep Learning Instructions. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-39698-4_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39697-7

  • Online ISBN: 978-3-031-39698-4

  • eBook Packages: Computer Science, Computer Science (R0)
