Abstract
The extensive instruction set for deep learning (DL) significantly enhances the performance of general-purpose architectures by exploiting data-level parallelism. However, designing arithmetic units that can perform parallel operations across a wide range of formats, as required to execute DL instructions (DLIs) efficiently, is challenging. This paper presents a multi-level parallel arithmetic architecture that supports intra- and inter-operation parallelism for integer and a wide range of floating-point (FP) formats. For intra-operation parallelism, the proposed architecture supports multi-term dot products for integer, half-precision, and BrainFloat16 formats using mixed-precision methods. For inter-operation parallelism, dual-path execution allows an integer dot product and a single-precision (SP) addition to proceed in parallel. Moreover, the architecture supports the fused multiply-add (FMA) operations commonly used in general-purpose architectures. The proposed architecture strictly adheres to the computing requirements of DLIs and implements them efficiently. On benchmarked DNN inference applications that require both integer and FP formats, the proposed architecture improves performance by up to 15.7% compared with a single-path implementation. Furthermore, compared with state-of-the-art designs, it achieves higher energy efficiency and implements DLIs more efficiently.
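As a concrete illustration of the mixed-precision dot-product semantics described above, the following minimal C sketch models a 4-term BrainFloat16 dot product accumulated into a single-precision (FP32) result. This is a behavioral sketch under stated assumptions, not the paper's hardware design: the helper names (bf16_t, bf16_to_f32, bf16_dot4_acc), the 4-term width, and the per-term rounding via fmaf are illustrative choices; a hardware unit may instead sum all partial products with a single rounding.

/* Behavioral sketch of a 4-term BF16 dot product with FP32 accumulation.
 * BF16 is simply the upper 16 bits of an IEEE-754 single-precision value. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <math.h>

typedef uint16_t bf16_t;

static float bf16_to_f32(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;  /* widen by restoring low mantissa bits as zero */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static bf16_t f32_to_bf16(float f) {    /* truncating conversion, for brevity */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bf16_t)(bits >> 16);
}

/* c += sum(a[i] * b[i]) over 4 terms; each term is fused into the FP32
 * accumulator via fmaf, i.e., one rounding per term. */
static float bf16_dot4_acc(const bf16_t a[4], const bf16_t b[4], float c) {
    for (int i = 0; i < 4; ++i)
        c = fmaf(bf16_to_f32(a[i]), bf16_to_f32(b[i]), c);
    return c;
}

int main(void) {
    bf16_t a[4], b[4];
    for (int i = 0; i < 4; ++i) {
        a[i] = f32_to_bf16((float)(i + 1));          /* 1, 2, 3, 4   */
        b[i] = f32_to_bf16(0.5f * (float)(i + 1));   /* 0.5, 1, 1.5, 2 */
    }
    printf("dot4 = %f\n", bf16_dot4_acc(a, b, 0.0f));
    return 0;
}

Compiled with cc dot4.c -lm, this prints dot4 = 15.000000; every input above is exactly representable in BF16, so the accumulation is exact.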
Acknowledgment
This work is supported in part by NSFC (No. 62272475, 62090023) and NSFHN (No. 2022JJ10064).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Tan, H. et al. (2023). A Multi-level Parallel Integer/Floating-Point Arithmetic Architecture for Deep Learning Instructions. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_18
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4