Abstract
The extensive instruction set for deep learning (DL) significantly enhances the performance of general-purpose architectures by exploiting data-level parallelism. However, designing arithmetic units that can perform parallel operations across a wide range of formats, as required to execute DL instructions (DLIs) efficiently, is challenging. This paper presents a multi-level parallel arithmetic architecture that supports intra- and inter-operation parallelism for integer and a wide range of floating-point (FP) formats. For intra-operation parallelism, the proposed architecture supports multi-term dot products for integer, half-precision, and BrainFloat16 formats using mixed-precision methods. For inter-operation parallelism, dual-path execution allows an integer dot product and a single-precision (SP) addition to proceed in parallel. Moreover, the architecture supports the fused multiply-add (FMA) operations commonly used in general-purpose architectures. The proposed architecture strictly adheres to the computing requirements of DLIs and implements them efficiently. On benchmarked DNN inference applications that require both integer and FP formats, the proposed architecture improves performance by up to 15.7% compared with a single-path implementation. Furthermore, compared with state-of-the-art designs, it achieves higher energy efficiency and implements DLIs more efficiently.
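As a concrete illustration of the mixed-precision dot-product semantics described above, the following minimal C sketch models a 4-term BrainFloat16 dot product accumulated into a single-precision (FP32) result. This is a behavioral sketch under stated assumptions, not the paper's hardware design: the helper names (bf16_t, bf16_to_f32, bf16_dot4_acc), the 4-term width, and the per-term rounding via fmaf are illustrative choices; a hardware unit may instead sum all partial products with a single rounding.

/* Behavioral sketch of a 4-term BF16 dot product with FP32 accumulation.
 * BF16 is simply the upper 16 bits of an IEEE-754 single-precision value. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <math.h>

typedef uint16_t bf16_t;

static float bf16_to_f32(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;  /* widen by restoring low mantissa bits as zero */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

static bf16_t f32_to_bf16(float f) {    /* truncating conversion, for brevity */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bf16_t)(bits >> 16);
}

/* c += sum(a[i] * b[i]) over 4 terms; each term is fused into the FP32
 * accumulator via fmaf, i.e., one rounding per term. */
static float bf16_dot4_acc(const bf16_t a[4], const bf16_t b[4], float c) {
    for (int i = 0; i < 4; ++i)
        c = fmaf(bf16_to_f32(a[i]), bf16_to_f32(b[i]), c);
    return c;
}

int main(void) {
    bf16_t a[4], b[4];
    for (int i = 0; i < 4; ++i) {
        a[i] = f32_to_bf16((float)(i + 1));          /* 1, 2, 3, 4   */
        b[i] = f32_to_bf16(0.5f * (float)(i + 1));   /* 0.5, 1, 1.5, 2 */
    }
    printf("dot4 = %f\n", bf16_dot4_acc(a, b, 0.0f));
    return 0;
}

Compiled with cc dot4.c -lm, this prints dot4 = 15.000000; every input above is exactly representable in BF16, so the accumulation is exact.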
Acknowledgment
This work is supported in part by NSFC (No. 62272475, 62090023) and NSFHN (No. 2022JJ10064).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Tan, H. et al. (2023). A Multi-level Parallel Integer/Floating-Point Arithmetic Architecture for Deep Learning Instructions. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_18
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4