ABSTRACT
The BFloat16 (BF16) format has recently driven the development of deep learning, owing to its higher energy efficiency and lower memory consumption than traditional floating-point formats. This paper presents a scalable BF16 dot-product (DoP) architecture for high-performance deep-learning computing. A novel 4-term DoP unit, which performs a 4-term DoP operation in three cycles, is proposed as the fundamental module of the architecture. Larger DoP units are constructed by extending this fundamental unit: early exponent comparison is performed to hide latency, and intermediate normalization and rounding are omitted to improve accuracy and further reduce latency. Compared with a discrete design, the proposed architecture reduces latency by 22.8% for a 4-term DoP, and the latency reduction grows as the size of the DoP operation increases. Compared with existing BF16 designs, the proposed architecture at 64 terms achieves at least 1.88× better normalized energy efficiency and at least 20.3× higher throughput.
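Since the abstract only sketches the fused DoP flow, the following minimal behavioral model in Python may help to make it concrete. It is an illustration of the general technique, not the paper's hardware design: BF16 operands are decoded, products are formed exactly in integers, an early exponent comparison aligns all products to the shared maximum exponent, and normalization/rounding happens only once at the end. The names (`fused_dot4`, `bf16_from_float`, `bf16_decode`) are my own, and zero/subnormal inputs are flushed to zero for brevity, an assumption not stated in the abstract.

```python
import math
import struct

def bf16_from_float(x):
    """Truncate an FP32 value to a 16-bit BF16 pattern (keep the top 16 bits)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def bf16_decode(bits):
    """Split a BF16 pattern into sign, biased exponent, and fraction fields."""
    return (bits >> 15) & 0x1, (bits >> 7) & 0xFF, bits & 0x7F

def fused_dot4(a, b):
    """Behavioral model of a fused 4-term BF16 dot product.

    Products are kept exact in integers, aligned to the shared maximum
    exponent (the 'early exponent comparison'), summed, and then
    normalized/rounded only once at the end. Zeros and subnormal inputs
    are flushed to zero for brevity (an assumption, not from the paper).
    """
    terms = []
    for ai, bi in zip(a, b):
        sa, ea, fa = bf16_decode(ai)
        sb, eb, fb = bf16_decode(bi)
        # 8-bit significands with the implicit leading 1 (normals only).
        ma = 0 if ea == 0 else (1 << 7) | fa
        mb = 0 if eb == 0 else (1 << 7) | fb
        # Exact 16-bit product significand; biased exponents add.
        terms.append((sa ^ sb, ea + eb - 127, ma * mb))

    # Early exponent comparison: one max over the product exponents,
    # then right-align every product to it before accumulation.
    emax = max(e for _, e, _ in terms)
    acc = 0
    for sign, e, prod in terms:
        aligned = prod >> (emax - e)  # models hardware alignment truncation
        acc += -aligned if sign else aligned

    # Single normalization/rounding step at the very end:
    # each aligned-product unit is worth 2**(emax - 127 - 14).
    return math.ldexp(acc, emax - 141)

a = [bf16_from_float(x) for x in (1.5, -2.0, 0.25, 3.0)]
b = [bf16_from_float(x) for x in (0.5, 1.0, -4.0, 2.0)]
print(fused_dot4(a, b))  # 0.75 - 2.0 - 1.0 + 6.0 = 3.75
```

The model captures only the numerical behavior; an actual unit of this kind would accumulate the aligned products in carry-save form and pipeline the work across the three cycles described above.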