ABSTRACT
The BFloat16 (BF16) format has recently driven the development of deep learning, owing to its higher energy efficiency and lower memory consumption than traditional floating-point formats. This paper presents a scalable BF16 dot-product (DoP) architecture for high-performance deep-learning computing. A novel 4-term DoP unit, which performs a 4-term DoP operation in three cycles, is proposed as the fundamental module of the architecture. Larger DoP units are constructed by extending this fundamental unit: early exponent comparison is performed to hide latency, and intermediate normalization and rounding are omitted to improve accuracy and further reduce latency. Compared with a discrete design, the proposed architecture reduces latency by 22.8% for a 4-term DoP, and the latency reduction grows as the size of the DoP operation increases. Compared with existing BF16 designs, the proposed architecture at 64 terms achieves at least 1.88× better normalized energy efficiency and at least 20.3× higher throughput.
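Since the abstract only sketches the fused DoP flow, the following minimal behavioral model in Python may help to make it concrete. It is an illustration of the general technique, not the paper's hardware design: BF16 operands are decoded, products are formed exactly in integers, an early exponent comparison aligns all products to the shared maximum exponent, and normalization/rounding happens only once at the end. The names (`fused_dot4`, `bf16_from_float`, `bf16_decode`) are my own, and zero/subnormal inputs are flushed to zero for brevity, an assumption not stated in the abstract.

```python
import math
import struct

def bf16_from_float(x):
    """Truncate an FP32 value to a 16-bit BF16 pattern (keep the top 16 bits)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16

def bf16_decode(bits):
    """Split a BF16 pattern into sign, biased exponent, and fraction fields."""
    return (bits >> 15) & 0x1, (bits >> 7) & 0xFF, bits & 0x7F

def fused_dot4(a, b):
    """Behavioral model of a fused 4-term BF16 dot product.

    Products are kept exact in integers, aligned to the shared maximum
    exponent (the 'early exponent comparison'), summed, and then
    normalized/rounded only once at the end. Zeros and subnormal inputs
    are flushed to zero for brevity (an assumption, not from the paper).
    """
    terms = []
    for ai, bi in zip(a, b):
        sa, ea, fa = bf16_decode(ai)
        sb, eb, fb = bf16_decode(bi)
        # 8-bit significands with the implicit leading 1 (normals only).
        ma = 0 if ea == 0 else (1 << 7) | fa
        mb = 0 if eb == 0 else (1 << 7) | fb
        # Exact 16-bit product significand; biased exponents add.
        terms.append((sa ^ sb, ea + eb - 127, ma * mb))

    # Early exponent comparison: one max over the product exponents,
    # then right-align every product to it before accumulation.
    emax = max(e for _, e, _ in terms)
    acc = 0
    for sign, e, prod in terms:
        aligned = prod >> (emax - e)  # models hardware alignment truncation
        acc += -aligned if sign else aligned

    # Single normalization/rounding step at the very end:
    # each aligned-product unit is worth 2**(emax - 127 - 14).
    return math.ldexp(acc, emax - 141)

a = [bf16_from_float(x) for x in (1.5, -2.0, 0.25, 3.0)]
b = [bf16_from_float(x) for x in (0.5, 1.0, -4.0, 2.0)]
print(fused_dot4(a, b))  # 0.75 - 2.0 - 1.0 + 6.0 = 3.75
```

The model captures only the numerical behavior; an actual unit of this kind would accumulate the aligned products in carry-save form and pipeline the work across the three cycles described above.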