Integration

Volume 74, September 2020, Pages 19-31

Logarithm-approximate floating-point multiplier is applicable to power-efficient neural network training

https://doi.org/10.1016/j.vlsi.2020.05.002

Highlights

  • Edge computing and the wide dynamic range inherent to gradients drive the demand for power-efficient floating-point (FP) neural network (NN) training.

  • We present LAM, combined with BWS, to significantly reduce the primary MAC computation in FP NN training for NNs of up to four hidden layers. Moreover, solo-LAM training proves sufficiently accurate on our datasets, so there is no need to rely on the exact multiplier.

  • Benchmarking on our dedicated training hardware shows that, compared to the exact multiplier, LAM and LAM + BWS achieve 2.5X and 4.9X power reduction, respectively, while sustaining accuracy.

  • We deploy an open-source GPGPU design with a programmable compiler to train LAM- and LAM + BWS-based NNs, attaining 1.32X and 1.54X power efficiency improvements through real-time measurement.

Abstract

Recently, emerging “edge computing” moves data and services from the cloud to nearby edge servers to achieve short latency and wide bandwidth and to mitigate privacy concerns. However, edge servers, often embedded with GPU processors, strongly demand power-efficient neural network (NN) training due to their power and size limitations. Moreover, given the broad dynamic range of the gradient values computed in NN training, floating-point representation is more suitable than fixed point. This paper proposes to adopt a logarithm-approximate multiplier (LAM) for multiply-accumulate (MAC) computation in NN training engines, where LAM approximates a floating-point multiplication as a fixed-point addition, resulting in smaller delay, fewer gates, and lower power consumption. We demonstrate the efficiency of LAM on two platforms: dedicated NN training hardware and an open-source GPU design. Compared to NN training with the exact multiplier, our implementation of the NN training engine for a 2-D classification dataset achieves a 10% speed-up and 2.3X improvements in power and area efficiency. LAM is also highly compatible with conventional bit-width scaling (BWS). When BWS is applied with LAM on five test datasets, the implemented training engines achieve more than 4.9X power efficiency improvement with at most 1% accuracy degradation, where 2.2X of the improvement originates from LAM. The advantage of LAM can also be exploited in processors. A GPU design embedded with LAM, implemented in an FPGA and executing an NN-training workload, presents a 1.32X power efficiency improvement, and the improvement reaches 1.54X with LAM + BWS. Finally, LAM-based training in deeper NNs is evaluated. Up to a 4-hidden-layer NN, LAM-based training achieves accuracy highly comparable to that of the exact multiplier, even with aggressive BWS.

Introduction

To enhance our daily life with artificial intelligence (AI), machine learning (ML) is currently adopted and executed everywhere, even on end devices such as smartphones, Internet-of-Things (IoT) sensors, and cameras. The data generated by these devices needs to be collected and aggregated as samples for learning and analysis. Conventionally, the data is transferred to the cloud since the learning and analysis processes require high computational capability and large memory capacity. However, moving the data to the cloud raises challenges in terms of latency, bandwidth, and privacy. Some applications, such as computer vision and natural language processing, strongly demand local and real-time services, and edge computing has emerged as a solution. In edge computing, edge servers, typically embedded with GPU processors, are placed near the end devices, and they process and analyze the data independently of the cloud or before sending the data to the cloud [1,2]. Compared with ML training in the cloud, training on edge servers may provide tailored ML models without security risk [1]. On the other hand, due to size and power limitations, power-efficient training is demanded and being explored [3,4].

The neural network (NN) is one of the most widely used techniques in machine learning [5]. A feedforward NN model is composed of a few to hundreds of layers, each of which includes a number of neurons. The neurons are connected layer by layer through synaptic weights, which are optimized through the computationally expensive training phase to provide sufficiently high accuracy. Hardware NN systems are mainly categorized into two types. The first is the inference engine, which processes a network with given pre-trained weights, and the second is the training engine, which has the additional capability of weight optimization in the training phase. In either engine, multiply-accumulate (MAC) arithmetic is the primary operation. The rapid growth of NN size to deal with more intricate and sophisticated problems explodes the amount of MAC computation, resulting in a strong demand for dedicated hardware engines.

Inference engines in several studies exploit the inherent error tolerance of machine learning and introduce approximate computing (AC) techniques to gain performance and reduce cost [[6], [7], [8], [9]]. Among the various AC techniques proposed for inference engines, bit-width scaling (BWS), which reduces the bit width of the data representation, is the most popular and powerful way to trade computation reduction against accuracy degradation [6,[10], [11], [12]]. Even binarized neural networks have been studied [13].
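
As a rough, hedged illustration of what BWS means for a floating-point datapath, the hypothetical helper below simply truncates the 23-bit fraction of a float32 down to a chosen number of bits; it is only a sketch, and the concrete bit widths used in this paper are introduced later.

```python
import struct

def truncate_fraction(x: float, kept_bits: int) -> float:
    """Bit-width scaling sketch: keep only `kept_bits` of the 23 fraction
    bits of an IEEE-754 float32 and zero the rest (truncation, no rounding).
    Assumes 0 <= kept_bits <= 23."""
    pattern = struct.unpack('<I', struct.pack('<f', x))[0]
    mask = (0xFFFFFFFF << (23 - kept_bits)) & 0xFFFFFFFF  # sign + exponent + top fraction bits
    return struct.unpack('<f', struct.pack('<I', pattern & mask))[0]

print(truncate_fraction(3.14159265, 7))   # 3.140625: coarser but cheaper representation
```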

In contrast to inference, training engines need to perform more arithmetic computation over a wider dynamic range, since the gradient, which is numerically computed and used to guide the weight update, spreads over a broad range [10]. A simple example helps illustrate this property. Fig. 1 plots the distribution density of the gradient values found when training an NN for the MNIST dataset [24]. The gradient values spread from 2^-47 to 2^6. With a fixed-point representation, more than 50 bits are required to cover this range, whereas a floating-point representation spends only a few bits on the exponent and some extra bits on the fraction. Thus, adopting floating-point units (FPUs) is beneficial for training engines to accommodate such gradient computation.
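
The bit-count argument can be reproduced with a few lines of arithmetic. The range below is the one quoted above, and the exponent-bit estimate is only a back-of-the-envelope sketch rather than a statement about the format the paper actually uses.

```python
import math

# Gradient magnitude range quoted above (Fig. 1): 2^-47 .. 2^6.
g_min, g_max = 2.0 ** -47, 2.0 ** 6

# A fixed-point format needs one bit weight per power of two in the range,
# i.e. weights 2^-47 .. 2^6 inclusive:
fixed_point_bits = math.ceil(math.log2(g_max / g_min)) + 1    # 54 bits

# A floating-point format only needs enough exponent bits to index those
# octaves (plus however many fraction bits are chosen for precision):
exponent_bits = math.ceil(math.log2(fixed_point_bits))        # 6 bits

print(fixed_point_bits, exponent_bits)                        # 54 6
```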

Nevertheless, FPUs are known as power-hungry and area-expensive units [14]. Specifically, in NN algorithms that mainly consist of MAC computations, floating-point multiplication is the most power-hungry and area-demanding arithmetic operator. Fig. 2 compares the power and area of a 32-bit floating-point multiplier and adder synthesized for the same clock frequency. The figure shows that the floating-point multiplier consumes 3.01X the power and 1.75X the area. The MAC operation requires the same number of multiplications and additions, indicating that the power of a MAC operation is mostly consumed by the multiplier. Consequently, massive MAC computation in NN training deteriorates area and power efficiency. Therefore, power-efficient floating-point multiplication is highly demanded in training engine development.

In this paper, we show that the logarithm-approximate multiplier (LAM), which approximates floating-point multiplication as fixed-point addition, benefits NN training and improves the power efficiency of the massive MAC computation involved in floating-point NN training. We also show that LAM is useful even when BWS is already applied in training, so that power efficiency is further enhanced. These advantages are quantitatively evaluated through experiments with dedicated training hardware. Experimental results show a 2.5X power reduction with LAM and a 4.9X reduction with LAM + BWS while sustaining accuracy, where a 2.2X reduction originates from LAM. These results were reported in our preliminary work [15].

In this work, we newly evaluate whether solo or hybrid use of exact and approximate multipliers in the training and testing phases affects the classification accuracy. Experimental results reveal that adopting approximate multipliers (LAMs) in both the training and testing phases induces no significant accuracy degradation, so there is no need to rely on accurate multipliers. Next, we conduct additional experiments to evaluate the applicability of LAM and LAM + BWS to an open-source GPU design on which NN training programs are executed. The power reductions due to LAM and LAM + BWS, measured with an FPGA implementation of the GPU design, are 1.32X and 1.54X compared to the original design. Finally, we extend the NN depth up to 4 hidden layers and show that solo-LAM training achieves results highly comparable with solo training using the exact floating-point multiplier (EFM). This trend holds even when BWS is aggressively adopted, as long as acceptable training accuracy is obtained.

The rest of the paper is organized as follows. Section 2 reviews the NN with its training and related works. Section 3 introduces LAM and discusses its approximation error. Experimental results of adopting LAM in dedicated training engines are presented in Section 4. Section 5 provides the environmental setup and measurement results of LAM-based NN training with FPGA implementation of an open-source GPU design. Section 6 applies LAM-based training to deeper NNs and provides evaluation results. Finally, Section 7 concludes this paper.

Section snippets

Basics of neural network

Fig. 3a illustrates a multilayer perceptron (MLP) structure, which is known as a basic feedforward NN [18]. For the sake of clarity, the structure contains only 1 hidden layer, while the number of hidden layers can be extended to form deeper NNs. Each neuron's state is computed from the states of all neurons in the previous layer and is then propagated to the next layer. Taking the example in Fig. 3a, since all the states are pre-determined in the input layer I, the …
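
As a minimal sketch of this layer-by-layer propagation (assuming a sigmoid activation, which the paper may or may not use), each neuron's state is a MAC over all states of the previous layer followed by the activation function:

```python
import numpy as np

def forward(x, weights, biases):
    """Propagate input states layer by layer through an MLP.
    Each layer's states are weighted sums (MACs) of all previous-layer
    states, passed through the activation function."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    state = np.asarray(x, dtype=np.float32)
    for W, b in zip(weights, biases):
        state = sigmoid(W @ state + b)   # one MAC per (neuron, input) pair
    return state
```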

Logarithm approximate multiplier

The logarithm-approximate multiplier, LAM in short, was developed in Ref. [16]. With an approximation, a floating-point value in the linear domain can be regarded as a fixed-point representation of its base-2 logarithm. Thanks to this log-domain property, floating-point multiplication can be approximated by fixed-point addition. This section introduces LAM and analyzes its approximation error.
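
The sketch below illustrates the idea in software: a Mitchell-style approximation that adds float32 bit patterns, since the exponent and fraction fields together act as a fixed-point log2 of the magnitude. It omits the error-compensation constant, zero/subnormal handling, and other refinements an actual LAM design such as [16] may include.

```python
import struct

def f2i(x: float) -> int:
    """IEEE-754 float32 -> raw 32-bit pattern."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def i2f(b: int) -> float:
    """Raw 32-bit pattern -> IEEE-754 float32."""
    return struct.unpack('<f', struct.pack('<I', b & 0xFFFFFFFF))[0]

def lam_mul(a: float, b: float) -> float:
    """Approximate a*b by adding bit patterns: the exponent + fraction fields
    are a fixed-point log2 of the magnitude (plus a bias), so adding the two
    patterns and subtracting one bias approximates the multiplication."""
    BIAS = 0x3F800000                        # bit pattern of 1.0
    sign = (f2i(a) ^ f2i(b)) & 0x80000000    # signs are multiplied separately
    mag = f2i(abs(a)) + f2i(abs(b)) - BIAS   # fixed-point "log add"
    return i2f(sign | (mag & 0x7FFFFFFF))

print(lam_mul(1.5, 2.5), 1.5 * 2.5)  # 3.5 vs 3.75 (underestimates by at most ~11%)
```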

Experimental results for dedicated hardware design

This section shows the advantage of LAM as an arithmetic unit and demonstrates the impact of LAM in the NN training engine on classification accuracy and hardware resources.

Evaluation in GPU design

Following the performance evaluation with dedicated training engines in the previous section, this section applies LAM and BWS to an open-source GPU design and clarifies the advantage in NN training.

Evaluation for deeper NNs

The advantage of applying LAM and BWS to deeper NN training is presented in this section. We show the training results of NNs with 2, 3, and 4 hidden layers, respectively, for the MNIST dataset. In every NN structure, each hidden layer consists of 50 neurons. The NN training results for solo-EFM and solo-LAM with different configurations of BWS are shown in Fig. 21.
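
For reference, the layer widths of the evaluated networks look like the following sketch, where the 784 inputs and 10 output classes are assumptions based on standard MNIST rather than values stated in this snippet:

```python
def mlp_shape(hidden_layers: int, width: int = 50, n_in: int = 784, n_out: int = 10):
    """Layer widths of the evaluated MLPs: every hidden layer has 50 neurons."""
    return [n_in] + [width] * hidden_layers + [n_out]

for h in (2, 3, 4):
    print(h, mlp_shape(h))   # e.g. 4 -> [784, 50, 50, 50, 50, 10]
```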

Fig. 21 shows that, up to the 4-hidden-layer NN, LAM-based training yields accuracy comparable to that of EFM-based training …

Conclusions

Edge computing drives the demand for more power-efficient data processing, such as neural network training. This work evaluated whether an approximate floating-point multiplier, which can cover a broad dynamic range, can be adopted in NN training to achieve higher energy efficiency. Specifically, we focused on the logarithm-approximate multiplier (LAM) incorporating bit-width scaling (BWS) to reduce the complexity of the primary MAC computation. The experimental results with the dedicated hardware design show that …

CRediT authorship contribution statement

TaiYu Cheng: Conceptualization, Software, Validation, Formal analysis, Data curation, Writing - original draft. Yutaka Masuda: Resources. Jun Chen: Methodology. Jaehoon Yu: Supervision. Masanori Hashimoto: Supervision, Writing - review & editing.

Declaration of competing interest

None.

References (35)

  • J. Chen et al., Deep learning with edge computing: a review, Proc. IEEE (2019).
  • G.L. Pedro, Edge-centric computing: vision and challenges.
  • P. Grulich, Collaborative edge and cloud neural networks for real-time video processing.
  • Y. Huang, When deep learning meets edge computing.
  • A. Krizhevsky, Imagenet classification with deep convolutional neural networks.
  • J.Y.F. Tong, Reducing power by optimizing the necessary precision/range of floating-point arithmetic, TVLSI (2000).
  • S. Venkataramani, AxNN: energy-efficient neuromorphic systems using approximate computing.
  • Q. Zhang, ApproxANN: an approximate computing framework for artificial neural network.
  • J. Kung, A power-aware digital feedforward neural network platform with backpropagation driven approximate synapses.
  • J. David, Training Deep Neural Networks with Low Precision Multiplications (2014).
  • N. Wang, Training deep neural networks with 8-bit floating point numbers.
  • D. Kim, A power-aware digital multilayer perceptron accelerator with on-chip training based on approximate computing, TETIC (Feb. 2017).
  • Simons et al., A review of binarized neural networks, Electronics 8, 661. ...
  • M. Horowitz, Energy Table for 45nm Process, Stanford VLSI ...
  • T. Cheng, Minimizing power for neural network training with logarithm-approximate floating-point multiplier.
  • M. Gao, Energy efficient runtime approximate computing on data flow graphs.
  • C. Chang et al., Fourclass (1996).

TaiYu Cheng received the B.E. and M.E. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2010 and 2012, respectively. From 2012 to 2018, he was with Taiwan Semiconductor Manufacturing Company, Hsinchu, Taiwan, where he was engaged in the design flow of timing closure. Since 2018, he has been a Ph.D. student with the Department of Information Systems Engineering, Osaka University, Osaka, Japan. His research interests include low-power circuit design.

Yutaka Masuda received the B.E., M.E., and Ph.D. degrees in Information Systems Engineering from Osaka University, Osaka, Japan, in 2014, 2016, and 2019, respectively. He is currently an Assistant Professor at the Center for Embedded Computing Systems, Graduate School of Informatics, Nagoya University. His research interests include low-power circuit design. He is a member of IEEE, IEICE, and IPSJ.

Jun Chen received the B.E. and M.E. degrees in control theory and engineering from Tongji University, Shanghai, China, in 2004 and 2007, respectively, and received his Ph.D. degree in Information Systems Engineering from Osaka University, Osaka, Japan, in 2020. From 2008 to 2016, he was with Synopsys Inc., Shanghai, China, where he was engaged in research and development of routing congestion, power, and placement optimization flows. He is currently a software engineer at Giga Design Automation Co., Ltd. His research interests include computer-aided design for digital integrated circuits and power and signal integrity analysis.

Jaehoon Yu received his B.E. degree in Electrical and Electronic Engineering and his M.S. degree in Communications and Computer Engineering from Kyoto University, Kyoto, Japan, in 2005 and 2007, respectively, and received his Ph.D. degree in Information Systems Engineering from Osaka University, Osaka, Japan, in 2013. From 2013 to 2019, he was an assistant professor at Osaka University. He is currently an associate professor at Tokyo Institute of Technology, Japan. His research interests include computer vision, machine learning, and system-level design. He is a member of IEEE, IEICE, and IPSJ.

Masanori Hashimoto received the B.E., M.E. and Ph.D. degrees in communications and computer engineering from Kyoto University, Kyoto, Japan, in 1997, 1999, and 2001, respectively. He is currently a Professor with the Department of Information Systems Engineering, Graduate School of Information Science and Technology, Osaka University, Suita, Japan. His current research interests include design for manufacturability and reliability, timing and power integrity analysis, reconfigurable computing, soft error characterization, and low-power circuit design. Dr. Hashimoto was a recipient of the Best Paper Awards from ASP-DAC in 2004 and RADECS in 2017, and the Best Paper Award of the IEICE Transactions in 2016. He was on the Technical Program Committee of international conferences, including DAC, ICCAD, ITC, Symposium on VLSI Circuits, ASP-DAC, and DATE. He serves/served as an Associate Editor for the IEEE Transactions on VLSI Systems, IEEE Transactions on Circuits and Systems I, ACM Transactions on Design Automation of Electronic Systems, and Elsevier Microelectronics Reliability.

This work is supported by Grant-in-Aid for Scientific Research (B) from JSPS under Grant 19H04079.
