
Integration

Volume 88, January 2023, Pages 241-248

A fine-grained mixed precision DNN accelerator using a two-stage big–little core RISC-V MCU

https://doi.org/10.1016/j.vlsi.2022.10.006

Highlights

  • A mixed-precision structure to process weights with different effective bit-widths separately.

  • An MCU with two big–little RISC-V cores to control the accelerator and peripherals.

  • A fine-grained control flow to manage the queuing operations with low hardware cost.

Abstract

Deep neural networks (DNNs) are widely used in modern AI systems, and dedicated accelerators have become a promising option for edge scenarios due to their energy efficiency and high performance. Since DNN models require significant storage and computation resources, various energy-efficient DNN accelerators and algorithms have been proposed for edge devices. Many quantization algorithms for efficient DNN training have been proposed, where the weights in the DNN layers are quantized to small or zero values and therefore require far fewer effective bits, i.e., low-precision representations in both arithmetic and storage units. Such sparsity can be leveraged to remove non-effective bits and reduce the design cost, but at the cost of accuracy degradation, as some key operations still demand higher precision. Therefore, in this paper, we propose a universal mixed-precision DNN accelerator architecture that can simultaneously support mixed-precision DNN arithmetic operations. A big–little core controller based on RISC-V is implemented to effectively control the datapath and assign the arithmetic operations to full-precision and low-precision processing units, respectively. Experimental results show that, with the proposed designs, we can save 16% chip area and 45.2% DRAM access compared with the state-of-the-art design.

Introduction

Deep neural networks (DNNs) have recently been demonstrated to be highly effective for computer vision and speech recognition, and hence are particularly promising for Internet of Things (IoT) devices at the edge [1], [2], [3], [4]. Such intelligent edge devices are typically constrained by the available computational and storage resources; it is thus highly desirable to design efficient hardware implementations that support efficient DNN models with high speed and low power. General-purpose GPUs and CPUs are possible solutions for DNN deployment but are not applicable to edge scenarios due to their significant resource requirements. Many recent works have developed energy-efficient DNN acceleration implementations using Field Programmable Gate Arrays (FPGAs) [5], [6], [7], [8], [9] and Application Specific Integrated Circuits (ASICs) [10], [11], [12], [13], [14]. Aware of the complexity of DNN algorithms, which feature millions of Multiply-Accumulate (MAC) operations and frequent data transmissions, many prior studies have focused on effective performance and accuracy trade-offs using limited computational resources [15], [16], [17], [18]. Among these efforts, quantization and sparsification are the two most commonly used techniques [19], [20], [21]. Through a careful training procedure, many weights in a DNN can be zeroed out or quantized to fewer bits while maintaining a similar accuracy [22], [23]. Representing weights with lower bit-widths in hardware reduces the resources demanded for both computation and storage. It is then an appealing option to design the processing element (PE) in a DNN accelerator, i.e., the basic arithmetic unit, with only the required bit-width to reduce the design overhead in both cache and MAC.

For example, prior works [11], [14], [24], [25], [26], [27], [28] have already proposed to reduce the bit-widths of inputs and weights to 8 bits, and even to 4 or 2 bits. Moreover, binary neural networks are also popular for aggressive energy reduction [5], [11]. While quantization provides non-trivial memory bandwidth reduction, the quantized networks have to be implemented in hardware with a fixed precision, which is often the largest bit-width needed for the network. Thus, the applications have been constrained to particular scenarios without the flexibility to support varying demands [5], [6], [11]. To gain more flexibility, some general-purpose accelerator architectures have been proposed with configurability, which can be used to deploy different DNN algorithms [10], [12], [13], [14]. The PE in [13] has two 8-bit multipliers, which can be switched between 8- and 16-bit precision modes. ENVISION in [12] uses a Booth multiplier that can be configured to 4, 8 or 16 bits, and dynamic voltage and frequency scaling is also adopted for a further power-performance trade-off. UNPU in [14] proposes serial lookup-table-based multipliers to implement PEs that support precisions from 1 to 16 bits. In other words, at the expense of more complex control logic for the PEs, this configurability can provide various precisions for different neural networks, or for different layers in one neural network, thus improving the overall energy efficiency [12], [13], [14].
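To make the idea of composing a wider multiplication from narrower multipliers concrete, the following sketch shows the textbook decomposition of a 16-bit unsigned multiply into 8-bit partial products. It is our own Python illustration, not the exact scheme used in [13] (which reportedly switches two 8-bit multipliers between 8- and 16-bit modes) or in ENVISION and UNPU.

    def mul16_from_8bit(a: int, b: int) -> int:
        # Compose a 16x16-bit unsigned multiply from 8-bit partial products,
        # illustrating how narrow multipliers can be reused for wider precision.
        assert 0 <= a < (1 << 16) and 0 <= b < (1 << 16)
        a_hi, a_lo = a >> 8, a & 0xFF
        b_hi, b_lo = b >> 8, b & 0xFF
        return ((a_hi * b_hi) << 16) + ((a_hi * b_lo + a_lo * b_hi) << 8) + a_lo * b_lo

    assert mul16_from_8bit(12345, 54321) == 12345 * 54321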

In most of the previous work, the configuration of precision exists only in the time dimension, that is, one precision per layer. Within a layer, a fixed precision is used even if the data values are vastly different. We note that the effective precision of weights, i.e., the minimum number of bits required to represent a weight without accuracy loss, may actually vary significantly within a layer. Suppose that in one layer of a network, most weights have values less than 128 (normalized to integers for ease of description); these weights can be represented by int8. The remaining weights are greater than 128 and need to be represented by int16. In prior accelerators, this layer would need to be calculated with 16-bit precision, even though the weights greater than 128 are quite few. We found that this phenomenon is very common. Fig. 1 shows the effective weight precision distributions of different layers in AlexNet [29] and VGG16 [1] from [21]. While the highest effective precision is up to 16 bits, the majority of weights within the same layer require only 5 bits or even fewer. This is because the neural network tends to converge to more distributed and sparse weights in the process of quantization-driven training. Previous designs had to compute with the highest precision to ensure accuracy, which inevitably wastes non-trivial computation and storage resources: much of the storage and computation that could be done with 8 bits had to be done with 16 bits, even though the higher-order bits of these low-effective-precision weights can be skipped without any change in value.
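To make this observation concrete, the following minimal sketch (our own illustration, not code or data from the paper) computes the effective bit-width of integer-quantized weights and reports what fraction would fit into 8 bits; the synthetic weight values only mimic the skewed distribution described above.

    import numpy as np

    def effective_bits(w: int) -> int:
        # Minimum number of bits (sign bit included) needed to represent
        # a signed integer weight without changing its value.
        return abs(int(w)).bit_length() + 1  # numerical bits + sign bit

    # Hypothetical layer: most weights are small, a few need full precision.
    rng = np.random.default_rng(0)
    weights = rng.integers(-100, 100, size=1000)       # representable in int8
    weights[:20] = rng.integers(1000, 30000, size=20)  # outliers needing int16

    bits = np.array([effective_bits(w) for w in weights])
    low_precision = np.count_nonzero(bits <= 8)
    print(f"{low_precision / len(weights):.1%} of weights fit in 8 bits")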

An ideal approach is then to represent each weight with only the minimal necessary bit-width and to provide fine-grained mixed-precision support so that data of different precisions are processed in different PEs. For example, weights greater than 128 can be processed with 16-bit precision, and the rest with 8-bit precision or even finer granularity. Obviously, fully configurable per-PE precision is impractical in terms of resource consumption [30], [31]. Moreover, even with only two different precisions, the control logic can be too complex to actually reduce the overall hardware resources. Thus, in order to design a fine-grained mixed-precision DNN accelerator, a few challenges need to be addressed:

  • It is not efficient to process all data with uniform precision, especially in scenarios where the data distribution has a distinct pattern. A universal accelerator architecture that filters the data and processes them separately is highly desired.

  • Because of the complexity of neural network operations, processing data separately further increases the hardware complexity. How can an efficient controller be designed that achieves precise control with as few hardware resources as possible?

  • An effective datapath control flow is needed to fetch, process and merge the results from the mixed-precision PEs, and its cost must be covered by the savings from the mixed-precision support.

To address the above challenges, this paper proposes a general-purpose mixed-precision DNN accelerator with a two-stage RISC-V micro-controller (MCU). The main contributions are summarized as follows:

  • Mixed-precision accelerator architecture: The weights in a DNN are divided into two groups according to their effective bit-widths: Full Precision (FP) and Low Precision (LP). A small set of full-precision PEs (FPPE) and a low-precision PE (LPPE) array process the two groups of weights separately (a minimal sketch of this split is given after this list). Such an architecture is applicable to different DNNs to support simultaneous computations.

  • Two-stage big–little core Micro-controller Unit (MCU): We propose a RISC-V-based two-stage MCU that consists of a big core and a little one to control the accelerator. The big core controls the overall accelerator operation as well as the little core, while the little core micro-manages the FPPE array in the accelerator. Such an MCU can effectively manage both global and local operations for the proposed DNN accelerator.

  • Unique datapath control flow: A unique control flow is designed to manage the queuing operations. FP data fetching and result merging are implemented with minimal hardware cost through this control flow for fast execution.
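As a rough illustration of the FP/LP split and merge described in the contributions above (our own sketch in Python; the actual datapath, queuing and control flow are detailed in Section 3), the weights of a layer can be partitioned by effective bit-width, the two groups accumulated separately, and the partial sums merged so that the result matches a full-precision computation:

    import numpy as np

    LP_BITS = 8  # assumed low-precision width; the FP path handles the rest

    def lp_mask(weights: np.ndarray, lp_bits: int = LP_BITS) -> np.ndarray:
        # True for weights whose effective bit-width (sign included) fits into
        # lp_bits; these would go to the LPPE array, the rest to the FPPE.
        bits = np.array([abs(int(w)).bit_length() + 1 for w in weights])
        return bits <= lp_bits

    def mixed_precision_dot(weights: np.ndarray, activations: np.ndarray) -> int:
        # Dot product computed as two partial sums (LPPE- and FPPE-analogue)
        # that are merged at the end.
        mask = lp_mask(weights)
        lp_sum = int(np.dot(weights[mask], activations[mask]))
        fp_sum = int(np.dot(weights[~mask], activations[~mask]))
        return lp_sum + fp_sum  # merge step

    rng = np.random.default_rng(1)
    w = rng.integers(-200, 200, size=64)
    x = rng.integers(0, 128, size=64)
    assert mixed_precision_dot(w, x) == int(np.dot(w, x))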

The proposed architecture is implemented on FPGA and with a UMC 40 nm library for ASIC, and then compared with a prior state-of-the-art accelerator [10]. The experimental results show that the proposed implementation can save almost half of the weight storage and MAC area compared with the single-precision accelerator in [10]. The proposed two-stage big–little core MCU consumes only 11k look-up tables (LUTs) in FPGA or 91k μm² of chip area in ASIC. Due to the mixed-precision configuration, DRAM access is also reduced by more than 40%, resulting in an overall power consumption reduction.

The rest of this paper is organized as follows. Section 2 introduces the preliminaries of our design. Details of the proposed architecture are shown in Section 3. We show the experimental results in Section 4, with Section 5 concluding this paper.

Section snippets

Effective bit widths in fixed-point format

Fig. 2 demonstrates the format of a 16-bit signed fixed-point number in a general hardware system. The most significant bit (MSB) is the sign bit and the bits to the right of it are the numerical bits. Take the number “8” as an example: it can be represented by only five digits, “01000”. However, in a 16-bit system, it has to be represented by “0000 0000 0000 1000”. Thus, the first 11 bits are non-effective in a 16-bit system. Assume that a 16-bit signed fixed-point weight is represented by {W15
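The snippet is truncated here, but the notion of non-effective bits can be made concrete with a small helper (our own sketch, not the paper's hardware logic) that counts how many leading bits of a 16-bit signed value are redundant and carry no information, reproducing the “8” example above.

    def non_effective_bits(value: int, width: int = 16) -> int:
        # Count leading bits of a width-bit signed value that can be skipped
        # without changing the value.
        if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
            raise ValueError("value does not fit in the given width")
        effective = abs(value).bit_length() + 1  # numerical bits + sign bit
        return width - effective

    assert non_effective_bits(8) == 11  # “8” needs only the five bits “01000”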

Proposed architecture

We detail the architectural implementation of the proposed mixed-precision accelerator and big–little core MCU in this section.

Experimental results

The proposed architecture is implemented on a Xilinx ZCU102 FPGA. We apply the proposed architecture to four neural network models, i.e., AlexNet, VGG16, ResNet18 and ResNet50, to comprehensively evaluate the effectiveness in improving hardware utilization. We also compare the hardware area of the proposed architecture with the other state-of-the-art work [10] using single precision. For a fair comparison, the underlying accelerator setup is similar to [10], where LPPE array size and FPPE array

Conclusions

This work proposed a mixed-precision DNN accelerator architecture with a two-stage big–little core RISC-V MCU. The mixed-precision structure processes FP and LP weights separately, which not only improves the overall hardware utilization but also saves chip area. The two-stage big–little core MCU is designed to control the more complex logic in the proposed accelerator. While the big core is used to manage the overall system, the little core Simpico is used to micro-manage the FP array.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (Grant No. 2018YFE0126300), Zhejiang Provincial NSF (Grant No. LD21F040003), NSFC, China (Grant No. 61974133) and the Open Research Project of the State Key Laboratory of Industrial Control Technology, China (No. ICT2022B61).

References (37)

  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proc. CVPR,...
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. CVPR,...
  • J. Deng, et al., Energy efficient real-time UAV object detection on embedded platforms, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (TCAD), 2020.
  • M. Bojarski, et al., End to end learning for self-driving cars, in: Proc. CVPR,...
  • G. Peng, et al., FBNA: A Fully Binarized Neural Network Accelerator, in: Proc. FPL,...
  • D.J.M. Moss, High performance binary neural networks on the Xeon+FPGA™ platform, in: Proc. FPL,...
  • R. Cai, et al., VIBNN: Hardware Acceleration of Bayesian Neural Networks, in: Proc. ACM,...
  • Y. Ma, T. Zheng, Y. Cao, S. Vrudhula, J. Seo, Algorithm-Hardware Co-Design of Single Shot Detector for Fast Object...
  • C. Yao, J. He, X. Zhang, C. Hao, D. Chen, Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs, in: Proc....
  • Y.H. Chen, et al., Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, IEEE J. Solid-State Circuits, 2017.
  • S. Yin, et al., An ultra-high energy-efficient reconfigurable processor for deep neural networks with binary/ternary...
  • B. Moons, R. Uytterhoeven, W. Dehaene, M. Verhelst, Envision: A 0.26-to-10 TOPS/W subword-parallel...
  • D. Shin, J. Lee, J. Lee, H.-J. Yoo, DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep...
  • J. Lee, UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision, IEEE J. Solid-State Circuits, 2019.
  • A.G. Howard, et al., Mobilenets: Efficient convolutional neural networks for mobile vision applications, in: Proc....
  • M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural...
  • F.N. Iandola, et al., SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size, in: Proc....
  • C. Chen, PAM: A piecewise-linearly-approximated floating-point multiplier with unbiasedness and configurability, IEEE Trans. Computers, 2022.

    Li Zhang received his Ph.D. degree from Huazhong University of Science and Technology in 2018. He then conducted postdoctoral research at Zhejiang University. Since 2021, he has been a faculty member at Hubei University of Technology. His current research interests focus on low-power chip design, artificial intelligence algorithms and hardware acceleration, and 3D chip design and optimization.

    Qishen Lv is a senior engineer at Shenzhen Power Supply Co., Ltd. His research interests cover the state detection, evaluation and diagnosis of power equipment.

    Di Gao is currently pursuing a PhD with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China. Her current research interests include hardware acceleration and architecture design and optimization.

    Xian Zhou is a master's degree candidate in the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China. His current research interests include processing-in-memory circuit and architecture design and optimization.

    Wenchao Meng received the Ph.D. degree in control science and engineering from Zhejiang University, Hangzhou, China, in 2015, where he is currently with the College of Control Science and Engineering. His current research interests include adaptive intelligent control, cyber–physical systems, renewable energy systems, and smart grids.

    Qinmin Yang received the Ph.D. degree in electrical engineering from the University of Missouri-Rolla, Rolla, MO, USA, in 2007. Since 2010, he has been with the State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University, Hangzhou, China, where he is currently a Professor. His research interests include intelligent control, renewable energy systems, smart grid, and industrial big data.

    Cheng Zhuo received the BS and MS degrees from Zhejiang University, China, in 2005 and 2007, respectively, and the PhD degree in computer science and engineering from the University of Michigan, Ann Arbor, MI, in 2010. He is currently a professor with the College of Information Science and Electronic Engineering, Zhejiang University. His research interests include low-power optimization, 3D integration, hardware acceleration, and power and signal integrity.
