ARA: Cross-Layer approximate computing framework based reconfigurable architecture for CNNs
Introduction
Convolutional Neural Networks (CNNs) are the most widely used and effective neural networks for visual classification problems, and CNNs of different scales have been devised for different tasks. LeCun proposed LeNet-5 [1] in 1998 for handwritten digit recognition with only two convolution layers, while AlexNet [2], introduced in 2012, has over 200 MB of parameters and five convolution layers. It inspired VGG-16 (552 MB, 13 convolution layers) [3], GoogLeNet (50 MB, 22 layers) [4], and Microsoft's ResNet (18 to 152 layers) [5]. As network models grow larger, the acceleration of CNNs has been studied intensively.
Four architectures dominate recent CNN accelerators: GPPs (General Purpose Processors), FPGAs (Field Programmable Gate Arrays), ASICs (Application Specific Integrated Circuits), and CGRAs (Coarse-Grained Reconfigurable Architectures). In 2015, Yazdanbakhsh et al. [6] used GPUs to accelerate CNNs, but the power consumption was too high. In 2016, Huynh [7] proposed DeepSense, a mobile-GPU framework for deep learning, yet the low energy efficiency of GPUs remains a problem. Nakahara et al. [8] mapped Binarized Neural Networks (BNNs) onto an FPGA for object detection in 2017, and in the same year Meloni et al. [9] put forward the NEURAghe architecture to map neural networks onto FPGAs efficiently. However, the low hardware resource utilization of FPGAs greatly limits their energy efficiency. DianNao [10], a 16-bit fixed-point ASIC for ANNs presented in 2014, achieved over 110 times the performance and 21 times the energy efficiency of GPUs. ShiDianNao [11] was then proposed for vision applications; by reusing the weights in CNNs, it significantly reduces external memory access power. For ASIC accelerators, however, support for a wider range of CNN types is still lacking despite the high energy efficiency. Thanks to their reconfigurability, CGRAs have proven suitable for many domain-specific applications, including deep learning. In 2017, Yin et al. proposed the Thinker series of reconfigurable architectures [12], [13], [14] for different neural networks. Fan et al. [15] used the Dual-Track CGRA for stream applications to accelerate CNNs. EIE [16] by Han et al. achieved high energy efficiency for compressed neural networks and supports different scales of CNNs through reconfiguration. In processing-in-memory accelerators such as PRIME [17] by Xie et al., the datapath and arithmetic units are also reconfigurable. CGRAs are regarded as a balanced architecture between system flexibility and energy efficiency.
To further improve the energy efficiency of accelerators, approximate computing has been introduced, exploiting the error-tolerant nature of neural networks. Zhang et al. [18] proposed ApproxANN, an approximate computing framework for ANNs that saves energy. Moreau et al. [19] designed programmable SoCs with approximate computing. Analog circuits have also been used for ANNs [20]. Cross-layer approximate computing models were developed by Sarwar [21] for energy-efficient neural computing. Owing to this error tolerance, CGRAs are adopting approximate computing techniques as well, such as E-ERA [22] proposed by Liu et al. However, previous approximate computing work on CGRAs still lacks accuracy control, as well as array and datapath designs with corresponding algorithms to make network models hardware friendly. In this paper, two aspects of approximate-computing-based CGRAs for CNNs are studied: hardware-friendly CNN compression methods, and the design of an approximate-computing-based reconfigurable architecture.
This paper makes the following contributions:
- 1) We design a hardware-friendly compression framework for CNNs, including a dynamic layered CNN structure to reduce computing operations and a kernel shrinking method with a suitable fast convolution algorithm to reduce the computational complexity of convolution layers;
- 2) Based on this framework, we propose approximate computing units, including a multi-port SRAM LUT based multiplier, a precision-controllable approximate multiplier, and an approximate adder with error correction logic. These units significantly improve the energy efficiency of the multiplication and addition operations in CNNs;
- 3) We propose ARA, an approximate-computing-based reconfigurable architecture for CNNs. ARA is composed of approximate computing units with configurable datapaths for different types of CNNs. Experiments show that ARA achieves high energy efficiency when processing CNNs.
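The fast convolution algorithm referenced in contribution 1) is the Winograd algorithm (named later in the conclusion). As a minimal illustration of why it reduces multiplications, not a description of ARA's implementation, the 1-D Winograd transform F(2,3) computes two outputs of a 3-tap convolution with 4 multiplications instead of 6:

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap 1-D convolution over a
    4-element input tile using 4 multiplications instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # four multiplications on transformed operands
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # inverse transform recombines them into the two outputs
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    """Reference: direct 3-tap convolution (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

print(winograd_f23([1, 2, 3, 4], [1, 0, -1]))  # -> [-2.0, -2.0]
```

For a 2-D 3 × 3 kernel the same idea (F(2 × 2, 3 × 3)) reduces 36 multiplications per output tile to 16, which is the kind of saving the kernel shrinking method exploits in convolution layers.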
Hardware-friendly CNN compression framework
Since different images contain different amounts of information, CNNs of various scales and structures use convolution (CONV) layers for feature extraction and abstraction and fully connected layers for classification. However, the information content of input images cannot be predicted in real scenarios, which makes it time-consuming for accelerators to switch networks between applications, or uneconomical to use unnecessarily large…
ARA: approximate computing based reconfigurable architecture
Based on the compression strategies proposed in Section II, we propose an approximate-computing-based reconfigurable architecture for CNNs. The architecture comprises two types of approximate-computing-based reconfigurable computing arrays: one composed of multi-port SRAM LUT based approximate multipliers, the other of precision-controllable processing units.
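To make the LUT-based multiplier concrete, the following is a hedged sketch of the general idea behind table-lookup approximate multiplication, assuming truncation of each 8-bit operand to its top 4 bits; ARA's actual multi-port SRAM circuit and truncation scheme are not specified in this excerpt:

```python
# Precompute a 16x16 product table over the truncated (top-4-bit) operands.
# In hardware this table would live in a multi-port SRAM shared by several
# processing elements, replacing full multipliers with cheap lookups.
LUT = [[(a << 4) * (b << 4) for b in range(16)] for a in range(16)]

def approx_mul8(a, b):
    """Approximate 8-bit x 8-bit multiply: index the table with the
    upper nibbles of each operand (lower bits are discarded)."""
    return LUT[a >> 4][b >> 4]

exact = 200 * 100            # 20000
approx = approx_mul8(200, 100)  # 192 * 96 = 18432, ~7.8% relative error
```

The accuracy/area trade-off is set by how many operand bits index the table: keeping k bits per operand needs a 2^k × 2^k table, and the discarded low bits bound the relative error, which error-tolerant CNN layers can absorb.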
Implementations and performance
Implemented in a TSMC 45 nm process, the proposed ARA consumes 204 mW at 1.1 V, 200 MHz and 21.1 mW at 0.9 V, 40 MHz, which meets the requirements of embedded systems; the specification of ARA is given in Table 6.
The performance of ARA is evaluated with typical CNNs, including CNNs for image classification, facial recognition, object detection, and image semantic segmentation. As shown in Table 7, ARA achieves an energy efficiency of 1.92 TOPS/W when working at 1.1 V, 200 MHz…
Conclusion
In this paper, ARA, an approximate and reconfigurable architecture, is proposed with heterogeneous neuron processing element arrays, including a multi-port SRAM LUT based approximate multiplier, a precision-controllable approximate multiplier, and improved GeAr approximate adders. First, a CNN compression framework is proposed, including a dynamic layered CNN structure and kernel shrinking methods with the Winograd algorithm to reduce operations in convolution layers. Then the approximate computing…
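The GeAr adder mentioned above splits a wide addition into short overlapping sub-adders, each guessing its carry-in from a few lower "prediction" bits. The sketch below illustrates that generic scheme with a simple true-carry correction path; it is an assumption-laden illustration, not the paper's improved error correction logic:

```python
def gear_add(a, b, n=16, r=4, p=4, correct=False):
    """GeAr-style approximate n-bit add: the first sub-adder covers the
    low r+p bits; each later sub-adder yields r result bits and uses p
    lower bits to predict its carry-in (predicted as 0 here).
    With correct=True, the true carry into each segment is injected,
    mimicking a detect-and-correct cycle and restoring the exact sum."""
    width = r + p
    mask = (1 << width) - 1
    result = ((a & mask) + (b & mask)) & mask  # bits [0, r+p)
    pos = width
    while pos < n:
        lo = pos - p
        seg = ((a >> lo) & mask) + ((b >> lo) & mask)  # carry-in assumed 0
        if correct:
            # error correction: compute the actual carry into bit `lo`
            low = (1 << lo) - 1
            seg += (((a & low) + (b & low)) >> lo) & 1
        result |= ((seg >> p) & ((1 << r) - 1)) << pos
        pos += r
    return result & ((1 << n) - 1)

# Carry mispredictions only matter when a carry must ripple through all
# p prediction bits, e.g. 0x00FF + 0x0001:
#   gear_add(0x00FF, 0x0001)               -> 0      (approximate)
#   gear_add(0x00FF, 0x0001, correct=True) -> 0x0100 (exact)
```

Because long carry chains are rare for random operands, the approximate result is usually exact, and the (slower) correction path fires only on misprediction, which is the energy argument behind such adders.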
Acknowledgements
This work was supported by the National Science and Technology Major Project (Grant No. 2018ZX01028-101-005), the National Natural Science Foundation of China (Grant No. 61404028, and 61771135) and the program of China Scholarships Council during a visit of Yu Gong to UCLA.
References (30)
- Fan et al., DT-CGRA: dual-track coarse-grained reconfigurable architecture for stream applications, Int. Conf. Field Programmable Logic and Applications (FPL), 2016.
- LeCun et al., Gradient-based learning applied to document recognition, Proc. IEEE, 1998.
- Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst., 2012.
- Simonyan et al., Very deep convolutional networks for large-scale image recognition, 2014.
- Szegedy et al., Going deeper with convolutions.
- He et al., Deep residual learning for image recognition.
- Yazdanbakhsh et al., Neural acceleration for GPU throughput processors.
- Huynh et al., DeepSense: a GPU-based deep convolutional neural network framework on commodity mobile devices.
- Nakahara et al., An object detector based on multiscale sliding window search using a fully pipelined binarized CNN on an FPGA.
- Meloni et al., NEURAghe, 2017.
- Chen et al., DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning, ACM SIGPLAN Not.
- Du et al., ShiDianNao: shifting vision processing closer to the sensor, Comput. Archit. News.
- Yin et al., A high energy efficient reconfigurable hybrid neural network processor for deep learning applications, IEEE J. Solid-State Circuits.
- Yin et al., A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications.
- Tu et al., Deep convolutional neural network architecture with reconfigurable computation patterns, IEEE Trans. Very Large Scale Integr. (VLSI) Syst.