Elsevier

Microelectronics Journal

Volume 87, May 2019, Pages 33-44

ARA: Cross-Layer approximate computing framework based reconfigurable architecture for CNNs

https://doi.org/10.1016/j.mejo.2019.03.011

Abstract

Convolutional Neural Networks (CNNs) are now widely used in image processing, object detection, video detection, and other classification tasks, and accelerating CNNs is actively researched because of their complex computation patterns and data dependences. To achieve high energy efficiency, we propose a CNN accelerator built on approximate computing techniques. This paper studies two main aspects: hardware-compatible network compression algorithms, and approximate computing units and architectures with hardware resource scheduling strategies. On the algorithm side, we introduce a dynamic layered CNN structure for different scales of input, a kernel-shrinking strategy with layer-by-layer quantization to compress networks, and the Winograd minimal filtering algorithm to reduce the number of operations in convolution layers. On the architecture side, two types of approximate multipliers are devised: iterative multipliers, and multi-port SRAM-integrated LUT based multipliers. Approximate adders with error correction logic are also designed. Based on these approximate computing units, we propose the Convolution Neural Processing Unit (CNPU), whose reconfigurable datapaths allow different tasks to be mapped onto it. Combining the algorithm work, the CNPU architecture, and the datapath design, we obtain a highly energy-efficient reconfigurable CNN accelerator with approximate computing, named ARA (Approximate computing based Reconfigurable Architecture). Implemented in a TSMC 45 nm process, our accelerator achieves an energy efficiency of 1.92 TOPS/W at 1.1 V, 200 MHz and 3.72 TOPS/W at 0.9 V, 40 MHz, which is 1.51–4.36 times better than state-of-the-art accelerators.

Introduction

Convolutional Neural Networks (CNNs) are among the most widely used and effective neural networks for visual classification problems, and CNNs of different scales have been devised for different tasks. LeCun proposed LeNet-5 [1] in 1998 for handwritten digit recognition with only two convolution layers, while in 2012 AlexNet [2] was introduced with over 200 MB of parameters and five convolution layers. Since then, VGG-16 (552 MB, 13 convolution layers) [3], GoogLeNet (50 MB, 22 layers) [4], and ResNet (18 to 152 layers) [5] by Microsoft have followed. As network models grow larger, the acceleration of CNNs has been researched intensively.

In recent years, CNN accelerators have mainly followed four architectures: GPP (General Purpose Processor), FPGA (Field Programmable Gate Array), ASIC (Application Specific Integrated Circuit), and CGRA (Coarse-Grained Reconfigurable Architecture). In 2015, Yazdanbakhsh et al. [6] used GPUs to accelerate CNNs, but the power consumption was too high. In 2016, Huynh [7] proposed DeepSense, a mobile GPU framework for deep learning; however, the low energy efficiency of GPUs remains a problem. Nakahara et al. [8] performed object detection by mapping Binarized Neural Networks (BNNs) onto an FPGA in 2017. In the same year, Meloni et al. [9] put forward the NEURAghe architecture to map neural networks onto FPGAs with high efficiency. However, the low hardware resource utilization of FPGAs greatly limits their energy efficiency. DianNao [10] is a 16-bit fixed-point ASIC for ANNs that in 2014 achieved over 110 times the performance and 21 times the energy efficiency of GPUs. ShiDianNao [11] was then proposed for vision applications; by reusing the weights in CNNs, it significantly reduced external memory access power. However, despite their high energy efficiency, ASIC accelerators still lack support for a broad range of CNN types. Thanks to their reconfigurability, CGRAs have proven suitable for many domain-specific applications, including deep learning. In 2017, Yin et al. proposed a series of reconfigurable architectures, the Thinker family [[12], [13], [14]], for different neural networks. Fan et al. [15] used the Stream Dual-Track CGRA to accelerate CNNs. EIE [16] by Han et al. achieved high energy efficiency for compressed neural networks and can support different scales of CNNs through reconfiguration. In processing-in-memory accelerators such as PRIME [17] by Xie et al., the datapath and arithmetic units are also reconfigurable. CGRAs are regarded as a balanced architecture between system flexibility and energy efficiency.

To further improve the energy efficiency of accelerators, approximate computing has been introduced, exploiting the error-tolerant nature of neural networks. Zhang et al. [18] proposed ApproxANN, an approximate computing framework for ANNs, to save energy. Programmable SoCs with approximate computing have likewise been designed by Moreau et al. [19]. Besides, analog circuits have also been used for ANNs [20]. Cross-layer approximate computing models were developed by Sarwar [21] for energy-efficient neural computing. Owing to the same error tolerance, CGRAs are also adopting approximate computing techniques, as in E-ERA [22] proposed by Liu et al. However, previous approximate computing work on CGRAs still lacks accuracy control, array and datapath designs, and corresponding algorithms to make network models hardware-friendly. In this paper, two aspects of approximate-computing-based CGRAs for CNNs are studied: hardware-friendly CNN compression methods, and the design of an approximate-computing-based reconfigurable architecture.

This paper makes the following contributions:

  • 1)

    We design a hardware-friendly compression framework for CNNs, including a dynamic layered CNN structure to reduce computing operations, and a kernel-shrinking method paired with a suitable fast convolution algorithm to reduce the computational complexity of convolution layers;

  • 2)

    Based on the framework, we propose approximate computing units, including a multi-port SRAM LUT based multiplier, a precision-controllable approximate multiplier, and an approximate adder with error correction logic. These computing units significantly improve the energy efficiency of multiplication and addition operations in CNNs;

  • 3)

    We propose the approximate-computing-based reconfigurable architecture ARA for CNNs. ARA is composed of approximate computing units with configurable datapaths for different types of CNNs. Experiments show that ARA achieves high energy efficiency when processing CNNs.
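As background for the fast convolution algorithm referenced in contribution 1, the Winograd minimal filtering algorithm F(2,3) computes two outputs of a 1-D convolution with a 3-tap filter using 4 multiplications instead of the 6 needed by the direct method. Below is a minimal sketch using the standard F(2,3) transform matrices; it illustrates the general technique, not the paper's exact implementation:

```python
def winograd_f23(d, g):
    """F(2,3): two outputs of a 3-tap filter g over a 4-sample tile d."""
    # input transform: V = B^T d (additions only)
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = -d[1] + d[2]
    v3 = d[1] - d[3]
    # filter transform: U = G g (computed once per filter, offline)
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # elementwise products: only 4 multiplications instead of 6
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3
    # output transform: A^T m (additions only)
    return [m0 + m1 + m2, m1 - m2 - m3]

# matches the direct sliding-window result (CNN-style correlation)
winograd_f23([1, 2, 3, 4], [1, 1, 1])  # -> [6.0, 9.0]
```

The nested 2-D version F(2×2, 3×3), commonly used for 3×3 convolution layers, reduces 36 multiplications per output tile to 16 by the same construction.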

Section snippets

Hardware-friendly CNN compression framework

Since different images contain different amounts of information, CNNs of various scales and structures are built, with convolution (CONV) layers performing feature extraction and abstraction and fully connected layers performing classification. In real scenarios, however, the information content of input images cannot be predicted, which makes it time-consuming for accelerators to switch networks across applications, or uneconomical to use unnecessarily large

ARA: approximate computing based reconfigurable architecture

Based on the compression strategies proposed in Section 2, we propose an approximate-computing-based reconfigurable architecture for CNNs. The architecture comprises two types of approximate-computing-based reconfigurable computing arrays: one built from multi-port SRAM LUT based approximate multipliers, the other from precision-controllable processing units.
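The LUT-multiplier idea can be illustrated with a small functional model: products of operands truncated to their top bits are precomputed in a table (standing in for the multi-port SRAM), so a lookup plus a shift replaces the full multiplication. This is only a schematic sketch of the general principle, with an assumed 8-bit width and 4-bit truncation; it does not reproduce the paper's circuit:

```python
def build_lut(bits=4):
    # precomputed products of all truncated operand pairs,
    # modeling the contents of the SRAM lookup table
    return [[x * y for y in range(1 << bits)] for x in range(1 << bits)]

def lut_mult(a, b, lut, width=8, keep=4):
    # keep only the top `keep` bits of each operand, look up
    # their product, and shift back to the full scale
    sh = width - keep
    return lut[a >> sh][b >> sh] << (2 * sh)

lut = build_lut()
approx = lut_mult(200, 100, lut)  # 18432, vs. exact 20000 (~7.8% error)
```

The error comes entirely from operand truncation, so precision is controllable by trading LUT size (2^(2*keep) entries) against the number of discarded low-order bits.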

Implementations and performance

Implemented in a TSMC 45 nm process, the proposed ARA consumes 204 mW at 1.1 V, 200 MHz and 21.1 mW at 0.9 V, 40 MHz, which meets the requirements of embedded systems; the specification of ARA is given in Table 6.

The performance of ARA is evaluated on typical CNNs, including networks for image classification, facial recognition, object detection, and image semantic segmentation. As shown in Table 7, ARA achieves 1.92 TOPS/W in energy efficiency when working at 1.1 V, 200 MHz
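As a quick consistency check, the reported efficiency figures combined with the power numbers from the previous section imply the following absolute throughput (simple arithmetic on the paper's own numbers):

```python
# throughput (TOPS) = energy efficiency (TOPS/W) x power (W)
high_v = 1.92 * 0.204    # 1.1 V, 200 MHz: about 0.392 TOPS
low_v = 3.72 * 0.0211    # 0.9 V, 40 MHz: about 0.078 TOPS
```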

Conclusion

In this paper, an approximate and reconfigurable architecture, ARA, is proposed with heterogeneous neuron processing element arrays, including a multi-port SRAM LUT based approximate multiplier, a precision-controllable approximate multiplier, and improved GeAr approximate adders. First, a CNN compression framework is proposed, including a dynamic layered CNN structure and kernel-shrinking methods with the Winograd algorithm to reduce operations in convolution layers. Then the approximate computing
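A GeAr-style adder, as referenced above, splits a wide addition into overlapping sub-adders so that carry chains stay short; an error appears only when a carry would propagate past the overlap window, which is exactly what error correction logic must detect. The following is a simplified functional model of a generic GeAr(n, R, P) adder for illustration; the paper's improved design with correction logic is not reproduced here:

```python
def gear_add(a, b, n=8, R=2, P=2):
    """Approximate n-bit add: each sub-adder spans L = R + P bits,
    contributes R new result bits, and reuses the P preceding bits
    only to speculate the carry (its own carry-in is taken as 0)."""
    L = R + P
    mask = (1 << L) - 1
    # first sub-adder computes the lowest L result bits
    res = ((a & mask) + (b & mask)) & mask
    pos, i = L, 1
    while pos < n:
        lo = i * R
        s = ((a >> lo) & mask) + ((b >> lo) & mask)
        # keep only the upper R bits as new result bits
        res |= ((s >> P) & ((1 << R) - 1)) << pos
        pos += R
        i += 1
    return res

gear_add(0b00010001, 0b00100010)  # -> 51, exact (short carry chains)
gear_add(0b00001111, 0b00000001)  # -> 0, exact is 16 (carry lost)
```

The second call shows the characteristic failure mode: a carry rippling across more than P bits is dropped, and a correction stage would detect such cases and add the missing carry back in.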

Acknowledgements

This work was supported by the National Science and Technology Major Project (Grant No. 2018ZX01028-101-005), the National Natural Science Foundation of China (Grant Nos. 61404028 and 61771135), and the China Scholarship Council program during a visit of Yu Gong to UCLA.

References (30)

  • X. Fan et al.

    DT-CGRA: Dual-track coarse-grained reconfigurable architecture for stream applications

    Field Program. Logic Appli. (FPL)

    (2016)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proc. IEEE

    (1998)
  • A. Krizhevsky et al.

    Imagenet classification with deep convolutional neural networks

    Adv. Neural Inf. Process. Syst.

    (2012)
  • K. Simonyan et al.
    (2014)
  • C. Szegedy et al.

    Going deeper with convolutions

  • K. He et al.

    Deep residual learning for image recognition

  • A. Yazdanbakhsh et al.

    Neural acceleration for gpu throughput processors

  • L.N. Huynh et al.

    Deepsense: a gpu-based deep convolutional neural network framework on commodity mobile devices

  • H. Nakahara et al.

    An object detector based on multiscale sliding window search using a fully pipelined binarized CNN on an FPGA

  • P. Meloni et al.
    (2017)
  • T. Chen et al.

    Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning

    ACM Sigplan Not.

    (2014)
  • Z. Du et al.

    ShiDianNao: shifting vision processing closer to the sensor

    Comput. Architect. News

    (2015)
  • S. Yin et al.

    A high energy efficient reconfigurable hybrid neural network processor for deep learning applications

    IEEE J. Solid State Circ.

    (2018)
  • S. Yin et al.

    A 1.06-to-5.09 TOPS/W reconfigurable hybrid-neural-network processor for deep learning applications

  • F. Tu et al.

    Deep convolutional neural network architecture with reconfigurable computation patterns

    IEEE Trans. Very Large Scale Integr. Syst.

    (2017)