ABSTRACT
The appealing properties of low area, low power, flexible precision, and high bit-error tolerance have made stochastic computing (SC) a promising alternative to conventional binary arithmetic for many computation-intensive tasks, e.g., convolutional neural networks (CNNs). However, to suppress the intrinsic fluctuation noise of SC, long bit streams are normally required in SC-based CNN accelerators to achieve satisfactory accuracy, which leads to excessive latency. Although bit-parallel SC multiplier structures have been proposed to reduce latency, the resulting overhead still considerably degrades the overall efficiency of SC. In this paper, we optimize both the micro-architecture of the SC multiply-and-accumulate (MAC) unit and the overall acceleration scheme of the CNN accelerator to favor SC. An optimized and scalable SC-MAC unit, which fully exploits the properties of low-discrepancy bit streams, is proposed with adjustable parameters to reduce latency at a minor area cost. For the overall accelerator, the parallel dimensions of the SC-based MAC array are extended to reuse hardware resources and improve throughput, since a judiciously chosen loop-unrolling strategy better benefits SC operations. The proposed CNN accelerator with the extended SC-MAC array is synthesized in TSMC 28nm CMOS and evaluated on several representative CNNs, achieving a 2× performance speedup, 2.8× energy savings, and a 15% area reduction compared to a state-of-the-art SC-based CNN accelerator.
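To make the core SC operation concrete, the sketch below (not the paper's implementation; all names are illustrative) shows unipolar SC multiplication: a value p in [0, 1] is encoded as a bit stream whose ones-density is p, and two values are multiplied with a single AND gate. It also illustrates why low-discrepancy bit streams help: replacing the pseudo-random number source in the stream generator with a low-discrepancy sequence (here a van der Corput/Halton sequence, one possible choice) reduces fluctuation noise at the same stream length.

```python
import random

def van_der_corput(i, base=2):
    # Low-discrepancy sequence value in [0, 1) for index i (radical inverse in the given base).
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def to_stream(p, n, source):
    # Stochastic number generator (SNG): bit i is 1 iff source(i) < p,
    # so the ones-density of the stream encodes p.
    return [1 if source(i) < p else 0 for i in range(n)]

def sc_multiply(pa, pb, n, src_a, src_b):
    # Unipolar SC multiply: bitwise AND of two bit streams;
    # the ones-density of the result estimates pa * pb.
    a = to_stream(pa, n, src_a)
    b = to_stream(pb, n, src_b)
    return sum(x & y for x, y in zip(a, b)) / n

rng = random.Random(0)
# Conventional SNG: independent pseudo-random sources for each operand.
rand_est = sc_multiply(0.4, 0.6, 256, lambda i: rng.random(), lambda i: rng.random())
# Low-discrepancy SNG: different bases keep the two streams uncorrelated.
ld_est = sc_multiply(0.4, 0.6, 256,
                     lambda i: van_der_corput(i, 2),
                     lambda i: van_der_corput(i, 3))
print(rand_est, ld_est)  # both near the exact product 0.24
```

With only 256 bits, the low-discrepancy streams typically land much closer to the exact product than the pseudo-random ones, which is the property the proposed SC-MAC unit exploits to shorten stream length without losing accuracy.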