ABSTRACT
Convolutional neural networks (CNNs) have demonstrated state-of-the-art accuracy in image classification and object detection, owing to the growth of available data and hardware computing capacity. However, this achievement depends heavily on the floating-point computing capability of the device's DSP blocks, which increases its power dissipation and cost. To address this problem, we made the first attempt to implement a CNN computing accelerator based on shift operations on an FPGA. In this accelerator, an efficient Incremental Network Quantization (INQ) method compresses the CNN model from full precision to 4-bit integers whose values are either zero or powers of two. The multiply-and-accumulate (MAC) operations of the convolutional and fully-connected layers are then converted to shift-and-accumulate (SAC) operations, which can be implemented directly in the logic elements of the FPGA; consequently, the parallelism of the CNN inference process can be further expanded. For the SqueezeNet model, single-image processing latency was 0.673 ms on an Intel Arria 10 FPGA (Inspur F10A board), a slightly better result than on an NVIDIA Tesla P4, and the compute capacity of the FPGA increased by at least 1.77 times.
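The MAC-to-SAC conversion described above works because every INQ-quantized weight is either zero or a signed power of two, so each multiplication collapses to a bit shift and a sign. A minimal Python sketch of this idea (the function names and the encoding details are illustrative assumptions, not the paper's exact 4-bit scheme; real INQ weights are typically fractional powers of two, which would be absorbed into a fixed-point scale factor):

```python
def encode_pow2(w):
    """Encode a weight that is zero or a signed power of two
    as (sign, exponent), with sign in {-1, 0, +1}."""
    if w == 0:
        return 0, 0
    sign = 1 if w > 0 else -1
    exp = abs(w).bit_length() - 1  # exact log2 for powers of two
    return sign, exp

def mac_dot(acts, weights):
    # conventional multiply-and-accumulate
    return sum(a * w for a, w in zip(acts, weights))

def sac_dot(acts, weights):
    # shift-and-accumulate: each multiply becomes a left shift,
    # which an FPGA can realize in plain logic elements (no DSP multiplier)
    acc = 0
    for a, w in zip(acts, weights):
        sign, exp = encode_pow2(w)
        acc += sign * (a << exp)  # a * (+/- 2**exp) without multiplying
    return acc

acts    = [3, -5, 7, 2]   # fixed-point integer activations
weights = [2, -4, 0, 8]   # INQ-style weights: 0 or +/-2**k
assert sac_dot(acts, weights) == mac_dot(acts, weights)  # both give 42
```

On hardware, the (sign, exponent) pair is what the 4-bit weight stores, so the shift amount is read directly from memory and no decode of a full-precision weight is ever needed.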
Index Terms: A Deep Learning Inference Accelerator Based on Model Compression on FPGA