DOI: 10.1145/3490422.3502364
Research Article | Public Access

FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization

Published: 11 February 2022

Abstract

With the trend toward deploying Deep Neural Network (DNN) inference models on edge devices with limited resources, quantization techniques have been widely used to reduce on-chip storage and improve computation throughput. However, existing DNN quantization work that deploys quantization below 8 bits either suffers from evident accuracy loss or faces a large gap between the theoretical improvement in computation throughput and the practical inference speedup. In this work, we propose a general framework, called FILM-QNN, to quantize and accelerate multiple DNN models across different embedded FPGA devices. First, we propose a novel intra-layer, mixed-precision quantization algorithm that assigns different precisions to the filters of each layer. The candidate precision levels and the assignment granularity are determined through an empirical study so as to preserve accuracy while improving hardware parallelism. Second, we apply multiple optimization techniques to the FPGA accelerator architecture in support of quantized computations, including DSP packing, weight reordering, and data packing, to enhance the overall throughput within the available resources. Moreover, a comprehensive resource model is developed to balance the allocation of FPGA computation resources (LUTs and DSPs) as well as data-transfer and on-chip storage resources (BRAMs) when accelerating the mixed-precision computations within each layer. Finally, to improve the portability of FILM-QNN, we implement it using Vivado High-Level Synthesis (HLS) on Xilinx PYNQ-Z2 and ZCU102 FPGA boards. Our experimental results for ResNet-18, ResNet-50, and MobileNet-V2 demonstrate that the intra-layer, mixed-precision implementations (95% of weights in 4-bit, 5% in 8-bit, and all activations in 5-bit) achieve accuracy comparable to the 8-bit (and 32-bit) versions (70.47%, 77.25%, and 65.67% for the three models) and throughput comparable to the 4-bit designs (214.8 FPS, 109.1 FPS, and 537.9 FPS on ZCU102).
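To make the intra-layer scheme concrete, the sketch below shows one plausible reading of the per-filter precision assignment: quantize every filter of a layer at 4 bits, measure the reconstruction error, and promote the most error-prone ~5% of filters to 8 bits. This is a minimal NumPy illustration only; the `quantize_symmetric` helper, the per-filter MSE criterion, and applying the 95%/5% split per layer are assumptions inferred from the abstract, not the paper's exact algorithm.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if np.abs(w).max() > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def intra_layer_mixed_quantize(weights, frac_8bit=0.05):
    """weights: conv filters shaped (out_channels, in_channels, k, k).
    Keeps ~95% of filters at 4 bits; the filters that are hardest to
    quantize (largest 4-bit reconstruction MSE) are promoted to 8 bits."""
    n_8bit = max(1, round(frac_8bit * weights.shape[0]))
    errors = np.array([np.mean((f - quantize_symmetric(f, 4)) ** 2)
                       for f in weights])
    high = set(np.argsort(errors)[-n_8bit:].tolist())  # most sensitive filters
    quantized = np.stack([quantize_symmetric(f, 8 if i in high else 4)
                          for i, f in enumerate(weights)])
    return quantized, high  # indices of 8-bit filters, e.g. for reordering
```

The DSP-packing optimization can likewise be shown in plain integer arithmetic: two low-precision weights are placed in disjoint bit fields of one wide multiplier operand, so a single multiplication yields both products, in the spirit of the Xilinx INT8/INT4 DSP white papers. The field width `S` and the sign-borrow correction below are illustrative assumptions, not the paper's exact hardware mapping:

```python
def packed_two_products(w_hi, w_lo, a, S=13):
    """Compute (w_hi * a, w_lo * a) with ONE wide multiply, emulating two
    4-bit weights sharing a single FPGA DSP multiplier input.
    Assumes |w * a| < 2**(S - 1) so the products occupy disjoint fields."""
    packed = (w_hi << S) + w_lo               # two weights in one operand
    p = packed * a                            # the single multiplication
    lo = p & ((1 << S) - 1)                   # low field: w_lo * a mod 2^S
    if lo >= 1 << (S - 1):                    # low product was negative:
        return (p >> S) + 1, lo - (1 << S)    # undo the borrow it caused
    return p >> S, lo

# Two signed 4-bit weights against one 5-bit activation:
assert packed_two_products(-3, 7, 21) == (-3 * 21, 7 * 21)
assert packed_two_products(5, -8, 31) == (5 * 31, -8 * 31)
```

In hardware the analogous correction is a cheap post-add on the DSP output; we read the weight-reordering and data-packing steps as what lines filters up so that those sharing a DSP slot also share activation operands, though the exact dataflow is detailed only in the full paper.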

Supplementary Material

MP4 File (FPGA22-fpgafp084.mp4)
Presentation video for FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization




      Published In

      FPGA '22: Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
      February 2022
      211 pages
      ISBN: 9781450391498
      DOI: 10.1145/3490422


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. deep learning
      2. fpga
      3. hardware acceleration
      4. mixed-precision quantization
      5. model compression

      Qualifiers

      • Research-article

      Conference

      FPGA '22

      Acceptance Rates

      Overall Acceptance Rate 125 of 627 submissions, 20%


      Article Metrics

      • Downloads (Last 12 months)1,106
      • Downloads (Last 6 weeks)127
      Reflects downloads up to 03 Mar 2025

      Cited By
      • (2025) VCONV: A Convolutional Neural Network Accelerator for FPGAs. Electronics 14(4):657. DOI: 10.3390/electronics14040657. Online: 8-Feb-2025
      • (2024) LDF-BNN: A Real-Time and High-Accuracy Binary Neural Network Accelerator Based on the Improved BNext. Micromachines 15(10):1265. DOI: 10.3390/mi15101265. Online: 17-Oct-2024
      • (2024) AMED: Automatic Mixed-Precision Quantization for Edge Devices. Mathematics 12(12):1810. DOI: 10.3390/math12121810. Online: 11-Jun-2024
      • (2024) Auto WS: Automate Weights Streaming in Layer-Wise Pipelined DNN Accelerators. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 1-6. DOI: 10.23919/DATE58400.2024.10546621. Online: 25-Mar-2024
      • (2024) Efficient Deep Learning Infrastructures for Embedded Computing Systems: A Comprehensive Survey and Future Envision. ACM Transactions on Embedded Computing Systems 24(1):1-100. DOI: 10.1145/3701728. Online: 24-Oct-2024
      • (2024) HyBNN: Quantifying and Optimizing Hardware Efficiency of Binary Neural Networks. ACM Transactions on Reconfigurable Technology and Systems 17(2):1-24. DOI: 10.1145/3631610. Online: 30-Apr-2024
      • (2024) A Design Framework for Generating Energy-Efficient Accelerator on FPGA Toward Low-Level Vision. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 32(8):1485-1497. DOI: 10.1109/TVLSI.2024.3409649. Online: Aug-2024
      • (2024) Mobile-X: Dedicated FPGA Implementation of the MobileNet Accelerator Optimizing Depthwise Separable Convolution. IEEE Transactions on Circuits and Systems II: Express Briefs 71(11):4668-4672. DOI: 10.1109/TCSII.2024.3440884. Online: Nov-2024
      • (2024) A 119.64 GOPs/W FPGA-Based ResNet50 Mixed-Precision Accelerator Using the Dynamic DSP Packing. IEEE Transactions on Circuits and Systems II: Express Briefs 71(5):2554-2558. DOI: 10.1109/TCSII.2024.3377356. Online: May-2024
      • (2024) CASCADE: A Framework for CNN Accelerator Synthesis With Concatenation and Refreshing Dataflow. IEEE Transactions on Circuits and Systems I: Regular Papers 71(11):5235-5248. DOI: 10.1109/TCSI.2024.3452954. Online: Nov-2024
      • Show More Cited By
