
BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization

Published: 13 September 2021

Abstract

Field-programmable Gate Arrays (FPGAs) are a high-performance computing platform for Convolutional Neural Network (CNN) inference. The Winograd algorithm, weight pruning, and quantization are widely adopted to reduce the storage and arithmetic overhead of CNNs on FPGAs. Recent studies strive to prune the weights in the Winograd domain; however, the resulting sparse patterns are irregular, leading to low parallelism and reduced resource utilization. In addition, few works discuss a suitable quantization scheme for Winograd convolution.
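
For readers unfamiliar with the Winograd algorithm referenced above, the following minimal NumPy sketch (not taken from the paper) shows the standard F(2x2, 3x3) tile computation from Lavin and Gray [11]: a 4x4 input tile and a 3x3 filter are mapped into the Winograd domain, multiplied element-wise (16 multiplications instead of the 36 needed by direct correlation), and transformed back to a 2x2 output tile.

    import numpy as np

    # Standard F(2,3) transform matrices (Lavin and Gray [11]).
    B_T = np.array([[1,  0, -1,  0],
                    [0,  1,  1,  0],
                    [0, -1,  1,  0],
                    [0,  1,  0, -1]], dtype=float)
    G = np.array([[1.0,  0.0, 0.0],
                  [0.5,  0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0,  0.0, 1.0]])
    A_T = np.array([[1, 1,  1,  0],
                    [0, 1, -1, -1]], dtype=float)

    def winograd_f2x2_3x3(d, g):
        """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g."""
        U = G @ g @ G.T         # filter transformed into the Winograd domain (4x4)
        V = B_T @ d @ B_T.T     # input tile transformed into the Winograd domain (4x4)
        M = U * V               # 16 element-wise multiplications instead of 36
        return A_T @ M @ A_T.T  # inverse transform back to the spatial domain (2x2)

    # Sanity check against direct (valid) correlation on one tile.
    rng = np.random.default_rng(0)
    d, g = rng.standard_normal((4, 4)), rng.standard_normal((3, 3))
    direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
    assert np.allclose(winograd_f2x2_3x3(d, g), direct)
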
In this article, we propose a regular sparse pruning pattern for Winograd-based CNNs, namely the Sub-row-balanced Sparsity (SRBS) pattern, to overcome the challenge of irregular sparse patterns. We then develop a two-step hardware co-optimization approach to improve model accuracy with the SRBS pattern. Based on the pruned model, we apply mixed-precision quantization to further reduce the computational complexity of bit operations. Finally, we design an FPGA accelerator that exploits both the SRBS pattern, to eliminate low-parallelism computation and irregular memory accesses, and the mixed-precision quantization, to obtain a layer-wise bit width. Experimental results on VGG16/VGG-nagadomi with CIFAR-10 and ResNet-18/34/50 with ImageNet show up to 11.8×/8.67× and 8.17×/8.31×/10.6× speedup and 12.74×/9.19× and 8.75×/8.81×/11.1× energy-efficiency improvement, respectively, compared with the state-of-the-art dense Winograd accelerator [20], with negligible loss of model accuracy. We also show that our design achieves a 4.11× speedup over the state-of-the-art sparse Winograd accelerator [19] on VGG16.
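
The precise definition of the SRBS pattern is given in the body of the article; the toy sketch below (with a hypothetical srbs_prune function and parameters of our own naming) only illustrates the general idea suggested by the name and by the related bank-balanced sparsity of [2]: each row of a Winograd-domain weight matrix is split into equal-length sub-rows, and every sub-row keeps the same number of largest-magnitude weights, so the nonzero count per sub-row is identical and hardware processing elements receive balanced workloads.

    import numpy as np

    def srbs_prune(W, sub_row_len=4, k=2):
        """Toy sub-row-balanced pruning (illustrative only): keep the k
        largest-magnitude weights in every length-`sub_row_len` sub-row of a
        2D weight matrix W, so every sub-row ends up with the same number of
        nonzeros."""
        rows, cols = W.shape
        assert cols % sub_row_len == 0, "row length must divide into sub-rows"
        mask = np.zeros_like(W, dtype=bool)
        for r in range(rows):
            for c0 in range(0, cols, sub_row_len):
                sub = np.abs(W[r, c0:c0 + sub_row_len])
                keep = np.argsort(sub)[-k:]      # indices of the k largest magnitudes
                mask[r, c0 + keep] = True
        return W * mask

    # Example: 50% sparsity with exactly 2 nonzeros in each length-4 sub-row.
    W = np.random.default_rng(1).standard_normal((4, 8))
    print(srbs_prune(W, sub_row_len=4, k=2))
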

References

[1]
Wikipedia. 2018. Sparse matrix. Retrieved from https://en.wikipedia.org/wiki/Sparse_matrix.
[2]
Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan et al. 2019. Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity. In Proceedings of the ACM/SIGDA International Symposium on Field-programmable Gate Arrays. ACM, 63–72.
[3]
Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I.-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. 2018. PACT: Parameterized clipping activation for quantized neural networks. Retrieved from http://arxiv.org/abs/1805.06085.
[4]
L. Deng, G. Li, S. Han, L. Shi, and Y. Xie. 2020. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 108, 4 (2020), 485–532.
[5]
Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, and Huazhong Yang. 2019. [DL] A survey of FPGA-based neural network inference accelerators. ACM Trans. Reconfigurable Technol. Syst. 12, 1, Article 2 (Mar. 2019), 26 pages.
[6]
P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi. 2018. Ristretto: A framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Trans. Neural Netw. Learn. Syst. 29, 11 (2018), 5784–5789.
[7]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. SIGARCH Comput. Archit. News 44, 3 (June 2016), 243–254.
[8]
Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations (ICLR'16).
[9]
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'18).
[10]
Alex Krizhevsky. 2012. Learning multiple layers of features from tiny images. Technical report, University of Toronto.
[11]
Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16).
[12]
G. Li, L. Liu, X. Wang, X. Ma, and X. Feng. 2020. LANCE: Efficient low-precision quantized Winograd convolution for neural networks based on graphics processing units. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'20). 3842–3846.
[13]
Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2017. Pruning filters for efficient convnets. In Proceedings of the 5th International Conference on Learning Representations (ICLR'17).
[14]
Sheng R. Li, Jongsoo Park, and Ping Tak Peter Tang. 2017. Enabling sparse Winograd convolution by native pruning. Retrieved from http://arxiv.org/abs/1702.08597.
[15]
Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. Retrieved from http://arxiv.org/abs/1806.09055.
[16]
Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad E. Alsaadi. 2017. A survey of deep neural network architectures and their applications. Neurocomputing 234 (2017), 11–26.
[17]
Xingyu Liu, Jeff Pool, Song Han, and William J. Dally. 2018. Efficient sparse-Winograd convolutional neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR'18).
[18]
Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV'17). 2755–2763.
[19]
L. Lu and Y. Liang. 2018. SpWA: An efficient sparse Winograd convolutional neural network accelerator on FPGAs. In Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC'18). 1–6.
[20]
L. Lu, Y. Liang, Q. Xiao, and S. Yan. 2017. Evaluating fast algorithms for convolutional neural networks on FPGAs. In Proceedings of the IEEE 25th Annual International Symposium on Field-programmable Custom Computing Machines (FCCM'17). 101–108.
[21]
Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. 2016. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, 4898–4906. Retrieved from https://proceedings.neurips.cc/paper/2016/file/c8067ad1937f728f51288b3eb986afaa-Paper.pdf.
[22]
H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally. 2017. Exploring the granularity of sparsity in convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'17). 1927–1934.
[23]
Nagadomi. 2014. Code for kaggle-cifar10 competition, 5th place. Retrieved from https://github.com/nagadomi/kaggle-cifar10-torch7.
[24]
E. Park, D. Kim, and S. Yoo. 2018. Energy-efficient neural network accelerator based on outlier-aware low-precision computation. In Proceedings of the ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA'18). 688–698.
[25]
Adam Paszke et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, 8024–8035. Retrieved from http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[26]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. 2014. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115 (2014), 211–252.
[27]
Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR'15).
[28]
C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15). 1–9.
[29]
Yaman Umuroglu and Magnus Jahre. 2017. Streamlined deployment for quantized neural networks. Retrieved from http://arxiv.org/abs/1709.04060.
[30]
Haonan Wang, Wenjian Liu, Tianyi Xu, Jun Lin, and Zhongfeng Wang. 2019. A low-latency sparse-winograd accelerator for convolutional neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'19). IEEE, 1448–1452.
[31]
Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'19).
[32]
D. Williamson. 1991. Dynamically scaled fixed point arithmetic. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing. 315–318, Vol. 1.
[33]
T. Yang, Y. Liao, J. Shi, Y. Liang, N. Jing, and L. Jiang. 2020. A Winograd-based CNN accelerator with a fine-grained regular sparsity pattern. In Proceedings of the 30th International Conference on Field-programmable Logic and Applications (FPL'20). 254–261.
[34]
Haibao Yu, Qi Han, Jianbo Li, Jianping Shi, Guangliang Cheng, and Bin Fan. 2020. Search what you want: Barrier penalty NAS for mixed precision quantization. Retrieved from http://arxiv.org/abs/2007.10026.
[35]
Jiecao Yu, Jongsoo Park, and Maxim Naumov. 2019. Spatial-Winograd pruning enabling sparse Winograd convolution. Retrieved from http://arxiv.org/abs/1901.02132.
[36]
Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. 2017. Incremental network quantization: Towards lossless CNNs with low-precision weights. Retrieved from http://arxiv.org/abs/1702.03044.
[37]
Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. Retrieved from http://arxiv.org/abs/1606.06160.
[38]
Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2016. Trained ternary quantization. Retrieved from http://arxiv.org/abs/1612.01064.

      Published In

      ACM Transactions on Reconfigurable Technology and Systems, Volume 14, Issue 4
      December 2021
      165 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/3483341
      Editor: Deming Chen

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 September 2021
      Accepted: 01 May 2021
      Revised: 01 February 2021
      Received: 01 October 2020
      Published in TRETS Volume 14, Issue 4

      Author Tags

      1. Winograd fast algorithm
      2. convolutional neural networks
      3. hardware-friendly sparsity
      4. mixed precision quantization
      5. FPGA

      Qualifiers

      • Research-article
      • Refereed

      Cited By

      • (2025) A Precision-Scalable Sparse CNN Accelerator with Fine-Grained Mixed Bitwidth Configurability. IEICE Electronics Express. DOI: 10.1587/elex.22.20240601. Online publication date: 2025.
      • (2024) Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs. ACM Transactions on Architecture and Code Optimization 21, 2, 1–24. DOI: 10.1145/3643682. Online publication date: 23 March 2024.
      • (2024) Advances in Pruning and Quantization for Natural Language Processing. IEEE Access 12, 139113–139128. DOI: 10.1109/ACCESS.2024.3465631. Online publication date: 2024.
      • (2023) DTATrans: Leveraging Dynamic Token-Based Quantization with Accuracy Compensation Mechanism for Efficient Transformer Architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42, 2, 509–520. DOI: 10.1109/TCAD.2022.3181541. Online publication date: February 2023.
