DOI: 10.1145/3404397.3404407
Research Article

Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures

Published: 17 August 2020

Abstract

With the continuous demand for higher accuracy, deep neural networks have grown significantly in model size. Quantization is one of the most widely used model compression methods and can effectively reduce model size without severe accuracy loss. Modern processors such as ARM CPUs and NVIDIA GPUs already provide support for low-bit arithmetic instructions. However, efficient and practical optimizations of convolution computation at extremely low bit widths are still lacking on ARM CPU (e.g., 2 ∼ 8-bit) and NVIDIA GPU (e.g., 4-bit and 8-bit). This paper explores performance optimization methods for extremely low-bit convolution on diverse architectures. On ARM CPU, we propose two instruction schemes for 2 ∼ 3-bit and 4 ∼ 8-bit convolution with corresponding register allocation methods. In addition, we redesign the GEMM computation with data padding and packing optimizations. We also implement the Winograd algorithm for convolution at specific bit widths (e.g., 4 ∼ 6-bit) to achieve higher performance. On NVIDIA GPU, we propose a data partition mechanism and multi-level memory access optimizations to better adapt the computation to the GPU thread and memory hierarchy. We also propose quantization fusion to eliminate unnecessary data accesses. The experimental results demonstrate that our implementations achieve better performance for extremely low-bit convolution than state-of-the-art frameworks and libraries such as ncnn and cuDNN. To the best of our knowledge, this is the first work that provides efficient implementations of extremely low-bit convolutions covering 2 ∼ 8-bit on ARM CPU and 4-bit/8-bit on NVIDIA GPU.
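One way to see why extremely low-bit arithmetic pays off is the bit-serial formulation used by popcount-based kernels in the literature (cf. Cowan et al. [3]; Tulloch and Jia [30]): operands are split into bit planes, and each pair of planes is combined with AND and a population count, so one 64-bit word processes 64 low-bit values at once. The plain-C sketch below is a generic illustration of that formulation; the function name and data layout are assumptions for illustration, not necessarily the paper's ARM instruction schemes.

    #include <stdint.h>

    /* x_planes[b] holds bit b of 64 consecutive unsigned 2-bit activations,
     * one bit per position of a 64-bit word; w_planes likewise for weights.
     * __builtin_popcountll is a GCC/Clang builtin. */
    static int32_t dot_2bit_bitserial(const uint64_t x_planes[2],
                                      const uint64_t w_planes[2])
    {
        int32_t acc = 0;
        for (int bx = 0; bx < 2; ++bx) {
            for (int bw = 0; bw < 2; ++bw) {
                /* AND keeps positions where both bits are set; popcount
                 * counts them; the shift applies the bit weight 2^(bx+bw). */
                uint64_t both = x_planes[bx] & w_planes[bw];
                acc += (int32_t)__builtin_popcountll(both) << (bx + bw);
            }
        }
        return acc;  /* equals the dot product of the 64 value pairs */
    }

On ARM, AND and per-byte population count exist as NEON instructions (AND, CNT), so this pattern vectorizes naturally at very low bit widths, which is one reason 2 ∼ 3-bit and 4 ∼ 8-bit kernels lend themselves to different instruction schemes.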

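The abstract's "quantization fusion" refers to folding the requantization of 32-bit accumulators back to low-bit outputs into the convolution/GEMM epilogue, rather than running it as a separate pass over memory. A minimal sketch in C, assuming a standard affine quantization scheme; the helper name and parameters are illustrative assumptions, not the paper's API:

    #include <math.h>
    #include <stdint.h>

    /* Hypothetical epilogue helper: requantize one int32 accumulator to a
     * signed 8-bit output value. In a typical affine scheme, scale would be
     * (s_input * s_weight) / s_output. Fusing this into the GEMM epilogue
     * avoids writing the int32 accumulators to memory and re-reading them. */
    static inline int8_t requantize_i8(int32_t acc, float scale,
                                       int32_t zero_point)
    {
        int32_t q = (int32_t)lrintf((float)acc * scale) + zero_point;
        if (q < -128) q = -128;  /* clamp to the int8 output range */
        if (q > 127)  q = 127;
        return (int8_t)q;
    }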
References

[1]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 579–594.
[2]
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[3]
Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Luis Ceze. 2020. Automatic generation of high-performance quantized machine learning kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. 305–316.
[4]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[5]
Jack J Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S Duff. 1990. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS) 16, 1 (1990), 1–17.
[6]
Marat Dukhan, Yiming Wu, and Hao Lu. 2018. QNNPACK: open source library for optimized mobile deep learning. https://github.com/pytorch/QNNPACK.
[7]
Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. 2020. Learned Step Size Quantization. In International Conference on Learning Representations.
[8]
Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. 2019. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 4852–4861.
[9]
Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135–1143.
[10]
Babak Hassibi and David G Stork. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems. 164–171.
[11]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[12]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708.
[13]
Intel. 2016. Deep Neural Network Library. https://github.com/intel/mkl-dnn.
[14]
Benoit Jacob et al. 2017. gemmlowp: a small self-contained low-precision GEMM library. (2017).
[15]
Heehoon Kim, Hyoungwook Nam, Wookeun Jung, and Jaejin Lee. 2017. Performance analysis of CNN frameworks for GPUs. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 55–64.
[16]
Liangzhen Lai, Naveen Suda, and Vikas Chandra. 2018. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601 (2018).
[17]
Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
[18]
Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. 2019. Fully Quantized Network for Object Detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19]
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision.
[20]
Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad E Alsaadi. 2017. A survey of deep neural network architectures and their applications. Neurocomputing 234 (2017), 11–26.
[21]
Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. 2018. Nvidia tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 522–531.
[22]
Szymon Migacz. 2017. 8-bit inference with TensorRT. In GPU Technology Conference.
[23]
nihui et al. 2017. ncnn. https://github.com/Tencent/ncnn.
[24]
NVIDIA. 2019. NVIDIA Nsight Compute. https://developer.nvidia.com/nsight-compute.
[25]
NVIDIA. 2017. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass.
[26]
NVIDIA. 2019. PTX: Parallel Thread Execution ISA Version 6.5. NVIDIA Corporation, November 2019.
[27]
Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, et al. 2019. MLPerf inference benchmark. arXiv preprint arXiv:1911.02549 (2019).
[28]
Günther Schindler, Manfred Mücke, and Holger Fröning. 2017. Linking application description with efficient SIMD code generation for low-precision signed-integer GEMM. In European Conference on Parallel Processing. Springer, 688–699.
[29]
SoftBank. 2017. Q4 2016 Roadshow Slides - Arm. (2017).
[30]
Andrew Tulloch and Yangqing Jia. 2017. High performance ultra-low-precision convolutions on mobile devices. arXiv preprint arXiv:1712.02427 (2017).
[31]
Yaman Umuroglu and Magnus Jahre. 2017. Towards efficient quantized neural network inference on mobile devices: work-in-progress. In Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion. 1–2.
[32]
Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[33]
Yudong Wu, Yichao Wu, Ruihao Gong, Yuanhao Lv, Ken Chen, Ding Liang, Xiaolin Hu, Xianglong Liu, and Junjie Yan. 2020. Rotation Consistent Margin Loss for Efficient Low-bit Face Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34]
Maohua Zhu, Tao Zhang, Zhenyu Gu, and Yuan Xie. 2019. Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern gpus. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 359–371.



Published In

ICPP '20: Proceedings of the 49th International Conference on Parallel Processing
August 2020
844 pages
ISBN: 9781450388160
DOI: 10.1145/3404397
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 August 2020

Author Tags

  1. ARM CPU
  2. Computation Optimization
  3. Extremely Low-bit Convolution
  4. NVIDIA GPU
  5. Quantized Neural Network

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '20

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By

  • (2024) LSTM Gate Disclosure as an Embedded AI Methodology for Wearable Fall-Detection Sensors. Symmetry 16, 10 (2024), 1296. DOI: 10.3390/sym16101296. Online publication date: 2-Oct-2024.
  • (2024) Quantization with Gate Disclosure for Embedded Artificial Intelligence Applied to Fall Detection. In Proceedings of the 2024 International Conference on Information Technology for Social Good, 84–87. DOI: 10.1145/3677525.3678644. Online publication date: 4-Sep-2024.
  • (2024) High-Performance 3D Convolution on the Latest Generation Sunway Processor. In Proceedings of the 53rd International Conference on Parallel Processing, 241–251. DOI: 10.1145/3673038.3673093. Online publication date: 12-Aug-2024.
  • (2024) IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs. IEEE Transactions on Parallel and Distributed Systems 35, 9 (2024), 1672–1689. DOI: 10.1109/TPDS.2024.3432579. Online publication date: Sep-2024.
  • (2023) DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 4656–4664. DOI: 10.1109/CVPRW59228.2023.00491. Online publication date: Jun-2023.
  • (2022) Design and Implementation of 2D Convolution on x86/x64 Processors. IEEE Transactions on Parallel and Distributed Systems 33, 12 (2022), 3800–3815. DOI: 10.1109/TPDS.2022.3171471. Online publication date: 1-Dec-2022.
  • (2022) Optimization of General Matrix Multiply Library for Ternary Weight for Fast DNN Inference. Journal of Signal Processing Systems 94, 10 (2022), 929–943. DOI: 10.1007/s11265-022-01782-3. Online publication date: 1-Oct-2022.
  • (2021) LIBSHALOM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–14. DOI: 10.1145/3458817.3476217. Online publication date: 14-Nov-2021.
  • (2021) TernGEMM: GEneral Matrix Multiply Library with Ternary Weights for Fast DNN Inference. In 2021 IEEE Workshop on Signal Processing Systems (SiPS), 111–116. DOI: 10.1109/SiPS52927.2021.00028. Online publication date: Oct-2021.
  • (2021) UNIT. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, 77–89. DOI: 10.1109/CGO51591.2021.9370330. Online publication date: 27-Feb-2021.
