DOI: 10.1145/3404397.3404407
Research Article

Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures

Published: 17 August 2020

Abstract

With the continuous demand for higher accuracy, deep neural networks have grown significantly in model size. Quantization is one of the most widely used model compression methods and can effectively reduce model size without severe accuracy loss. Modern processors such as ARM CPUs and NVIDIA GPUs already provide support for low-bit arithmetic instructions. However, efficient and practical optimizations of convolution computation at extremely low bit widths are still lacking on ARM CPU (e.g., 2 ∼ 8-bit) and NVIDIA GPU (e.g., 4-bit and 8-bit). This paper explores performance optimization methods for extremely low-bit convolution on diverse architectures. On ARM CPU, we propose two instruction schemes for 2 ∼ 3-bit and 4 ∼ 8-bit convolution with corresponding register allocation methods. In addition, we redesign the GEMM computation with data padding and packing optimizations. We also implement the Winograd algorithm for convolution at specific bit widths (e.g., 4 ∼ 6-bit) to achieve higher performance. On NVIDIA GPU, we propose a data partition mechanism and multi-level memory access optimizations to better adapt the computation to the GPU thread and memory hierarchy. We also propose quantization fusion to eliminate unnecessary data accesses. The experimental results demonstrate that our implementations achieve better performance for extremely low-bit convolution than state-of-the-art frameworks and libraries such as ncnn and cuDNN. To the best of our knowledge, this is the first work that provides efficient implementations of extremely low-bit convolutions covering 2 ∼ 8-bit on ARM CPU and 4-bit/8-bit on NVIDIA GPU.
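One way to see why extremely low-bit arithmetic pays off is the bit-serial formulation used by popcount-based kernels in the literature (cf. Cowan et al. [3]; Tulloch and Jia [30]): operands are split into bit planes, and each pair of planes is combined with AND and a population count, so one 64-bit word processes 64 low-bit values at once. The plain-C sketch below is a generic illustration of that formulation; the function name and data layout are assumptions for illustration, not necessarily the paper's ARM instruction schemes.

    #include <stdint.h>

    /* x_planes[b] holds bit b of 64 consecutive unsigned 2-bit activations,
     * one bit per position of a 64-bit word; w_planes likewise for weights.
     * __builtin_popcountll is a GCC/Clang builtin. */
    static int32_t dot_2bit_bitserial(const uint64_t x_planes[2],
                                      const uint64_t w_planes[2])
    {
        int32_t acc = 0;
        for (int bx = 0; bx < 2; ++bx) {
            for (int bw = 0; bw < 2; ++bw) {
                /* AND keeps positions where both bits are set; popcount
                 * counts them; the shift applies the bit weight 2^(bx+bw). */
                uint64_t both = x_planes[bx] & w_planes[bw];
                acc += (int32_t)__builtin_popcountll(both) << (bx + bw);
            }
        }
        return acc;  /* equals the dot product of the 64 value pairs */
    }

On ARM, AND and per-byte population count exist as NEON instructions (AND, CNT), so this pattern vectorizes naturally at very low bit widths, which is one reason 2 ∼ 3-bit and 4 ∼ 8-bit kernels lend themselves to different instruction schemes.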

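The abstract's "quantization fusion" refers to folding the requantization of 32-bit accumulators back to low-bit outputs into the convolution/GEMM epilogue, rather than running it as a separate pass over memory. A minimal sketch in C, assuming a standard affine quantization scheme; the helper name and parameters are illustrative assumptions, not the paper's API:

    #include <math.h>
    #include <stdint.h>

    /* Hypothetical epilogue helper: requantize one int32 accumulator to a
     * signed 8-bit output value. In a typical affine scheme, scale would be
     * (s_input * s_weight) / s_output. Fusing this into the GEMM epilogue
     * avoids writing the int32 accumulators to memory and re-reading them. */
    static inline int8_t requantize_i8(int32_t acc, float scale,
                                       int32_t zero_point)
    {
        int32_t q = (int32_t)lrintf((float)acc * scale) + zero_point;
        if (q < -128) q = -128;  /* clamp to the int8 output range */
        if (q > 127)  q = 127;
        return (int8_t)q;
    }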
References

[1]
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 579–594.
[2]
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[3]
Meghan Cowan, Thierry Moreau, Tianqi Chen, James Bornholt, and Luis Ceze. 2020. Automatic generation of high-performance quantized machine learning kernels. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. 305–316.
[4]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[5]
Jack J Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S Duff. 1990. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS) 16, 1 (1990), 1–17.
[6]
Marat Dukhan, Yiming Wu, and Hao Lu. 2018. QNNPACK: open source library for optimized mobile deep learning. https://github.com/pytorch/QNNPACK.
[7]
Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. 2020. Learned Step Size Quantization. In International Conference on Learning Representations.
[8]
Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. 2019. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 4852–4861.
[9]
Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135–1143.
[10]
Babak Hassibi and David G Stork. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in neural information processing systems. 164–171.
[11]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[12]
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708.
[13]
Intel. 2016. Deep Neural Network Library. https://github.com/intel/mkl-dnn.
[14]
Benoit Jacob et al. 2017. gemmlowp: a small self-contained low-precision GEMM library. (2017).
[15]
Heehoon Kim, Hyoungwook Nam, Wookeun Jung, and Jaejin Lee. 2017. Performance analysis of CNN frameworks for GPUs. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 55–64.
[16]
Liangzhen Lai, Naveen Suda, and Vikas Chandra. 2018. CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs. arXiv preprint arXiv:1801.06601 (2018).
[17]
Andrew Lavin and Scott Gray. 2016. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4013–4021.
[18]
Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. 2019. Fully Quantized Network for Object Detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19]
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision.
[20]
Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad E Alsaadi. 2017. A survey of deep neural network architectures and their applications. Neurocomputing 234 (2017), 11–26.
[21]
Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. 2018. Nvidia tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 522–531.
[22]
Szymon Migacz. 2017. 8-bit inference with TensorRT. In GPU Technology Conference.
[23]
nihui et al. 2017. ncnn. https://github.com/Tencent/ncnn.
[24]
NVIDIA. 2019. NVIDIA Nsight Compute. https://developer.nvidia.com/nsight-compute.
[25]
NVIDIA. 2017. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass.
[26]
NVIDIA. 2019. PTX: Parallel Thread Execution ISA Version 6.5. NVIDIA Corporation, November 2019.
[27]
Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, et al. 2019. MLPerf inference benchmark. arXiv preprint arXiv:1911.02549 (2019).
[28]
Günther Schindler, Manfred Mücke, and Holger Fröning. 2017. Linking application description with efficient SIMD code generation for low-precision signed-integer GEMM. In European Conference on Parallel Processing. Springer, 688–699.
[29]
SoftBank. 2017. Q4 2016 Roadshow Slides - Arm. (2017).
[30]
Andrew Tulloch and Yangqing Jia. 2017. High performance ultra-low-precision convolutions on mobile devices. arXiv preprint arXiv:1712.02427 (2017).
[31]
Yaman Umuroglu and Magnus Jahre. 2017. Towards efficient quantized neural network inference on mobile devices: work-in-progress. In Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion. 1–2.
[32]
Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. 2014. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580 (2014).
[33]
Yudong Wu, Yichao Wu, Ruihao Gong, Yuanhao Lv, Ken Chen, Ding Liang, Xiaolin Hu, Xianglong Liu, and Junjie Yan. 2020. Rotation Consistent Margin Loss for Efficient Low-bit Face Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34]
Maohua Zhu, Tao Zhang, Zhenyu Gu, and Yuan Xie. 2019. Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern gpus. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 359–371.



Published In

ICPP '20: Proceedings of the 49th International Conference on Parallel Processing
August 2020
844 pages
ISBN: 9781450388160
DOI: 10.1145/3404397
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 August 2020

Author Tags

  1. ARM CPU
  2. Computation Optimization
  3. Extremely Low-bit Convolution
  4. NVIDIA GPU
  5. Quantized Neural Network

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '20

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By

  • (2024) LSTM Gate Disclosure as an Embedded AI Methodology for Wearable Fall-Detection Sensors. Symmetry 16, 10 (2024), 1296. DOI: 10.3390/sym16101296. Online publication date: 2-Oct-2024.
  • (2024) Quantization with Gate Disclosure for Embedded Artificial Intelligence Applied to Fall Detection. In Proceedings of the 2024 International Conference on Information Technology for Social Good, 84–87. DOI: 10.1145/3677525.3678644. Online publication date: 4-Sep-2024.
  • (2024) High-Performance 3D Convolution on the Latest Generation Sunway Processor. In Proceedings of the 53rd International Conference on Parallel Processing, 241–251. DOI: 10.1145/3673038.3673093. Online publication date: 12-Aug-2024.
  • (2024) IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs. IEEE Transactions on Parallel and Distributed Systems 35, 9 (2024), 1672–1689. DOI: 10.1109/TPDS.2024.3432579. Online publication date: Sep-2024.
  • (2023) DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 4656–4664. DOI: 10.1109/CVPRW59228.2023.00491. Online publication date: Jun-2023.
  • (2022) Design and Implementation of 2D Convolution on x86/x64 Processors. IEEE Transactions on Parallel and Distributed Systems 33, 12 (2022), 3800–3815. DOI: 10.1109/TPDS.2022.3171471. Online publication date: 1-Dec-2022.
  • (2022) Optimization of General Matrix Multiply Library for Ternary Weight for Fast DNN Inference. Journal of Signal Processing Systems 94, 10 (2022), 929–943. DOI: 10.1007/s11265-022-01782-3. Online publication date: 1-Oct-2022.
  • (2021) LIBSHALOM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1–14. DOI: 10.1145/3458817.3476217. Online publication date: 14-Nov-2021.
  • (2021) TernGEMM: GEneral Matrix Multiply Library with Ternary Weights for Fast DNN Inference. In 2021 IEEE Workshop on Signal Processing Systems (SiPS), 111–116. DOI: 10.1109/SiPS52927.2021.00028. Online publication date: Oct-2021.
  • (2021) UNIT. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, 77–89. DOI: 10.1109/CGO51591.2021.9370330. Online publication date: 27-Feb-2021.
