Design Space Exploration of Layer-Wise Mixed-Precision Quantization with Tightly Integrated Edge Inference Units

ABSTRACT
Layer-wise mixed-precision quantization (MPQ) has become prevalent for edge inference since it strikes a better balance between accuracy and efficiency than uniform quantization. However, existing MPQ strategies either lack hardware awareness or incur prohibitive computation costs, which hinders their deployment at the edge. In this work, we propose a novel MPQ search algorithm that obtains an optimal scheme by "sampling" layer-wise sensitivity with respect to a newly proposed metric that incorporates both accuracy and a proxy for hardware cost. To further deploy post-training MPQ efficiently on edge chips, we propose to tightly integrate the quantized inference units into the processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Evaluation results show that the proposed search algorithm achieves 3%~11% higher inference accuracy at similar hardware cost compared to state-of-the-art MPQ strategies. In addition, the tightly integrated MPQ units achieve speedups of 15.13x~29.65x over a baseline RISC-V processor.
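The abstract does not spell out the sensitivity metric or the search procedure, so the following is a minimal sketch, in PyTorch, of what "sampling" layer-wise sensitivity against a combined accuracy/hardware-cost metric could look like. All names (`quantize_weight`, `hardware_cost_proxy`, `sample_sensitivity`, `search_mpq_scheme`), the candidate bit-widths, and the linear form of the combined metric are illustrative assumptions, not the authors' method.

```python
# Illustrative sketch of layer-wise sensitivity sampling for MPQ search.
# The bit-width candidates, cost proxy, and trade-off weight below are
# assumptions for demonstration, not values from the paper.

import torch

CANDIDATE_BITS = [2, 4, 8]   # assumed per-layer bit-width choices
LAMBDA = 0.1                 # assumed accuracy/cost trade-off weight


def quantize_weight(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform fake-quantization of a weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale


def hardware_cost_proxy(layer: torch.nn.Module, bits: int) -> float:
    """Hypothetical proxy: weight-memory footprint in bits."""
    return layer.weight.numel() * bits


@torch.no_grad()
def sample_sensitivity(model, layer_name, bits, eval_fn):
    """Quantize one layer to `bits`, measure accuracy, restore the
    weights, and return a combined metric (higher is better)."""
    layer = dict(model.named_modules())[layer_name]
    saved = layer.weight.data.clone()
    layer.weight.data = quantize_weight(layer.weight.data, bits)
    acc = eval_fn(model)              # accuracy on a held-out set
    layer.weight.data = saved         # restore full precision
    cost = hardware_cost_proxy(layer, bits)
    return acc - LAMBDA * cost / 1e6  # assumed linear combination


def search_mpq_scheme(model, layer_names, eval_fn):
    """Greedy search: for each layer independently, pick the bit-width
    that maximizes the combined accuracy/cost metric."""
    scheme = {}
    for name in layer_names:
        scores = {b: sample_sensitivity(model, name, b, eval_fn)
                  for b in CANDIDATE_BITS}
        scheme[name] = max(scores, key=scores.get)
    return scheme
```

In practice, layers interact, so independent greedy selection is only a starting point; the paper's actual metric and search algorithm may differ substantially from this sketch.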