Research Article | Public Access
DOI: 10.1145/3572848.3577478

TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

Published: 21 February 2023

Abstract

Tucker decomposition is one of the state-of-the-art CNN model compression techniques. However, unlike the reduction in FLOPs, we observe only a very limited reduction in inference time for Tucker-compressed models when using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that generates highly accurate and compact CNN models via Tucker decomposition together with optimized inference code for GPUs. Specifically, we propose an ADMM-based training algorithm that achieves highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation of five modern CNNs on an A100 GPU demonstrates that our compressed models with our optimized code achieve up to a 2.21× speedup over cuDNN, a 1.12× speedup over TVM, and a 3.27× speedup over the original models using cuDNN, with at most 0.05% accuracy loss.
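To make the layer structure concrete, below is a minimal PyTorch sketch of a Tucker-2-format convolution, the general form that Tucker-based CNN compression produces: a 1×1 convolution reducing the input channels to rank r_in, a small K×K "core" convolution in the low-rank space, and a 1×1 convolution restoring the output channels. This illustrates the layer format only, not the paper's optimized GPU kernel, ADMM training, or rank-selection procedure; the class name TuckerConv2d and the ranks in the example are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

class TuckerConv2d(nn.Module):
    """Tucker-2-format replacement for a dense Conv2d (illustrative only)."""
    def __init__(self, c_in, c_out, kernel_size, r_in, r_out, padding=0):
        super().__init__()
        # Factor along input channels: pointwise conv, c_in -> r_in
        self.reduce = nn.Conv2d(c_in, r_in, kernel_size=1, bias=False)
        # Core tensor: the only spatial (K x K) conv, done in the low-rank space
        self.core = nn.Conv2d(r_in, r_out, kernel_size, padding=padding, bias=False)
        # Factor along output channels: pointwise conv, r_out -> c_out
        self.restore = nn.Conv2d(r_out, c_out, kernel_size=1, bias=False)

    def forward(self, x):
        return self.restore(self.core(self.reduce(x)))

# Hypothetical example: a 256 -> 256 3x3 conv compressed with ranks (64, 64).
layer = TuckerConv2d(256, 256, kernel_size=3, r_in=64, r_out=64, padding=1)
x = torch.randn(1, 256, 56, 56)
print(layer(x).shape)  # torch.Size([1, 256, 56, 56])
```

With these hypothetical ranks, the dense 3×3 layer costs 256·256·9 ≈ 590K multiply-accumulates per output pixel, while the Tucker-2 form costs 256·64 + 64·64·9 + 64·256 ≈ 70K, roughly an 8.5× FLOP reduction. The abstract's observation is precisely that such FLOP reductions do not translate into comparable wall-clock speedups under existing GPU software, which motivates the paper's specialized kernel and inference-time-driven rank selection.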


Cited By

  • (2024) "A light-weight skeleton human action recognition model with knowledge distillation for edge intelligent surveillance applications." Applied Soft Computing, 151:111166. DOI: 10.1016/j.asoc.2023.111166. Online publication date: Jan 2024.
  • (2023) "COMCAT." Proceedings of the 40th International Conference on Machine Learning, pages 38125--38136. DOI: 10.5555/3618408.3619995. Online publication date: 23 Jul 2023.
  • (2023) "ETTE: Efficient Tensor-Train-based Computing Engine for Deep Neural Networks." Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1--13. DOI: 10.1145/3579371.3589103. Online publication date: 17 Jun 2023.

Published In

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
February 2023
480 pages
ISBN: 9798400700156
DOI: 10.1145/3572848
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. convolutional neural network
  3. inference
  4. model compression
  5. performance

Conference

PPoPP '23

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 296
  • Downloads (Last 6 weeks): 22

Reflects downloads up to 14 February 2025
