Research Article
DOI: 10.1145/3453688.3461740

HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU

Published: 22 June 2021

Abstract

Although Transformer-based deep learning models have been widely used in many natural language processing (NLP) tasks as well as computer vision, they suffer from gigantic model size and long latency. Network pruning can reduce the computational cost and model size. However, existing works mainly focus on irregular (sparse) pruning, which often causes irregular computations and requires an extra index per remaining weight. In this work, we propose a Tensor-core inspired hierarchical model compression method to push the performance limit on modern GPUs. We present two modes of this two-step process. In the first mode, we use a Tensor-core aware block-based weight pruning method to exploit model sparsity in a coarse-grained manner, and then use low-rank decomposition [33] to further reduce the weight storage in a fine-grained manner. In the second mode, we first use irregular pruning to achieve a highly sparse model, and then apply the Tensor-core aware weight constraint on the sparse model to decompose the sparse matrix into several smaller but Tensor-core friendly sub-matrices. Experiments on Transformer and BERT_BASE models show that the proposed method outperforms the state-of-the-art.
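To make the first mode concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of the two steps applied to a single weight matrix. The 16x16 block size is assumed to match the WMMA tile shape used by Tensor cores [20]; the keep ratio and rank are illustrative placeholders, not values from the paper.

    # Hypothetical sketch of mode 1: Tensor-core aware block pruning (coarse-grained)
    # followed by low-rank decomposition (fine-grained). Block size, keep ratio, and
    # rank are illustrative assumptions.
    import torch

    def block_prune(weight: torch.Tensor, block: int = 16, keep_ratio: float = 0.5) -> torch.Tensor:
        """Zero out the (block x block) tiles with the smallest Frobenius norms."""
        rows, cols = weight.shape
        assert rows % block == 0 and cols % block == 0
        tiles = weight.reshape(rows // block, block, cols // block, block)
        scores = tiles.pow(2).sum(dim=(1, 3)).sqrt()              # one score per tile
        k = max(1, int(keep_ratio * scores.numel()))              # number of tiles to keep
        threshold = scores.flatten().kthvalue(scores.numel() - k + 1).values
        mask = (scores >= threshold).to(weight.dtype)[:, None, :, None]
        return (tiles * mask).reshape(rows, cols)

    def low_rank_factorize(weight: torch.Tensor, rank: int = 64):
        """Approximate the pruned weight as U @ V via a truncated SVD."""
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        return U[:, :rank] * S[:rank], Vh[:rank, :]

    # Example on a BERT-base sized projection (768 x 768).
    W = torch.randn(768, 768)
    W_pruned = block_prune(W, block=16, keep_ratio=0.5)
    U, V = low_rank_factorize(W_pruned, rank=64)
    print((U @ V - W_pruned).norm() / W_pruned.norm())            # relative reconstruction error

Because this style of pruning removes whole 16x16 tiles rather than individual weights, the surviving tiles stay dense and can be fed directly to Tensor-core WMMA GEMMs without the per-weight index bookkeeping that irregular pruning incurs.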

Supplemental Material

MP4 File

References

[1] Hangbo Bao et al. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. arXiv preprint arXiv:2002.12804 (2020).
[2] Iz Beltagy et al. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).
[3] Emily L. Denton et al. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems. 1269--1277.
[4] Jacob Devlin et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1).
[5] Bill Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Third International Workshop on Paraphrasing (IWP2005). Asia Federation of Natural Language Processing. https://www.microsoft.com/en-us/research/publication/automatically-constructing-a-corpus-of-sentential-paraphrases/
[6] Scott Gray et al. 2017. GPU Kernels for Block-Sparse Weights.
[7] Cong Guo et al. 2020. Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 204--218.
[8] Song Han et al. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems. 1135--1143.
[9] Song Han et al. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS). 1135--1143.
[10] Song Han et al. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 243--254.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
[12] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
[13] NVIDIA Inc. [n.d.]. NVIDIA Tesla V100 GPU Architecture. Retrieved from https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. Accessed: March 6, 2021.
[14] Nikita Kitaev et al. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020).
[15] Hector Levesque et al. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
[16] Bingbing Li et al. 2020. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[17] Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[18] Stephen Merity et al. 2017. Pointer Sentinel Mixture Models. In 5th International Conference on Learning Representations (ICLR).
[19] Sharan Narang et al. 2017. Block-Sparse Recurrent Neural Networks. arXiv:1711.02782 [cs.LG]
[20] NVIDIA. [n.d.]. CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma
[21] CUDA Nvidia. 2008. cuBLAS library. NVIDIA Corporation, Santa Clara, California 15, 27 (2008), 31.
[22] Sai Prasanna et al. 2020. When BERT Plays the Lottery, All Tickets Are Winning. arXiv preprint arXiv:2005.00561 (2020).
[23] PyTorch. [n.d.]. Transformer tutorial. https://pytorch.org/tutorials/beginner/transformer_tutorial.html
[24] Colin Raffel et al. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[25] Richard Socher et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, 1631--1642. https://www.aclweb.org/anthology/D13-1170
[26] Ashish Vaswani et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.
[27] Alex Wang et al. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
[28] Wei Wen et al. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems. 2074--2082.
[29] Thomas Wolf et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).
[30] Qizhe Xie et al. 2020. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems 33 (2020).
[31] Jiecao Yu et al. 2017. Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 548--560.
[32] Xingxing Zhang et al. 2019. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5059--5069.
[33] Yong Zhao, Jinyu Li, and Yifan Gong. 2016. Low-rank plus diagonal adaptation for deep neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5005--5009.
[34] Shanglin Zhou, Bingbing Li, Caiwu Ding, Lu Lu, and Caiwen Ding. 2020. An Efficient Deep Reinforcement Learning Framework for UAVs. In 2020 21st International Symposium on Quality Electronic Design (ISQED).

Cited By

  • (2023) A survey of techniques for optimizing transformer inference. Journal of Systems Architecture 144, 102990. DOI: 10.1016/j.sysarc.2023.102990. Online publication date: Nov 2023.
  • (2022) An Automatic and Efficient BERT Pruning for Edge AI Systems. 2022 23rd International Symposium on Quality Electronic Design (ISQED), 1-6. DOI: 10.1109/ISQED54688.2022.9806197. Online publication date: 6 Apr 2022.
  • (2022) Towards Sparsification of Graph Neural Networks. 2022 IEEE 40th International Conference on Computer Design (ICCD), 272-279. DOI: 10.1109/ICCD56317.2022.00048. Online publication date: Oct 2022.

    Published In

    GLSVLSI '21: Proceedings of the 2021 Great Lakes Symposium on VLSI
    June 2021
    504 pages
    ISBN:9781450383936
    DOI:10.1145/3453688

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. bert
    2. block weight pruning
    3. low-rank
    4. tensor-core
    5. transformer

    Qualifiers

    • Research-article

    Data Availability

    Supplemental video: https://dl.acm.org/doi/10.1145/3453688.3461740#GLSVLSI21_vlsi13s.mp4

    Conference

    GLSVLSI '21: Great Lakes Symposium on VLSI 2021
    June 22 - 25, 2021
    Virtual Event, USA
