Research Article
DOI: 10.1145/3453688.3461740

HMC-TRAN: A Tensor-core Inspired Hierarchical Model Compression for Transformer-based DNNs on GPU

Published: 22 June 2021

Abstract

Although Transformer-based deep learning models have been widely used in many natural language processing (NLP) tasks as well as computer vision, they suffer from gigantic model size and long latency. Network pruning can reduce the computational cost and model size. However, existing works mainly focus on irregular (sparse) pruning, which often causes irregular computations and requires an extra index per remaining weight. In this work, we propose a Tensor-core inspired hierarchical model compression method to push the performance limit on modern GPUs. We present two modes of this two-step process. In the first mode, we use a Tensor-core aware block-based weight pruning method to exploit model sparsity in a coarse-grained manner, and then use low-rank decomposition [33] to further reduce the weight storage in a fine-grained manner. In the second mode, we first use irregular pruning to achieve a highly sparse model, and then apply the Tensor-core aware weight constraint on the sparse model to decompose the sparse matrix into several smaller but Tensor-core friendly sub-matrices. Experiments on Transformer and BERT_BASE models show that the proposed method outperforms the state-of-the-art.
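To make the first mode concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of the two steps applied to a single weight matrix. The 16x16 block size is assumed to match the WMMA tile shape used by Tensor cores [20]; the keep ratio and rank are illustrative placeholders, not values from the paper.

    # Hypothetical sketch of mode 1: Tensor-core aware block pruning (coarse-grained)
    # followed by low-rank decomposition (fine-grained). Block size, keep ratio, and
    # rank are illustrative assumptions.
    import torch

    def block_prune(weight: torch.Tensor, block: int = 16, keep_ratio: float = 0.5) -> torch.Tensor:
        """Zero out the (block x block) tiles with the smallest Frobenius norms."""
        rows, cols = weight.shape
        assert rows % block == 0 and cols % block == 0
        tiles = weight.reshape(rows // block, block, cols // block, block)
        scores = tiles.pow(2).sum(dim=(1, 3)).sqrt()              # one score per tile
        k = max(1, int(keep_ratio * scores.numel()))              # number of tiles to keep
        threshold = scores.flatten().kthvalue(scores.numel() - k + 1).values
        mask = (scores >= threshold).to(weight.dtype)[:, None, :, None]
        return (tiles * mask).reshape(rows, cols)

    def low_rank_factorize(weight: torch.Tensor, rank: int = 64):
        """Approximate the pruned weight as U @ V via a truncated SVD."""
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        return U[:, :rank] * S[:rank], Vh[:rank, :]

    # Example on a BERT-base sized projection (768 x 768).
    W = torch.randn(768, 768)
    W_pruned = block_prune(W, block=16, keep_ratio=0.5)
    U, V = low_rank_factorize(W_pruned, rank=64)
    print((U @ V - W_pruned).norm() / W_pruned.norm())            # relative reconstruction error

Because this style of pruning removes whole 16x16 tiles rather than individual weights, the surviving tiles stay dense and can be fed directly to Tensor-core WMMA GEMMs without the per-weight index bookkeeping that irregular pruning incurs.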

Supplemental Material

MP4 File

References

[1] Hangbo Bao et al. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. arXiv preprint arXiv:2002.12804 (2020).
[2] Iz Beltagy et al. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).
[3] Emily L. Denton et al. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems. 1269--1277.
[4] Jacob Devlin et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1).
[5] Bill Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Third International Workshop on Paraphrasing (IWP2005). Asia Federation of Natural Language Processing. https://www.microsoft.com/en-us/research/publication/automatically-constructing-a-corpus-of-sentential-paraphrases/
[6] Scott Gray et al. 2017. GPU Kernels for Block-Sparse Weights.
[7] Cong Guo et al. 2020. Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE Computer Society, 204--218.
[8] Song Han et al. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems. 1135--1143.
[9] Song Han et al. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS). 1135--1143.
[10] Song Han et al. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 243--254.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770--778.
[12] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
[13] NVIDIA Inc. [n.d.]. NVIDIA Tesla V100 GPU Architecture. Retrieved from https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. Accessed: March 6, 2021.
[14] Nikita Kitaev et al. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020).
[15] Hector Levesque et al. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
[16] Bingbing Li et al. 2020. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[17] Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[18] Stephen Merity et al. 2017. Pointer Sentinel Mixture Models. In 5th International Conference on Learning Representations (ICLR).
[19] Sharan Narang et al. 2017. Block-Sparse Recurrent Neural Networks. arXiv:1711.02782 [cs.LG]
[20] NVIDIA. [n.d.]. CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma
[21] CUDA Nvidia. 2008. cuBLAS library. NVIDIA Corporation, Santa Clara, California 15, 27 (2008), 31.
[22] Sai Prasanna et al. 2020. When BERT Plays the Lottery, All Tickets Are Winning. arXiv preprint arXiv:2005.00561 (2020).
[23] PyTorch. [n.d.]. Transformer tutorial. https://pytorch.org/tutorials/beginner/transformer_tutorial.html
[24] Colin Raffel et al. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683 (2019).
[25] Richard Socher et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, USA, 1631--1642. https://www.aclweb.org/anthology/D13-1170
[26] Ashish Vaswani et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.
[27] Alex Wang et al. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018).
[28] Wei Wen et al. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems. 2074--2082.
[29] Thomas Wolf et al. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).
[30] Qizhe Xie et al. 2020. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems 33 (2020).
[31] Jiecao Yu et al. 2017. Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 548--560.
[32] Xingxing Zhang et al. 2019. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 5059--5069.
[33] Yong Zhao, Jinyu Li, and Yifan Gong. 2016. Low-rank plus diagonal adaptation for deep neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5005--5009.
[34] Shanglin Zhou, Bingbing Li, Caiwu Ding, Lu Lu, and Caiwen Ding. 2020. An Efficient Deep Reinforcement Learning Framework for UAVs. In 2020 21st International Symposium on Quality Electronic Design (ISQED).

Cited By

  • (2023) A survey of techniques for optimizing transformer inference. Journal of Systems Architecture 144, 102990. DOI: 10.1016/j.sysarc.2023.102990. Online publication date: Nov 2023.
  • (2022) An Automatic and Efficient BERT Pruning for Edge AI Systems. 2022 23rd International Symposium on Quality Electronic Design (ISQED), 1-6. DOI: 10.1109/ISQED54688.2022.9806197. Online publication date: 6 Apr 2022.
  • (2022) Towards Sparsification of Graph Neural Networks. 2022 IEEE 40th International Conference on Computer Design (ICCD), 272-279. DOI: 10.1109/ICCD56317.2022.00048. Online publication date: Oct 2022.

    Published In

    GLSVLSI '21: Proceedings of the 2021 Great Lakes Symposium on VLSI
    June 2021
    504 pages
    ISBN:9781450383936
    DOI:10.1145/3453688

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. bert
    2. block weight pruning
    3. low-rank
    4. tensor-core
    5. transformer

    Qualifiers

    • Research-article

    Data Availability

    Supplemental video: https://dl.acm.org/doi/10.1145/3453688.3461740#GLSVLSI21_vlsi13s.mp4

    Conference

    GLSVLSI '21: Great Lakes Symposium on VLSI 2021
    June 22 - 25, 2021
    Virtual Event, USA
