Accelerating BERT inference with GPU-efficient exit prediction

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

BERT is a representative pre-trained language model that has drawn extensive attention for its significant improvements on downstream Natural Language Processing (NLP) tasks. Its complex architecture and massive parameters give BERT competitive performance but also make inference slow. To speed up BERT inference, FastBERT realizes adaptive inference with an acceptable drop in accuracy based on knowledge distillation and the early-exit technique. However, several factors may limit the performance of FastBERT, such as a teacher classifier that is not knowledgeable enough, batch size shrinkage, and redundant computation of student classifiers. To overcome these limitations, we propose a new BERT inference method with GPU-Efficient Exit Prediction (GEEP). GEEP leverages a shared exit loss to simplify FastBERT's two-step training process into a single step, and makes the teacher classifier more knowledgeable by feeding it diverse Transformer outputs. In addition, an exit-layer prediction technique is proposed that uses a GPU hash table to handle the token-level exit-layer distribution and sorts test samples by their predicted exit layers. In this way, GEEP avoids batch size shrinkage and redundant computation of student classifiers. Experimental results on twelve public English and Chinese NLP datasets demonstrate the effectiveness of the proposed approach. The source code of GEEP will be released to the public upon paper acceptance.
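
To make the exit-layer prediction idea concrete, below is a minimal CPU-side sketch: per-token exit layers recorded in a hash table are aggregated into a per-sample prediction, and test samples are then grouped by predicted exit layer so that every batch runs through the same number of Transformer layers, which avoids batch size shrinkage. The table contents, the median aggregation rule, the default exit depth, and all names (token_exit_table, predict_exit_layer, group_by_exit_layer) are illustrative assumptions, not GEEP's actual implementation, which keeps the table in a GPU hash table.

# Hypothetical sketch of GEEP-style exit-layer prediction and batch regrouping.
# A Python dict stands in for the GPU hash table described in the paper.
from collections import defaultdict
from statistics import median

# Token-level exit-layer statistics (token -> typical exit layer), assumed to be
# collected on the training set; the values here are made up for illustration.
token_exit_table = {"movie": 3, "terrible": 2, "the": 6, "plot": 4}
DEFAULT_EXIT = 6  # fallback depth for unseen tokens (assumed policy)

def predict_exit_layer(tokens):
    # Aggregate the tokens' recorded exit layers into one per-sample prediction.
    layers = [token_exit_table.get(t, DEFAULT_EXIT) for t in tokens]
    return int(median(layers))  # the aggregation rule is an assumption

def group_by_exit_layer(samples):
    # Group samples that share a predicted exit layer, so each batch can be run
    # through exactly that many Transformer layers with no per-sample early exit.
    buckets = defaultdict(list)
    for sample in samples:
        buckets[predict_exit_layer(sample)].append(sample)
    return dict(sorted(buckets.items()))

if __name__ == "__main__":
    test_samples = [["the", "movie"], ["terrible", "plot"], ["unseen", "tokens"]]
    for layer, batch in group_by_exit_layer(test_samples).items():
        print(f"exit layer {layer}: {batch}")  # each batch runs `layer` layers

Batches built this way stay full-sized, since no sample inside a batch exits earlier than the others; the accuracy/speed trade-off then depends on how well the predicted exit layer matches the layer a confidence-based early exit would have chosen.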

References

  1. Devlin J, Chang M W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, 4171–4186

  2. Radford A, Narasimhan K. Improving language understanding by generative pre-training. See https://cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf website. 2018

  3. Yang Z, Dai Z, Yang Y, Carbonell J G, Salakhutdinov R, Le Q. XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 517

  4. Gou J, Yu B, Maybank S J, Tao D. Knowledge distillation: a survey. International Journal of Computer Vision, 2021, 129(6): 1789–1819

  5. Laskaridis S, Kouris A, Lane N D. Adaptive inference through early-exit networks: design, challenges and directions. In: Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning. 2021, 1–6

  6. Liu W, Zhou P, Wang Z, Zhao Z, Deng H, Ju Q. FastBERT: a self-distilling BERT with adaptive inference time. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 6035–6044

  7. Wang C, Qiu M, Zhang T, Liu T, Li L, Wang J, Wang M, Huang J, Lin W. EasyNLP: a comprehensive and easy-to-use toolkit for natural language processing. 2022, arXiv preprint arXiv: 2205.00258

  8. Wang C, Qiu M, Huang J. Building natural language processing applications with EasyNLP. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022, 5100–5101

  9. Buciluă C, Caruana R, Niculescu-Mizil A. Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2006, 535–541

  10. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. 2015, arXiv preprint arXiv: 1503.02531

  11. Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2019, arXiv preprint arXiv: 1910.01108

  12. Zhang L, Song J, Gao A, Chen J, Bao C, Ma K. Be your own teacher: improve the performance of convolutional neural networks via self distillation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 3712–3721

  13. Berestizshevsky K, Even G. Dynamically sacrificing accuracy for reduced computation: cascaded inference based on softmax confidence. In: Proceedings of the 28th International Conference on Artificial Neural Networks (ICANN 2019). 2019, 306–320

  14. Gormez A, Koyuncu E. Class means as an early exit decision mechanism. 2021, arXiv preprint arXiv: 2103.01148v1

  15. Jiang H, Kim B, Guan M Y, Gupta M. To trust or not to trust a classifier. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018, 5546–5557

  16. Zhou W, Xu C, Ge T, McAuley J J, Xu K, Wei F. BERT loses patience: fast and robust inference with early exit. In: Proceedings of the Conference on Neural Information Processing Systems. 2020, 18330–18341

  17. Sun T, Liu X, Zhu W, Geng Z, Wu L, He Y, Ni Y, Xie G, Huang X, Qiu X. A simple hash-based early exiting approach for language understanding and generation. In: Proceedings of Findings of the Association for Computational Linguistics: ACL 2022. 2022, 2409–2421

  18. Lessley B, Childs H. Data-parallel hashing techniques for GPU architectures. IEEE Transactions on Parallel and Distributed Systems, 2020, 31(1): 237–250

  19. Cormen T H, Leiserson C E, Rivest R L, Stein C. Introduction to Algorithms. 3rd ed. Massachusetts: The MIT Press, 2009

  20. Bordawekar R. Evaluation of parallel hashing techniques. In: Proceedings of the GPU Technology Conference. See https://on-demand.gputechconf.com/gtc/2014/presentations/S4507-evaluation-of-parallel-hashing-techniques.pdf website. 2014, 1–27

  21. Pagh R, Rodler F F. Cuckoo hashing. Journal of Algorithms, 2004, 51(2): 122–144

  22. Breslow A D, Jayasena N S. Morton filters: faster, space-efficient cuckoo filters via biasing, compression, and decoupled logical sparsity. Proceedings of the VLDB Endowment, 2018, 11(9): 1041–1055

  23. Alipourfard O, Moshref M, Zhou Y, Yang T, Yu M. A comparison of performance and accuracy of measurement algorithms in software. In: Proceedings of the Symposium on SDN Research. 2018, 18

  24. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000–6010

  25. Voita E, Sennrich R, Titov I. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In: Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019, 4396–4406

  26. Xiong R, Yang Y, He D, Zheng K, Zheng S, Xing C, Zhang H, Lan Y, Wang L, Liu T. On layer normalization in the transformer architecture. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 10524–10533

  27. Cover T M, Thomas J A. Elements of Information Theory. 2nd ed. Hoboken: John Wiley & Sons, Inc., 2006, 57–58

  28. Liu X, Chen Q, Deng C, Zeng H, Chen J, Li D, Tang B. LCQMC: a large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics. 2018, 1952–1962

  29. Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. 2015, 649–657

  30. Jiao X, Yin Y, Shang L, Jiang X, Chen X, Li L, Wang F, Liu Q. TinyBERT: distilling BERT for natural language understanding. In: Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2020. 2020, 4163–4174

  31. Chen X, He B, Hui K, Sun L, Sun Y. Simplified TinyBERT: knowledge distillation for document retrieval. In: Proceedings of the 43rd European Conference on Information Retrieval. 2021, 241–248

  32. Li L, Lin Y, Chen D, Ren S, Li P, Zhou J, Sun X. CascadeBERT: accelerating inference of pre-trained language models via calibrated complete models cascade. In: Proceedings of Findings of the Association for Computational Linguistics: EMNLP 2021. 2021, 475–486

  33. Sun T, Zhou Y, Liu X, Zhang X, Jiang H, Cao Z, Huang X, Qiu X. Early exiting with ensemble internal classifiers. 2021, arXiv preprint arXiv: 2105.13792

  34. Zhu W. LeeBERT: learned early exit for BERT with cross-level optimization. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021, 2968–2980

  35. Xin J, Tang R, Lee J, Yu Y, Lin J. DeeBERT: dynamic early exiting for accelerating BERT inference. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020, 2246–2251

  36. Wang A, Singh A, Michael J, Hill F, Levy O, Bowman S. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. 2018, 353–355

Acknowledgements

This work has been supported by the National Natural Science Foundation of China (Grant Nos. U1911203, 61877018, 61977025, 62202170), and Alibaba Group through the Alibaba Innovation Research Program.

Author information

Corresponding author

Correspondence to Cen Chen.

Additional information

Lei Li received his master's degree in computer technology from Yunnan University, China in 2019. He is a PhD candidate in software engineering at East China Normal University, China, under the supervision of Professor Ming Gao. He is interested in Natural Language Processing and efficient model inference.

Chengyu Wang is an algorithm expert at Alibaba Group. He obtained his PhD degree from East China Normal University (ECNU), China. Currently, he works on deep learning algorithms on various topics for the Alibaba Cloud Machine Learning Platform of AI (PAI) and builds the NLP toolkits EasyTransfer and EasyNLP for Alibaba Cloud. He has published more than 70 research papers in international conferences and journals, such as ACL, KDD, WWW, SIGIR, AAAI, TKDE, and WSDM.

Minghui Qiu holds a PhD degree from the School of Information Systems, Singapore Management University, Singapore, where he was supervised by Associate Professor Jing Jiang and Professor Ee-Peng Lim. From 2013 to 2014, he visited the Language Technologies Institute, Carnegie Mellon University, USA, working with Noah Smith and Alex Smola. In the summer of 2014, he worked as an intern at Google Inc., Mountain View, CA, with Amr Ahmed and Yuan Wang. He is currently a senior algorithm expert at Alibaba Cloud, working on deep learning and transfer learning for many NLP tasks, including paraphrastic sentence/document embedding, neural conversation models, and sequence labeling. He is responsible for building the NLP and transfer learning toolkit EasyNLP for Alibaba Cloud, supporting 10+ business units and 20+ applications in Alibaba Group.

Cen Chen is currently a tenure-track Associate Professor at East China Normal University, China. Before that, she worked as an algorithm expert at Ant Group from Aug 2017 to Aug 2021 (selected as Alistar 2017). She obtained her PhD degree from Singapore Management University under the supervision of Professor Lau Hoong Chuin and Associate Professor Cheng Shihfen from Jan 2013 to Jun 2017. From Aug 2015 to June 2016, she visited the Robotics Institute, Carnegie Mellon University, USA, working with Professor Stephen F. Smith and Dr. Zack Rubinstein. Her research focuses on analyzing, modeling, and designing intelligent systems for supporting business and/or financial decision-making. Her recent work covers federated learning, transfer learning, and retrieval-based QA.

Ming Gao is a professor at the School of Data Science and Engineering (DASE), East China Normal University, China. Prior to joining ECNU, he worked with Professor Ee-Peng Lim as a postdoctoral fellow in the Social Network Mining Research Group, School of Information Systems, Singapore Management University, Singapore. Before that, he started his PhD program at Fudan University, China in 2008. His main research interests are knowledge graphs, knowledge engineering, user profiling, social mining, and uncertain data management.

Aoying Zhou is a professor of computer science at East China Normal University (ECNU), China, where he heads the School of Data Science and Engineering. Before joining ECNU in 2008, he worked in the Computer Science Department at Fudan University for 15 years. He is a winner of the National Science Fund for Distinguished Young Scholars supported by NSFC. He is now a vice director of ACM SIGMOD China and of the Database Technology Committee of the China Computer Federation. He serves as a member of the editorial boards of the VLDB Journal, the WWW Journal, and others. His research interests include data management, in-memory cluster computing, big data benchmarking, and performance optimization.


About this article

Cite this article

Li, L., Wang, C., Qiu, M. et al. Accelerating BERT inference with GPU-efficient exit prediction. Front. Comput. Sci. 18, 183308 (2024). https://doi.org/10.1007/s11704-022-2341-9
