
OdeBERT: One-stage Deep-supervised Early-exiting BERT for Fast Inference in User Intent Classification

Published: 09 May 2023

Abstract

User intent classification is a vital task for identifying users' essential requirements from their input queries in information retrieval systems, question answering systems, and dialogue systems. The pre-trained language model Bidirectional Encoder Representations from Transformers (BERT) has been widely applied to user intent classification. However, BERT is compute-intensive and time-consuming during inference, which often causes latency in real-time applications. To improve the inference efficiency of BERT for user intent classification, this article proposes a new network, the one-stage deep-supervised early-exiting BERT (OdeBERT). A deep supervision strategy equips the network with internal classifiers that are trained jointly with the backbone in a single stage, improving the classifiers' learning by extracting discriminative category features. Experiments are conducted on publicly available datasets, including ECDT, SNIPS, and FDQuestion. The results show that OdeBERT speeds up the original BERT by up to 12 times while preserving its performance, outperforming state-of-the-art baseline methods.
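The early-exit mechanism described in the abstract can be made concrete with a short sketch. The snippet below is illustrative only: it assumes a generic Transformer backbone, a [CLS]-style sentence representation, and an entropy-based exit criterion; the class name, threshold value, and layer interface are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class EarlyExitEncoder(nn.Module):
    """Multi-exit Transformer encoder sketch: an internal classifier is
    attached after every layer, so easy inputs can exit before the top."""

    def __init__(self, layers, hidden_size, num_classes, entropy_threshold=0.4):
        super().__init__()
        self.layers = nn.ModuleList(layers)          # the backbone's Transformer layers
        self.exits = nn.ModuleList(                  # one internal classifier per layer
            [nn.Linear(hidden_size, num_classes) for _ in layers]
        )
        self.entropy_threshold = entropy_threshold   # lower value = higher confidence required

    def forward(self, hidden_states):
        """Training: return logits from every exit so a single joint loss
        (e.g., the sum of per-exit cross-entropy) supervises all classifiers at once.
        Inference: assume batch size 1 and stop at the first confident exit."""
        all_logits = []
        for layer, head in zip(self.layers, self.exits):
            hidden_states = layer(hidden_states)
            logits = head(hidden_states[:, 0])       # classify on the [CLS] position
            all_logits.append(logits)
            if not self.training:
                probs = torch.softmax(logits, dim=-1)
                entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
                if entropy.item() < self.entropy_threshold:
                    return logits                    # confident enough: exit early
        return all_logits if self.training else all_logits[-1]
```

In a setup of this kind, training optimizes a single joint objective over `all_logits`, which corresponds to one-stage deep supervision of every internal classifier, while at inference simple queries exit at a shallow layer and harder ones propagate further up the stack, which is where the reported speed-up would come from.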





    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 5
    May 2023
    653 pages
    ISSN:2375-4699
    EISSN:2375-4702
    DOI:10.1145/3596451

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 09 May 2023
    Online AM: 13 March 2023
    Accepted: 07 March 2023
    Revised: 18 January 2023
    Received: 10 May 2022
    Published in TALLIP Volume 22, Issue 5


    Author Tags

    1. OdeBERT
    2. user intent classification
    3. inference
    4. BERT
    5. deep supervision

    Qualifiers

    • Research-article

    Funding Sources

    • Research Grants Council of the Hong Kong Special Administrative Region, China


    Cited By

    • (2024) Early-Exit Deep Neural Network: A Comprehensive Survey. ACM Computing Surveys 57(3), 1–37. DOI: 10.1145/3698767. Online publication date: 22-Nov-2024.
    • (2024) Self-adaptive Education Resource Allocation Using BERT Model. Proceedings of the 2024 International Conference on Machine Intelligence and Digital Applications, 517–522. DOI: 10.1145/3662739.3672178. Online publication date: 30-May-2024.
    • (2024) Improving Open Intent Detection via Triplet-Contrastive Learning and Adaptive Boundary. IEEE Transactions on Consumer Electronics 70(1), 2806–2816. DOI: 10.1109/TCE.2024.3363896. Online publication date: 19-Feb-2024.
    • (2024) Risk Early Warning of a Dynamic Ideological and Political Education System Based on LSTM-MLP: Online Education Data Processing and Optimization. Mobile Networks and Applications 29(2). DOI: 10.1007/s11036-024-02439-0. Online publication date: 1-Apr-2024.
    • (2024) The CHIP 2023 Shared Task 6: Chinese Diabetes Question Classification. Health Information Processing: Evaluation Track Papers, 197–204. DOI: 10.1007/978-981-97-1717-0_18. Online publication date: 20-Mar-2024.
    • (2023) Shared Task 1 on NCAA 2023: Chinese Diabetes Question Classification. International Conference on Neural Computing for Advanced Applications, 591–596. DOI: 10.1007/978-981-99-5847-4_42. Online publication date: 30-Aug-2023.
    • (2023) A Triplet-Contrastive Representation Learning Strategy for Open Intent Detection. International Conference on Neural Computing for Advanced Applications, 229–244. DOI: 10.1007/978-981-99-5847-4_17. Online publication date: 30-Aug-2023.
