DAP-BERT: Differentiable Architecture Pruning of BERT

Conference paper in Neural Information Processing (ICONIP 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 13108)

Abstract

Recent pre-trained language models (PLMs) such as BERT incur increasing computational and memory overhead. In this paper, we focus on automatic pruning for efficient BERT architectures on natural language understanding tasks. Specifically, we propose differentiable architecture pruning (DAP), which prunes redundant attention heads and hidden dimensions in BERT and benefits from both network pruning and neural architecture search. Moreover, DAP adapts to different resource constraints, so the pruned BERT can be deployed on a variety of edge devices. Empirical results show that the \(\text{BERT}_\text{BASE}\) architecture pruned by DAP achieves a \(5\times\) speed-up with only a minor performance drop. The code is available at https://github.com/OscarYau525/DAP-BERT.
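As a rough, self-contained illustration of the idea (this is not the authors' released code; the layer sizes, the sigmoid gate parameterization, and the cost weight lambda_cost are assumptions made for the sketch), the snippet below attaches one learnable architecture gate to each attention head, relaxes the binary keep/prune decision so it can be trained by gradient descent jointly with the task loss, and penalizes the expected number of surviving heads against a target budget; heads whose gates end up below a threshold would then be pruned.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedMultiHeadAttention(nn.Module):
        """Self-attention layer with one learnable keep/prune gate per head (sketch)."""

        def __init__(self, hidden_size=768, num_heads=12):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = hidden_size // num_heads
            self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
            self.out = nn.Linear(hidden_size, hidden_size)
            # Architecture parameters: sigmoid(logit) is the soft probability
            # of keeping the corresponding head.
            self.head_logits = nn.Parameter(torch.zeros(num_heads))

        def head_probs(self):
            return torch.sigmoid(self.head_logits)

        def forward(self, x):
            b, t, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            split = lambda z: z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
            q, k, v = split(q), split(k), split(v)
            attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
            ctx = attn @ v                                   # (b, heads, t, head_dim)
            ctx = ctx * self.head_probs().view(1, -1, 1, 1)  # soft head gating
            return self.out(ctx.transpose(1, 2).reshape(b, t, -1))

    def architecture_loss(task_loss, layers, budget_heads, lambda_cost=0.1):
        # Task loss plus a differentiable resource penalty that pushes the
        # expected number of surviving heads toward the budget.
        expected_heads = sum(layer.head_probs().sum() for layer in layers)
        return task_loss + lambda_cost * F.relu(expected_heads - budget_heads)

    if __name__ == "__main__":
        layer = GatedMultiHeadAttention()
        x = torch.randn(2, 16, 768)
        task_loss = layer(x).pow(2).mean()   # placeholder for a real task loss
        loss = architecture_loss(task_loss, [layer], budget_heads=6)
        loss.backward()                      # gradients reach head_logits as well
        kept = (layer.head_probs() >= 0.5).sum().item()
        print(f"{kept} of {layer.num_heads} heads would be kept at threshold 0.5")

In the paper's setting, an analogous relaxation would also cover the hidden dimensions of the feed-forward layers, with the resource penalty matched to the target device's constraint.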

Notes

  1. To obtain the task-specific parameters, we follow the standard fine-tuning pipeline in https://huggingface.co/bert-base-uncased; a minimal sketch of this pipeline is given after these notes.

  2. Task-specific model parameters are available at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT.
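
The following is a minimal sketch of that standard fine-tuning pipeline, assuming the Hugging Face transformers and datasets libraries and using the GLUE SST-2 task as an example; the hyperparameters shown are illustrative, not the paper's.

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    # Tokenize a GLUE task (SST-2 shown here as an example).
    dataset = load_dataset("glue", "sst2")
    encoded = dataset.map(
        lambda ex: tokenizer(ex["sentence"], truncation=True,
                             padding="max_length", max_length=128),
        batched=True)

    args = TrainingArguments(output_dir="bert-base-uncased-sst2",
                             num_train_epochs=3,
                             per_device_train_batch_size=32,
                             learning_rate=2e-5)
    trainer = Trainer(model=model, args=args,
                      train_dataset=encoded["train"],
                      eval_dataset=encoded["validation"])
    trainer.train()
    trainer.save_model("bert-base-uncased-sst2")  # task-specific parameters for pruning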

Acknowledgement

The work described in this paper was partially supported by the National Key Research and Development Program of China (No. 2018AAA0100204) and the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14210920 of the General Research Fund).

Author information

Corresponding author

Correspondence to Chung-Yiu Yau.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yau, C.Y., Bai, H., King, I., Lyu, M.R. (2021). DAP-BERT: Differentiable Architecture Pruning of BERT. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol. 13108. Springer, Cham. https://doi.org/10.1007/978-3-030-92185-9_30

  • DOI: https://doi.org/10.1007/978-3-030-92185-9_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92184-2

  • Online ISBN: 978-3-030-92185-9

  • eBook Packages: Computer Science, Computer Science (R0)
