
Classification of Heads in Multi-head Attention Mechanisms

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2022)

Abstract

The Transformer has become the dominant modeling paradigm in deep learning, and multi-head attention is one of its critical components. While multi-head attention improves the Transformer's effectiveness, it also has issues. Once the number of heads grows past a certain point, some attention heads produce remarkably similar attention graphs, which indicates that these heads are performing repetitive calculations. Some heads may even focus on extraneous information, harming the final result. After analyzing the multi-head attention mechanism, this paper argues that the consistency of the inputs to the multi-head attention mechanism is the underlying reason for the similarity of the attention graphs between heads. For this reason, this paper proposes the concept of classifying the heads in the multi-head attention mechanism and summarizes the general classification process. Three classification schemes are designed for the Multi30k dataset. Experiments demonstrate that our method converges faster than the baseline model and that BLEU improves by 3.08–4.38 points over the baseline.
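To make the premise concrete, below is a minimal NumPy sketch of standard multi-head self-attention (in the sense of Vaswani et al.), not the classification scheme proposed in the paper; the toy dimensions, random projections, and function names are illustrative assumptions. It shows that every head is computed from the same input sequence and differs only in its projection matrices, which is the input consistency the abstract identifies as the reason heads can end up with nearly identical attention graphs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Standard multi-head self-attention over one sequence x of shape (seq_len, d_model).

    Every head receives the *same* input x; heads differ only in their
    (here randomly initialized) projection matrices. This is the
    "consistent input" situation the paper points to as a source of
    redundant, near-identical attention graphs.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    attention_graphs, head_outputs = [], []
    for _ in range(num_heads):
        # Per-head projections (toy initialization for illustration only).
        w_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        w_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        w_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        # Scaled dot-product attention; "weights" is this head's attention graph.
        weights = softmax(q @ k.T / np.sqrt(d_head))
        attention_graphs.append(weights)
        head_outputs.append(weights @ v)
    # Concatenate head outputs back to d_model (final output projection omitted).
    return np.concatenate(head_outputs, axis=-1), attention_graphs

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 64))   # toy sequence: 5 tokens, d_model = 64
out, graphs = multi_head_attention(x, num_heads=8, rng=rng)
print(out.shape)                    # (5, 64)
print(np.round(graphs[0], 2))       # one head's attention graph
```

Presumably, classifying the heads is intended to break this input consistency so that different groups of heads work from different inputs, though the concrete schemes are only described in the full paper.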



Author information


Corresponding author

Correspondence to Min Jiang.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, F., Jiang, M., Liu, F., Xu, D., Fan, Z., Wang, Y. (2022). Classification of Heads in Multi-head Attention Mechanisms. In: Memmi, G., Yang, B., Kong, L., Zhang, T., Qiu, M. (eds) Knowledge Science, Engineering and Management. KSEM 2022. Lecture Notes in Computer Science, vol. 13370. Springer, Cham. https://doi.org/10.1007/978-3-031-10989-8_54


  • DOI: https://doi.org/10.1007/978-3-031-10989-8_54

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-10988-1

  • Online ISBN: 978-3-031-10989-8

  • eBook Packages: Computer Science, Computer Science (R0)
