
Classification of Heads in Multi-head Attention Mechanisms

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2022)

Abstract

The Transformer has become the dominant modeling paradigm in deep learning, and multi-head attention is one of its critical components. While multi-head attention improves the Transformer's effectiveness, it also has issues. Once the number of heads grows past a certain point, some attention heads produce remarkably similar attention graphs, which indicates that these heads are performing repetitive calculations. Some heads may even focus on extraneous information, harming the final result. After analyzing the multi-head attention mechanism, this paper argues that the consistency of the inputs to the multi-head attention mechanism is the underlying reason for the similarity of the attention graphs between heads. For this reason, this paper proposes the concept of classifying the heads in the multi-head attention mechanism and summarizes the general classification process. Three classification schemes are designed for the Multi30k dataset. Experiments demonstrate that our method converges faster than the baseline model and that BLEU improves by 3.08–4.38 points over the baseline.
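To make the premise concrete, below is a minimal NumPy sketch of standard multi-head self-attention (in the sense of Vaswani et al.), not the classification scheme proposed in the paper; the toy dimensions, random projections, and function names are illustrative assumptions. It shows that every head is computed from the same input sequence and differs only in its projection matrices, which is the input consistency the abstract identifies as the reason heads can end up with nearly identical attention graphs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Standard multi-head self-attention over one sequence x of shape (seq_len, d_model).

    Every head receives the *same* input x; heads differ only in their
    (here randomly initialized) projection matrices. This is the
    "consistent input" situation the paper points to as a source of
    redundant, near-identical attention graphs.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    attention_graphs, head_outputs = [], []
    for _ in range(num_heads):
        # Per-head projections (toy initialization for illustration only).
        w_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        w_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        w_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        # Scaled dot-product attention; "weights" is this head's attention graph.
        weights = softmax(q @ k.T / np.sqrt(d_head))
        attention_graphs.append(weights)
        head_outputs.append(weights @ v)
    # Concatenate head outputs back to d_model (final output projection omitted).
    return np.concatenate(head_outputs, axis=-1), attention_graphs

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 64))   # toy sequence: 5 tokens, d_model = 64
out, graphs = multi_head_attention(x, num_heads=8, rng=rng)
print(out.shape)                    # (5, 64)
print(np.round(graphs[0], 2))       # one head's attention graph
```

Presumably, classifying the heads is intended to break this input consistency so that different groups of heads work from different inputs, though the concrete schemes are only described in the full paper.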



Author information


Corresponding author

Correspondence to Min Jiang.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, F., Jiang, M., Liu, F., Xu, D., Fan, Z., Wang, Y. (2022). Classification of Heads in Multi-head Attention Mechanisms. In: Memmi, G., Yang, B., Kong, L., Zhang, T., Qiu, M. (eds) Knowledge Science, Engineering and Management. KSEM 2022. Lecture Notes in Computer Science, vol. 13370. Springer, Cham. https://doi.org/10.1007/978-3-031-10989-8_54


  • DOI: https://doi.org/10.1007/978-3-031-10989-8_54

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-10988-1

  • Online ISBN: 978-3-031-10989-8

  • eBook Packages: Computer Science, Computer Science (R0)
