Neurocomputing
Volume 454, 24 September 2021, Pages 14-24

On the diversity of multi-head attention

https://doi.org/10.1016/j.neucom.2021.04.038

Abstract

Multi-head attention is appealing for its ability to jointly attend to information from different representation subspaces at different positions. In this work, we propose two approaches to better exploit such diversity for multi-head attention, which are complementary to each other. First, we introduce a disagreement regularization to explicitly encourage diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from the other heads. Second, we propose to better capture the diverse information distributed in the extracted partial-representations with the routing-by-agreement algorithm. The routing algorithm iteratively updates the proportion of how much a part (i.e. the distinct information learned from a specific subspace) should be assigned to a whole (i.e. the final output representation), based on the agreement between parts and wholes. Experimental results on machine translation, sentence encoding, and logical inference tasks demonstrate the effectiveness and universality of the proposed approaches, which indicates the necessity of better exploiting the diversity of multi-head attention. While the two strategies individually boost performance, combining them can further improve the model performance.

Introduction

The attention model has become a standard component of deep learning networks, contributing to impressive results in machine translation [1], [2], image captioning [3], and speech recognition [4], among many other applications. Recently, the performance of attention has been further improved by the multi-head mechanism [5], which concurrently performs the attention functions on different representation subspaces of the input sequence. Consequently, different attention heads are able to capture distinct properties of the input, which are embedded in different subspaces [6]. Subsequently, a linear transformation is generally employed to aggregate the partial representations extracted by different attention heads [5], [7], producing the final output representation.

However, the conventional multi-head mechanism may not fully exploit the diversity among attention heads. First, one strong point of multi-head attention is the ability to jointly attend to information from different representation subspaces at different positions, but currently there is no mechanism to guarantee that different attention heads indeed capture distinct information. Second, we believe that information extraction and information aggregation are both important for producing an informative representation, and we argue that the straightforward linear transformation is not expressive enough to fully capture the rich information distributed in the extracted partial-representations. In this work, we propose two strategies to better exploit the diversity of multi-head attention, namely disagreement regularization and advanced information aggregation.

In response to the first problem, we introduce a disagreement regularization term to explicitly encourage diversity among multiple attention heads. The disagreement regularization serves as an auxiliary objective to guide the training of the related attention component. Specifically, we propose three types of disagreement regularization, which are applied to the three key components involved in calculating the information vector with multi-head attention. Two of the regularization terms respectively maximize the cosine distances of the input subspaces and of the output representations, while the third disperses the positions attended by multiple heads via element-wise multiplication of the corresponding attention matrices. The three regularization terms can be used either individually or in combination.
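
To make the three terms concrete, the following is a minimal PyTorch sketch of how such disagreement terms could be computed from the per-head tensors; the tensor shapes, the flattening of per-head tensors before the cosine computation, the inclusion of self-pairs in the average, and the weight lambda_reg are illustrative assumptions rather than the exact formulation of Section 3.1.

```python
import torch
import torch.nn.functional as F

def disagreement_terms(values, attn, outputs):
    """Disagreement terms (to be maximized alongside the training objective).

    values  : [H, T, d]    per-head value projections        (subspace term)
    attn    : [H, Tq, Tk]  per-head attention weights        (position term)
    outputs : [H, T, d]    per-head output representations   (output term)
    """
    H = values.size(0)

    # Subspace disagreement: negative mean pairwise cosine similarity
    # between the (flattened) value projections of the heads.
    v = F.normalize(values.reshape(H, -1), dim=-1)
    d_subspace = -(v @ v.t()).mean()

    # Position disagreement: element-wise product of attention matrices;
    # heads attending to different positions yield a small product.
    d_position = -torch.einsum('hqk,gqk->hg', attn, attn).mean()

    # Output disagreement: negative mean pairwise cosine similarity
    # between the (flattened) per-head output representations.
    o = F.normalize(outputs.reshape(H, -1), dim=-1)
    d_output = -(o @ o.t()).mean()

    return d_subspace, d_position, d_output

# Example training objective (lambda_reg is a hypothetical weight):
#   loss = nll_loss - lambda_reg * (d_subspace + d_position + d_output)
```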

To address the second problem, we replace the standard linear transformation in conventional multi-head attention [5] with an advanced routing-by-agreement algorithm, to better aggregate the diverse information distributed in the extracted partial-representations. Specifically, we cast information aggregation as the assigning-parts-to-wholes problem [8], and investigate the effectiveness of the routing-by-agreement algorithm, which is an appealing approach to this problem [9], [10]. The routing algorithm iteratively updates the proportion of how much a part should be assigned to a whole, based on the agreement between parts and wholes.
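
As an illustration of the assigning-parts-to-wholes view, below is a minimal PyTorch sketch of the simpler dynamic-routing variant [9]; the EM variant [10] used later in the paper follows the same iterative part-to-whole assignment but models each whole with a Gaussian. The construction of the votes, the number of output capsules, and the iteration count are assumptions for the example, not the exact configuration in Section 3.2.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Shrink short vectors towards zero while keeping their direction."""
    norm2 = (s * s).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

def route_by_agreement(votes, n_iter=3):
    """Iteratively assign 'parts' (per-head votes) to 'wholes'.

    votes : [H, N, d]  vote of head h for output capsule n, e.g. obtained
                       with a learned linear map per (head, capsule) pair.
    Returns the N aggregated whole representations, shape [N, d].
    """
    H, N, _ = votes.shape
    b = torch.zeros(H, N)                          # routing logits
    for _ in range(n_iter):
        c = torch.softmax(b, dim=1)                # how much of each part goes to each whole
        s = (c.unsqueeze(-1) * votes).sum(dim=0)   # weighted sum of votes -> [N, d]
        v = squash(s)                              # whole representations
        b = b + (votes * v.unsqueeze(0)).sum(-1)   # update by part-whole agreement
    return v
```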

In addition, it is natural to combine the two types of approaches and apply them simultaneously, since the former focuses on extracting more diverse information while the latter aims to better aggregate the extracted information. We combine them by modifying both the training objective and the network architecture.

We evaluate the performance of the proposed approaches on three representative NLP tasks: machine translation, sentence encoding, and logical inference. For machine translation, we validate our approaches on top of the advanced Transformer model [5] on both the WMT14 English⇒German and WMT17 Chinese⇒English data. Experimental results show that our approaches consistently improve the translation performance across language pairs while maintaining computational efficiency. For sentence encoding, we evaluate on the linguistic probing tasks [11], which consist of 10 classification problems designed to study what linguistic properties are captured by input encoding representations. Probing analysis shows that our approaches indeed produce more informative representations, which embed more syntactic and semantic information. Experiments on logical inference further demonstrate the ability to model hierarchical structure. Precisely, our study reveals that:

  • Directly applying disagreement regularization to the output representations of multiple attention heads is the most effective.

  • The EM routing algorithm shows its superiority in information aggregation over the standard linear transformation and other aggregation algorithms.

  • Disagreement regularization and advanced information aggregation are complementary to each other, as indicated by the analyses on machine translation and sentence encoding.

This paper combines and extends results presented at the 2018 Conference on Empirical Methods in Natural Language Processing (entitled “Multi-Head Attention with Disagreement Regularization” [12]) and at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (entitled “Information Aggregation for Multi-Head Attention with Routing-by-Agreement” [13]). The extensions include

  • 1.

    We further refine our proposed model by combining the two types of strategies and exploiting the advantages of applying them simultaneously (Section 3.3). We demonstrate the effectiveness of the combined method in experiments (Table 5).

  • 2.

    We carry out more experiments and in-depth analyses to validate the effectiveness of our approaches on more tasks, including linguistic probing tasks (Section 4.2) and logical inference tasks (Section 4.3). Results on the linguistic probing tasks demonstrate the superiority of our approach in capturing surface, syntactic, and semantic information. Results on the logical inference tasks show that the proposed approach performs better at modeling hierarchical structure.

  • 3.

    We present a more comprehensive description of the proposed models and algorithms (Section 3).

  • 4.

    For reproducibility, we release the source code, preprocessed data, and trained models, which make it easy to reproduce the experiments in this work.1

Section snippets

Background

The attention mechanism aims at modeling the relevance between representation pairs, so that a representation can build a direct relation with another representation. Instead of performing a single attention function, Vaswani et al. [5] found it beneficial to capture different context features with multiple individual attention functions, namely multi-head attention. Fig. 1 shows an example of a two-head attention model. For the query word “Bush”, green and red heads pay attention to
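
For reference, below is a minimal PyTorch sketch of the standard multi-head attention of Vaswani et al. [5]; dropout, masking, and biases are omitted for brevity, and the module and parameter names are illustrative. The final linear map w_o is the aggregation step that this work later revisits.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Standard multi-head attention [5]: H scaled dot-product attentions run in
    parallel on different learned subspaces, then concatenated and passed
    through a linear layer (the aggregation step revisited in this paper)."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # linear aggregation of the heads

    def forward(self, q, k, v):
        B, T, _ = q.shape
        def split(x):  # [B, T, d_model] -> [B, H, T, d_k]
            return x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # [B, H, Tq, Tk]
        attn = scores.softmax(dim=-1)                        # per-head attention weights
        heads = attn @ v                                     # per-head partial representations
        out = heads.transpose(1, 2).reshape(B, T, self.h * self.d_k)
        return self.w_o(out)                                 # concatenate + linear map
```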

Approach

In this work, we propose to better exploit the diversity of multi-head attention from two perspectives:

  • Disagreement Regularization: Conventional multi-head attention conducts multiple attention functions in parallel (Eq. 2), while there is no mechanism to guarantee that different attention heads indeed capture distinct information. In response to this problem, we introduce disagreement regularizations to explicitly encourage different attention heads to extract distinct information (Section 3.1

Experiments

In this section, we validate the effectiveness of our approaches on machine translation tasks (Section 4.1), sentence encoding tasks (Section 4.2), and logical inference tasks (Section 4.3). We conduct an ablation study of the proposed approaches on the benchmark machine translation tasks, and carry out the final evaluation on all the other tasks.

Multi-head attention

Multi-head attention has shown promising empirical results in many NLP tasks, such as machine translation [5], [39], semantic role labeling [40], dialog [41], and the subject-verb agreement task [26]. The strength of multi-head attention lies in its rich expressiveness, obtained by using multiple attention functions in different representation subspaces.

Previous work shows that multi-head attention can be further enhanced by encouraging individual attention heads to extract distinct information. For example, Lin

Conclusion

In this work, we propose to better exploit the diversity of multi-head attention by incorporating disagreement regularization and employing advanced information aggregation. To this end, we propose several effective and efficient strategies to implement the disagreement regularization and the advanced information aggregation. We find that the output disagreement term and the EM routing algorithm yield the best performance and are complementary to each other. Experimental results on machine

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank the anonymous reviewers for their insightful comments and suggestions.

References (53)

  • D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, in: ICLR,...
  • M.-T. Luong, H. Pham, C. D. Manning, Effective Approaches to Attention-based Neural Machine Translation, in: EMNLP,...
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, Attend and Tell:...
  • J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based Models for Speech Recognition, in: NIPS,...
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention Is All You...
  • A. Raganato, J. Tiedemann, An Analysis of Encoder Representations in Transformer-Based Machine Translation, in: EMNLP...
  • K. Ahmed, N. S. Keskar, R. Socher, Weighted Transformer Network for Machine Translation, in: arXiv preprint...
  • G. E. Hinton, A. Krizhevsky, S. D. Wang, Transforming Auto-encoders, in: ICANN,...
  • S. Sabour, N. Frosst, G. E. Hinton, Dynamic Routing Between Capsules, in: NIPS,...
  • G. E. Hinton, S. Sabour, N. Frosst, Matrix Capsules with EM Routing, in: ICLR,...
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What You Can Cram into A Single $&!#* Vector: Probing...
  • J. Li, Z. Tu, B. Yang, M. R. Lyu, T. Zhang, Multi-Head Attention with Disagreement Regularization, in: EMNLP,...
  • J. Li, B. Yang, Z.-Y. Dou, X. Wang, M. R. Lyu, Z. Tu, Information aggregation for multi-head attention with...
  • P. Liang, B. Taskar, D. Klein, Alignment by agreement, in: NAACL,...
  • Y. Cheng, S. Shen, Z. He, W. He, H. Wu, M. Sun, Y. Liu, Agreement-based joint training for bidirectional...
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal Compact Bilinear Pooling for Visual...
  • H. Ben-Younes, R. Cadene, M. Cord, N. Thome, Mutan: Multimodal Tucker Fusion for Visual Question Answering, in: ICCV,...
  • Z. Dou, Z. Tu, X. Wang, S. Shi, T. Zhang, Exploiting deep representations for neural machine translation, in: EMNLP,...
  • W. Jiang, L. Ma, Y.-G. Jiang, W. Liu, T. Zhang, Recurrent fusion network for image captioning, arXiv preprint...
  • R. Sennrich, B. Haddow, A. Birch, Neural Machine Translation of Rare Words with Subword Units, in: ACL,...
  • K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A method for Automatic Evaluation of Machine Translation, in: ACL,...
  • P. Koehn, Statistical Significance Tests for Machine Translation Evaluation, in: EMNLP,...
  • D. P. Kingma, J. Ba, Adam: A method for stochastic optimization,...
  • C. Szegedy et al., Rethinking the inception architecture for computer vision

  • L. v. d. Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (Nov) (2008)...
  • G. Tang, M. Müller, A. Rios, R. Sennrich, Why Self-Attention? A Targeted Evaluation of Neural Machine Translation...

Jian Li is a Ph.D. candidate at The Chinese University of Hong Kong. He received his bachelor degree at University of Electronic Science and Technology of China, Chengdu, China, in 2015. His research interests include machine translation, question answering, and information retrieval.

Xing Wang is a researcher with the Tencent AI Lab, Shenzhen, China. He received his Ph.D. degree from Soochow University in 2018. His research interests include statistical machine translation and neural machine translation.

Zhaopeng Tu is a Principal Researcher with the Tencent AI Lab, Shenzhen, China. He received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences in 2013. He was a Postdoctoral Researcher at the University of California at Davis from 2013 to 2014. He was a researcher at Huawei Noah's Ark Lab, Hong Kong from 2014 to 2017. His research focuses on deep learning for natural language processing.

Michael R. Lyu received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, the M.S. degree in computer engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, and the Ph.D. degree in computer engineering from the University of California at Los Angeles, Los Angeles, CA, USA. He was with the Jet Propulsion Laboratory, Pasadena, CA, USA, Telcordia Technologies, Piscataway, NJ, USA, and Bell Laboratories, Murray Hill, NJ, USA, and taught at The University of Iowa, Iowa City, IA, USA. He is currently a Professor with the Computer Science and Engineering Department, The Chinese University of Hong Kong, Hong Kong. He has participated in more than 30 industrial projects and authored more than 500 papers. Dr. Lyu is a fellow of the Institute of Electrical and Electronics Engineers (IEEE) and the Association for Computing Machinery (ACM). His current research interests include software engineering, distributed systems, multimedia technologies, machine learning, and social computing.
