Neurocomputing
Volume 454, 24 September 2021, Pages 14-24

On the diversity of multi-head attention

https://doi.org/10.1016/j.neucom.2021.04.038

Abstract

Multi-head attention is appealing for its ability to jointly attend to information from different representation subspaces at different positions. In this work, we propose two approaches to better exploit such diversity for multi-head attention, which are complementary to each other. First, we introduce a disagreement regularization to explicitly encourage diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from the other heads. Second, we propose to better capture the diverse information distributed in the extracted partial-representations with the routing-by-agreement algorithm. The routing algorithm iteratively updates the proportion of how much a part (i.e. the distinct information learned from a specific subspace) should be assigned to a whole (i.e. the final output representation), based on the agreement between parts and wholes. Experimental results on machine translation, sentence encoding, and logical inference tasks demonstrate the effectiveness and universality of the proposed approaches, which indicates the necessity of better exploiting the diversity of multi-head attention. While the two strategies individually boost performance, combining them can further improve the model performance.

Introduction

The attention model has become a standard component of deep learning networks, contributing to impressive results in machine translation [1], [2], image captioning [3], and speech recognition [4], among many other applications. Recently, the performance of attention has been further improved by the multi-head mechanism [5], which concurrently performs the attention functions on different representation subspaces of the input sequence. Consequently, different attention heads are able to capture distinct properties of the input, which are embedded in different subspaces [6]. Subsequently, a linear transformation is generally employed to aggregate the partial representations extracted by different attention heads [5], [7], producing the final output representation.

However, the conventional multi-head mechanism may not fully exploit the diversity among attention heads. First, one strong point of multi-head attention is the ability to jointly attend to information from different representation subspaces at different positions, but currently there is no mechanism to guarantee that different attention heads indeed capture distinct information. Second, we believe that information extraction and information aggregation are both important for producing an informative representation, and we argue that the straightforward linear transformation is not expressive enough to fully capture the rich information distributed in the extracted partial-representations. In this work, we propose two strategies to better exploit the diversity of multi-head attention, namely disagreement regularization and advanced information aggregation.

In response to the first problem, we introduce a disagreement regularization term to explicitly encourage diversity among multiple attention heads. The disagreement regularization serves as an auxiliary objective to guide the training of the related attention component. Specifically, we propose three types of disagreement regularization, which are applied to the three key components involved in calculating the information vector with multi-head attention. Two of the regularization terms respectively maximize the cosine distances of the input subspaces and of the output representations, while the third disperses the positions attended by multiple heads via element-wise multiplication of the corresponding attention matrices. The three regularization terms can be used either individually or in combination.
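
To make the three terms concrete, the following is a minimal PyTorch sketch of how such disagreement terms could be computed from the per-head tensors; the tensor shapes, the flattening of per-head tensors before the cosine computation, the inclusion of self-pairs in the average, and the weight lambda_reg are illustrative assumptions rather than the exact formulation of Section 3.1.

```python
import torch
import torch.nn.functional as F

def disagreement_terms(values, attn, outputs):
    """Disagreement terms (to be maximized alongside the training objective).

    values  : [H, T, d]    per-head value projections        (subspace term)
    attn    : [H, Tq, Tk]  per-head attention weights        (position term)
    outputs : [H, T, d]    per-head output representations   (output term)
    """
    H = values.size(0)

    # Subspace disagreement: negative mean pairwise cosine similarity
    # between the (flattened) value projections of the heads.
    v = F.normalize(values.reshape(H, -1), dim=-1)
    d_subspace = -(v @ v.t()).mean()

    # Position disagreement: element-wise product of attention matrices;
    # heads attending to different positions yield a small product.
    d_position = -torch.einsum('hqk,gqk->hg', attn, attn).mean()

    # Output disagreement: negative mean pairwise cosine similarity
    # between the (flattened) per-head output representations.
    o = F.normalize(outputs.reshape(H, -1), dim=-1)
    d_output = -(o @ o.t()).mean()

    return d_subspace, d_position, d_output

# Example training objective (lambda_reg is a hypothetical weight):
#   loss = nll_loss - lambda_reg * (d_subspace + d_position + d_output)
```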

To address the second problem, we replace the standard linear transformation in conventional multi-head attention [5] with an advanced routing-by-agreement algorithm, to better aggregate the diverse information distributed in the extracted partial-representations. Specifically, we cast information aggregation as the assigning-parts-to-wholes problem [8], and investigate the effectiveness of the routing-by-agreement algorithm, which is an appealing approach to this problem [9], [10]. The routing algorithm iteratively updates the proportion of how much a part should be assigned to a whole, based on the agreement between parts and wholes.
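
As an illustration of the assigning-parts-to-wholes view, below is a minimal PyTorch sketch of the simpler dynamic-routing variant [9]; the EM variant [10] used later in the paper follows the same iterative part-to-whole assignment but models each whole with a Gaussian. The construction of the votes, the number of output capsules, and the iteration count are assumptions for the example, not the exact configuration in Section 3.2.

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Shrink short vectors towards zero while keeping their direction."""
    norm2 = (s * s).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

def route_by_agreement(votes, n_iter=3):
    """Iteratively assign 'parts' (per-head votes) to 'wholes'.

    votes : [H, N, d]  vote of head h for output capsule n, e.g. obtained
                       with a learned linear map per (head, capsule) pair.
    Returns the N aggregated whole representations, shape [N, d].
    """
    H, N, _ = votes.shape
    b = torch.zeros(H, N)                          # routing logits
    for _ in range(n_iter):
        c = torch.softmax(b, dim=1)                # how much of each part goes to each whole
        s = (c.unsqueeze(-1) * votes).sum(dim=0)   # weighted sum of votes -> [N, d]
        v = squash(s)                              # whole representations
        b = b + (votes * v.unsqueeze(0)).sum(-1)   # update by part-whole agreement
    return v
```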

In addition, it is natural to combine the two types of approaches and apply them simultaneously, since the former focuses on extracting more diverse information while the latter aims to better aggregate the extracted information. We combine them by modifying both the training objective and the network architecture.

We evaluate the performance of the proposed approaches on three representative NLP tasks: machine translation, sentence encoding, and logical inference. For machine translation, we validate our approaches on top of the advanced Transformer model [5] on both the WMT14 English⇒German and WMT17 Chinese⇒English data. Experimental results show that our approaches consistently improve the translation performance across language pairs while maintaining computational efficiency. For sentence encoding, we evaluate on the linguistic probing tasks [11], which consist of 10 classification problems designed to study what linguistic properties are captured by input encoding representations. Probing analysis shows that our approaches indeed produce more informative representations, which embed more syntactic and semantic information. Experiments on logical inference further demonstrate the ability to model hierarchical structure. Precisely, our study reveals that:

  • Directly applying disagreement regularization to the output representations of multiple attention heads is the most effective.

  • The EM routing algorithm shows its superiority in information aggregation over the standard linear transformation and other aggregation algorithms.

  • Disagreement regularization and advanced information aggregation are complementary to each other, as indicated by the analyses on machine translation and sentence encoding.

This paper combines and extends results presented at the 2018 Conference on Empirical Methods in Natural Language Processing (entitled “Multi-Head Attention with Disagreement Regularization” [12]) and at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (entitled “Information Aggregation for Multi-Head Attention with Routing-by-Agreement” [13]). The extensions include

  • 1.

    We further refine our proposed model by combining the two types of strategies and exploiting the advantages of applying them simultaneously (Section 3.3). We demonstrate the effectiveness of the combined method in experiments (Table 5).

  • 2.

    We carry out more experiments and in-depth analyses to validate the effectiveness of our approaches on more tasks, including linguistic probing tasks (Section 4.2) and logical inference tasks (Section 4.3). Results on the linguistic probing tasks demonstrate the superiority of our approach in capturing surface, syntactic, and semantic information. Results on the logical inference tasks show that the proposed approach performs better at modeling hierarchical structure.

  • 3.

    We present a more comprehensive description of the proposed models and algorithms (Section 3).

  • 4.

    For reproducibility, we release the source code, preprocessed data, and trained models, which make it easy to reproduce the experiments in this work.1

Section snippets

Background

The attention mechanism aims at modeling the relevance between representation pairs, so that a representation can build a direct relation with another representation. Instead of performing a single attention function, Vaswani et al. [5] found it beneficial to capture different context features with multiple individual attention functions, namely multi-head attention. Fig. 1 shows an example of a two-head attention model. For the query word “Bush”, green and red heads pay attention to
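
For reference, below is a minimal PyTorch sketch of the standard multi-head attention of Vaswani et al. [5]; dropout, masking, and biases are omitted for brevity, and the module and parameter names are illustrative. The final linear map w_o is the aggregation step that this work later revisits.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Standard multi-head attention [5]: H scaled dot-product attentions run in
    parallel on different learned subspaces, then concatenated and passed
    through a linear layer (the aggregation step revisited in this paper)."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # linear aggregation of the heads

    def forward(self, q, k, v):
        B, T, _ = q.shape
        def split(x):  # [B, T, d_model] -> [B, H, T, d_k]
            return x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # [B, H, Tq, Tk]
        attn = scores.softmax(dim=-1)                        # per-head attention weights
        heads = attn @ v                                     # per-head partial representations
        out = heads.transpose(1, 2).reshape(B, T, self.h * self.d_k)
        return self.w_o(out)                                 # concatenate + linear map
```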

Approach

In this work, we propose to better exploit the diversity of multi-head attention from two perspectives:

  • Disagreement Regularization: Conventional multi-head attention conducts multiple attention functions in parallel (Eq. 2), while there is no mechanism to guarantee that different attention heads indeed capture distinct information. In response to this problem, we introduce disagreement regularizations to explicitly encourage different attention heads to extract distinct information (Section 3.1

Experiments

In this section, we validate the effectiveness of our approaches on machine translation tasks (Section 4.1), sentence encoding tasks (Section 4.2), and logical inference tasks (Section 4.3). We conduct an ablation study of the proposed approaches on the benchmark machine translation tasks, and carry out the final evaluation on all the other tasks.

Multi-head attention

Multi-head attention has shown promising empirical results in many NLP tasks, such as machine translation [5], [39], semantic role labeling [40], dialog [41], and the subject-verb agreement task [26]. The strength of multi-head attention lies in its rich expressiveness, obtained by using multiple attention functions in different representation subspaces.

Previous work shows that multi-head attention can be further enhanced by encouraging individual attention heads to extract distinct information. For example, Lin

Conclusion

In this work, we propose to better exploit the diversity of multi-head attention by incorporating disagreement regularization and employing advanced information aggregation. To this end, we propose several effective and efficient strategies to implement the disagreement regularization and the advanced information aggregation. We find that the output disagreement term and the EM routing algorithm yield the best performance and are complementary to each other. Experimental results on machine

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank the anonymous reviewers for their insightful comments and suggestions.

References (53)

  • D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, in: ICLR,...
  • M.-T. Luong, H. Pham, C. D. Manning, Effective Approaches to Attention-based Neural Machine Translation, in: EMNLP,...
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, Attend and Tell:...
  • J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, Attention-based Models for Speech Recognition, in: NIPS,...
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention Is All You...
  • A. Raganato, J. Tiedemann, An Analysis of Encoder Representations in Transformer-Based Machine Translation, in: EMNLP...
  • K. Ahmed, N. S. Keskar, R. Socher, Weighted Transformer Network for Machine Translation, in: arXiv preprint...
  • G. E. Hinton, A. Krizhevsky, S. D. Wang, Transforming Auto-encoders, in: ICANN,...
  • S. Sabour, N. Frosst, G. E. Hinton, Dynamic Routing Between Capsules, in: NIPS,...
  • G. E. Hinton, S. Sabour, N. Frosst, Matrix Capsules with EM Routing, in: ICLR,...
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What You Can Cram into A Single $&!#* Vector: Probing...
  • J. Li, Z. Tu, B. Yang, M. R. Lyu, T. Zhang, Multi-Head Attention with Disagreement Regularization, in: EMNLP,...
  • J. Li, B. Yang, Z.-Y. Dou, X. Wang, M. R. Lyu, Z. Tu, Information aggregation for multi-head attention with...
  • P. Liang, B. Taskar, D. Klein, Alignment by agreement, in: NAACL,...
  • Y. Cheng, S. Shen, Z. He, W. He, H. Wu, M. Sun, Y. Liu, Agreement-based joint training for bidirectional...
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal Compact Bilinear Pooling for Visual...
  • H. Ben-Younes, R. Cadene, M. Cord, N. Thome, Mutan: Multimodal Tucker Fusion for Visual Question Answering, in: ICCV,...
  • Z. Dou, Z. Tu, X. Wang, S. Shi, T. Zhang, Exploiting deep representations for neural machine translation, in: EMNLP,...
  • W. Jiang, L. Ma, Y.-G. Jiang, W. Liu, T. Zhang, Recurrent fusion network for image captioning, arXiv preprint...
  • R. Sennrich, B. Haddow, A. Birch, Neural Machine Translation of Rare Words with Subword Units, in: ACL,...
  • K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: A method for Automatic Evaluation of Machine Translation, in: ACL,...
  • P. Koehn, Statistical Significance Tests for Machine Translation Evaluation, in: EMNLP,...
  • D. P. Kingma, J. Ba, Adam: A method for stochastic optimization,...
  • C. Szegedy et al., Rethinking the inception architecture for computer vision

  • L. v. d. Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (Nov) (2008)...
  • G. Tang, M. Müller, A. Rios, R. Sennrich, Why Self-Attention? A Targeted Evaluation of Neural Machine Translation...

Jian Li is a Ph.D. candidate at The Chinese University of Hong Kong. He received his bachelor degree at University of Electronic Science and Technology of China, Chengdu, China, in 2015. His research interests include machine translation, question answering, and information retrieval.

Xing Wang is a researcher with the Tencent AI Lab, Shenzhen, China. He received his Ph.D. degree from Soochow University in 2018. His research interests include statistical machine translation and neural machine translation.

Zhaopeng Tu is a Principal Researcher with the Tencent AI Lab, Shenzhen, China. He received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences in 2013. He was a Postdoctoral Researcher at the University of California at Davis from 2013 to 2014. He was a researcher at Huawei Noah's Ark Lab, Hong Kong from 2014 to 2017. His research focuses on deep learning for natural language processing.

Michael R. Lyu received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, the M.S. degree in computer engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, and the Ph.D. degree in computer engineering from the University of California at Los Angeles, Los Angeles, CA, USA. He was with the Jet Propulsion Laboratory, Pasadena, CA, USA, Telcordia Technologies, Piscataway, NJ, USA, and Bell Laboratories, Murray Hill, NJ, USA, and taught at The University of Iowa, Iowa City, IA, USA. He is currently a Professor with the Computer Science and Engineering Department, The Chinese University of Hong Kong, Hong Kong. He has participated in more than 30 industrial projects and authored more than 500 papers. Dr. Lyu is a fellow of the Institute of Electrical and Electronics Engineers (IEEE) and the Association for Computing Machinery (ACM). His current research interests include software engineering, distributed systems, multimedia technologies, machine learning, and social computing.
