BERT-JAM: Maximizing the utilization of BERT for neural machine translation
Introduction
Pre-training has been demonstrated to be a highly effective method for boosting the performance of many natural language processing (NLP) tasks, such as question answering and text classification. By training on massive unlabeled text data, pre-trained language models learn contextual representations of input tokens that are extremely helpful for downstream tasks. BERT [1], one of the most widely used pre-trained language models, is trained with two unsupervised tasks, namely masked language modeling and next sentence prediction. By adding a few layers on top, BERT can be easily adapted into a task-specific model, which is then fine-tuned on labeled data to achieve optimal performance. This practice has been exercised in various NLP scenarios and has achieved many state-of-the-art (SOTA) results.
The study of BERT-enhanced NMT, i.e., leveraging pre-trained BERT for neural machine translation, has received much research interest. However, exploiting BERT for NMT cannot be accomplished simply by adding a few layers atop BERT and fine-tuning, as in monolingual NLP tasks. The challenge stems from the complexity of NMT models, which typically consist of an encoder that transforms the source-language tokens into a hidden representation and a decoder that predicts the target-language tokens based on that hidden representation.
Although various methods for integrating NMT with BERT have been proposed, it remains an open question how to maximize the utilization of BERT for NMT. We identify three aspects of the research gap. Firstly, most previous studies on BERT-enhanced NMT only utilize BERT’s last-layer representation without making full use of the intermediate layers. It has been shown that BERT’s different layers encode different linguistic features [2], which can be exploited to boost the performance of NMT models. The dynamic fusion method proposed by Weng et al. [3] is, to our knowledge, the only work on utilizing BERT’s multi-layer representations for NMT. However, their method conditions the weights for combining the multi-layer representations on the instance-specific representations of the encoder layers. Consequently, it cannot be applied to the decoder layers because the ground-truth decoder representations are not available at inference time. Hence, how to make the most of BERT’s intermediate layers for both the NMT encoder and decoder requires further exploration.
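The instance-independent alternative described here can be pictured as a softmax-normalized scalar weight per BERT layer, learned as a model parameter and shared by all inputs. The following NumPy sketch is only an illustration of that general idea, not the paper's exact formulation; the class name `LayerFusion` and its details are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

class LayerFusion:
    """Sketch: combine BERT's L layer outputs with scalar weights that are
    learned parameters shared across all input instances (in contrast to
    dynamic fusion, which conditions weights on each instance)."""

    def __init__(self, num_layers, seed=0):
        rng = np.random.default_rng(seed)
        # one scalar weight per BERT layer; trained jointly with the model
        self.w = rng.normal(size=num_layers)

    def __call__(self, layer_outputs):
        # layer_outputs: list of L arrays, each of shape (seq_len, d_model)
        alpha = softmax(self.w)            # instance-independent weights
        stacked = np.stack(layer_outputs)  # (L, seq_len, d_model)
        # weighted sum over the layer axis -> (seq_len, d_model)
        return np.tensordot(alpha, stacked, axes=1)
```

Because the weights do not depend on any decoder state, the same mechanism can be attached to decoder layers, where ground-truth representations are unavailable at inference.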
Secondly, it remains unclear how the NMT encoder and decoder can utilize the BERT representation efficiently. Various methods have been proposed, e.g., transforming the encoder representation by adding the BERT representation to it [3], or using distillation to transfer knowledge from BERT to the NMT model [4]. The recently proposed BERT-fused model [5] employs attention mechanisms to fuse BERT representations with the encoder/decoder layers and achieves state-of-the-art results. Specifically, it introduces a BERT-encoder cross-attention module into each encoder layer and averages its output with that of the self-attention module, as illustrated in Fig. 1(a). However, we find that averaging the two outputs caps the contribution of the BERT representation, preventing the model from attending to it more heavily when doing so would help.
Thirdly, previous works propose keeping the BERT parameters fixed during training to avoid catastrophic forgetting [4], [6]. Catastrophic forgetting refers to the problem that the knowledge acquired by BERT during pre-training is gradually lost during fine-tuning. Although catastrophic forgetting is a common problem with fine-tuning, it is exacerbated in BERT-enhanced NMT because NMT models take many more iterations to converge. However, forgoing fine-tuning of BERT because of the forgetting problem prevents us from fully optimizing the model.
To fill the research gap, we propose a BERT-enhanced NMT model named BERT-JAM, which stands for BERT-fused Joint-Attention Model. We seek to maximize the utilization of BERT for NMT in three ways. Firstly, we incorporate into each encoder/decoder layer a fusion module that composes BERT’s multi-layer representations into a fused representation. In contrast to the dynamic fusion method [3], we let the weights for combining the multi-layer representations be shared among different instances. Not only can our fusion method be applied to both the encoder and the decoder, but it also allows each encoder/decoder layer to have a consistent preference for certain BERT layers. Secondly, we propose a joint-attention module for integrating the fused BERT representation with the encoder and decoder layers, as shown in Fig. 1(b). Different from the BERT-fused model, which uses separate attention modules (i.e., a cross-attention module and a self-attention module) whose outputs are averaged, the joint-attention module simultaneously performs the functions of the different attention modules and dynamically allocates attention between the BERT representation and the encoder/decoder representation. Thirdly, we train BERT-JAM with a three-phase optimization strategy that progressively unfreezes different parts of the model to overcome the problem of catastrophic forgetting. The strategy allows the model to benefit from the performance boost brought by fine-tuning BERT.
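A joint-attention step of this kind can be sketched as a single scaled dot-product attention whose keys and values span the concatenation of the layer's own states and the fused BERT representation, so one softmax allocates attention mass across both sources rather than averaging two separate attention outputs. The function below is a minimal single-head NumPy illustration under assumed shapes, not the paper's exact module.

```python
import numpy as np

def joint_attention(queries, self_states, bert_states):
    """One softmax over the concatenation of the layer's own states and the
    BERT representation: attention is allocated dynamically between the two
    sources, so BERT can receive more (or less) than half the attention mass.
    queries: (q, d); self_states: (m, d); bert_states: (n, d)."""
    d = queries.shape[-1]
    memory = np.concatenate([self_states, bert_states], axis=0)  # (m + n, d)
    scores = queries @ memory.T / np.sqrt(d)                     # (q, m + n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax rows
    return weights @ memory, weights
```

By contrast, averaging a self-attention output with a cross-attention output fixes each source's contribution at one half, regardless of how informative the BERT representation is for a given token.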
We summarize the contributions of this paper as follows:
- We propose a novel BERT-enhanced NMT model named BERT-JAM which is equipped with fusion modules and joint-attention modules. The fusion module composes BERT’s multi-layer representations into a fused representation, allowing BERT-JAM to benefit from the linguistic features encoded by BERT’s intermediate layers. The joint-attention module integrates the functions of different attention modules and dynamically allocates attention between different representations.
- We propose a three-phase optimization strategy for training BERT-JAM that overcomes the catastrophic forgetting problem while fine-tuning.
- This is the first work to study how the performance of BERT-enhanced NMT models varies with the size of BERT. We find that, under constrained memory usage and latency bounds, deeper BERT models (more layers) are better at boosting translation performance than wider ones (larger hidden dimension).
- We evaluate the proposed BERT-JAM model on several widely used translation tasks. Experimental results show that BERT-JAM achieves SOTA scores on multiple benchmarks, demonstrating the effectiveness of our method.
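A progressive-unfreezing schedule of the kind the three-phase strategy describes can be sketched as follows. The grouping of components per phase below is an assumption for exposition only, not the paper's exact schedule; the `trainable` flag stands in for a framework's `requires_grad`.

```python
def set_trainable(param_groups, phase):
    """Progressive unfreezing sketch.
    param_groups: dict mapping component name -> list of params, where each
    param is a dict carrying a 'trainable' flag (stand-in for requires_grad).
    The per-phase grouping is illustrative, not the paper's exact schedule."""
    schedule = {
        1: {"nmt"},                    # phase 1: train NMT body, BERT frozen
        2: {"nmt", "fusion"},          # phase 2: also tune the fusion/attention glue
        3: {"nmt", "fusion", "bert"},  # phase 3: unfreeze BERT for fine-tuning
    }
    for name, params in param_groups.items():
        for p in params:
            p["trainable"] = name in schedule[phase]
```

Unfreezing BERT only in the final phase lets the rest of the model converge first, so the long NMT training run spends fewer updates perturbing the pre-trained weights.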
The rest of this paper is organized as follows. In Section 2, we detail our approach in terms of module description, model architecture, and optimization strategy. Section 3 introduces the experimental setups. In Section 4, we conduct comprehensive experiments with BERT-JAM and discuss the results. We give a review of related works in Section 5 and the conclusions are drawn in Section 6.
Section snippets
Approach
This section presents our proposed approach to boosting machine translation performance with BERT. We begin by introducing the background of BERT-enhanced NMT. Then, we introduce two building blocks of our model, i.e., the joint-attention module and the fusion module. Next, we elaborate on the architecture of BERT-JAM. Finally, we describe the three-phase optimization strategy used to train the model.
Experimental setup
This section details the experimental setup in terms of data preparation, model configuration, training settings, and evaluation metrics. We implement our model on top of the Fairseq repository. The source code for our work has been made publicly available.
Experiments and results
This section introduces the experiments we conduct to evaluate our method. We first report the results on the benchmark translation tasks in both low-resource and high-resource scenarios. Then we work on the IWSLT’14 En-De task where we vary the size of BERT to study how it affects the translation performance. Finally, we conduct ablation studies to justify the design choices in our proposed model.
Neural machine translation
Neural machine translation (NMT), which aims to use artificial neural networks for translation between human languages, has drawn continuous research attention over the past decade. NMT is proposed in contrast to traditional statistical machine translation (SMT) such as phrase-based SMT. Instead of tuning many sub-components, which requires heavy engineering, NMT uses a simpler end-to-end model to generate translations word by word. The main difference that distinguishes an NMT model
Conclusions
In this work, we identify the research gap in maximizing the utilization of BERT for translation tasks and propose a BERT-enhanced NMT model called BERT-JAM to fill this gap in three ways. Firstly, we equip BERT-JAM with fusion modules for fusing BERT’s multi-layer representations to allow the model to leverage the linguistic features encoded by BERT’s intermediate layers. Secondly, we propose a joint-attention module to allow the encoder/decoder layers to dynamically allocate attention between
CRediT authorship contribution statement
Zhebin Zhang: Conceptualization, Methodology, Software, Writing - original draft. Sai Wu: Data curation, Investigation, Validation. Dawei Jiang: Writing - review & editing, Formal analysis, Visualization. Gang Chen: Supervision, Project administration, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the Key Research and Development Program of Zhejiang Province of China (Grant No. 2020C01024); the Natural Science Foundation of Zhejiang Province of China (Grant No. LY18F020005); and the National Natural Science Foundation of China (Grant No. 61872315).
References (42)
- J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language...
- G. Jawahar, B. Sagot, D. Seddah, What Does BERT Learn about the Structure of Language?, in: Proceedings of the 57th...
- R. Weng et al., Acquiring knowledge from pre-trained model to neural machine translation
- J. Yang, M. Wang, H. Zhou, C. Zhao, W. Zhang, Y. Yu, L. Li, Towards Making the Most of BERT in Neural Machine...
- J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, T. Liu, Incorporating BERT into neural machine translation, in:...
- I.J. Goodfellow, M. Mirza, D. Xiao, A. Courville, Y. Bengio, An Empirical Investigation of Catastrophic Forgetting in...
- A. Vaswani et al., Attention is all you need
- Y.N. Dauphin et al., Language Modeling with Gated Convolutional Networks (2017)
- K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: 2016 IEEE Conference on Computer...
- L.J. Ba, J.R. Kiros, G.E. Hinton, Layer...
- M. Cettolo et al., WIT3: Web Inventory of Transcribed and Translated Talks
- R. Sennrich et al., Neural machine translation of rare words with subword units
- D.P. Kingma, J. Ba, Adam: A method for stochastic optimization
- F. Wu et al., Pay less attention with lightweight and dynamic convolutions
- Multi-branch Attentive Transformer
Zhebin Zhang received his B.S. degree from Zhejiang University in 2016. Currently, he is pursuing the Ph.D. degree in computer science and technology at the same university. His research interests include natural language processing and machine learning, mainly focusing on machine translation.
Sai Wu received his Ph.D. degree from National University of Singapore (NUS) in 2011 and is currently an associate professor at College of Computer Science, Zhejiang University. His research interests include distributed databases, cloud systems, indexing techniques and AI-powered databases. He has served as a Program Committee member for VLDB, ICDE, SIGMOD, KDD and CIKM.
Dawei Jiang received his Ph.D. degree from Southeast University in 2008 and is currently a senior researcher at Zhejiang University. His research interests include distributed and parallel database systems, cloud data management, big data management and financial technology. He has published over 20 research articles in international journals and conference proceedings.
Gang Chen received his Ph.D. degree from Zhejiang University in 1998. He is currently a professor at College of Computer Science, Zhejiang University. He has successfully led the investigation in research projects aimed at building China’s indigenous database management systems. His research interests range from relational database systems to largescale data management technologies. He is a member of ACM and a senior member of CCF.