Neurocomputing

Volume 460, 14 October 2021, Pages 84-94

BERT-JAM: Maximizing the utilization of BERT for neural machine translation

https://doi.org/10.1016/j.neucom.2021.07.002

Highlights

  • Employs joint attention to incorporate BERT into NMT models.

  • Makes use of the representations of BERT’s intermediate layers.

  • Employs a three-phase optimization strategy to overcome catastrophic forgetting.

  • Studies how the size of BERT impacts the performance of NMT models.

Abstract

Pre-training-based approaches have proven effective for a wide range of natural language processing tasks. Leveraging BERT for neural machine translation (NMT), which we refer to as BERT-enhanced NMT, has received increasing interest in recent years. However, there is still a research gap in how to maximize the utilization of BERT for NMT tasks. Firstly, previous studies mostly focus on utilizing BERT’s last-layer representation, neglecting the linguistic features encoded by the intermediate layers. Secondly, further architectural exploration is needed to integrate the BERT representation with the NMT encoder/decoder layers efficiently. Thirdly, existing methods keep the BERT parameters fixed during training to avoid the catastrophic forgetting problem, forgoing the performance gains attainable through fine-tuning. In this paper, we propose BERT-JAM to fill this research gap from three aspects: 1) we equip BERT-JAM with fusion modules for composing BERT’s multi-layer representations into a fused representation that can be leveraged by the NMT model, 2) BERT-JAM utilizes joint-attention modules to allow the BERT representation to be dynamically integrated with the encoder/decoder representations, and 3) we train BERT-JAM with a three-phase optimization strategy that progressively unfreezes different components to overcome catastrophic forgetting during fine-tuning. Experimental results show that BERT-JAM achieves state-of-the-art BLEU scores on multiple translation tasks.

Introduction

Pre-training has been demonstrated to be a highly effective method for boosting the performance of many natural language processing (NLP) tasks, such as question answering and text classification. By training on massive unlabeled text data, pre-trained language models learn contextual representations of input tokens, which are extremely helpful for accomplishing downstream tasks. BERT [1], one of the most widely used pre-trained language models, is trained using two unsupervised tasks, namely masked language modeling and next sentence prediction. By adding a few layers on top, BERT can easily be adapted into a task-specific model, which is then fine-tuned on labeled data to achieve optimal performance. This practice has been applied in various NLP scenarios and has achieved many state-of-the-art (SOTA) results.
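
As a concrete illustration of this adapt-and-fine-tune pattern (not taken from the paper), the following minimal sketch adds a classification head on top of a pre-trained BERT using the Hugging Face transformers library; the checkpoint name and the single linear head are illustrative assumptions.

```python
import torch.nn as nn
from transformers import BertModel  # assumes the Hugging Face transformers library is available

class BertClassifier(nn.Module):
    """A pre-trained BERT with a small task-specific head added on top."""
    def __init__(self, num_labels: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # hypothetical checkpoint choice
        self.head = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # contextual representation of the [CLS] token
        return self.head(cls)              # the whole model is then fine-tuned on labeled data
```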

The study of BERT-enhanced NMT, i.e., leveraging pre-trained BERT for neural machine translation, has received much research interest. However, exploiting BERT for NMT cannot be accomplished simply by adding a few layers atop BERT and then fine-tuning, as in monolingual NLP tasks. The challenge stems from the complexity of NMT models, which typically consist of an encoder that transforms the source-language tokens into a hidden representation and a decoder that predicts the target-language tokens based on that hidden representation.
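
For reference, the standard encoder-decoder structure described above can be sketched with PyTorch's built-in Transformer module; this illustrates the generic NMT architecture, not the paper's model, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

# Generic Transformer encoder-decoder: the encoder maps source embeddings to hidden
# states, and the decoder predicts target tokens conditioned on those hidden states.
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6,
                       num_decoder_layers=6, batch_first=True)
src = torch.randn(2, 10, 512)  # (batch, src_len, d_model) source token embeddings
tgt = torch.randn(2, 9, 512)   # (batch, tgt_len, d_model) shifted target embeddings
hidden = model(src, tgt)       # (batch, tgt_len, d_model); a vocabulary projection follows in practice
```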

Although various methods for integrating NMT with BERT have been proposed, how to maximize the utilization of BERT for NMT remains an open question. We identify three aspects of this research gap. Firstly, most previous studies on BERT-enhanced NMT only utilize BERT’s last-layer representation without making full use of the intermediate layers. It has been shown that BERT’s different layers encode different linguistic features [2], which can be exploited to boost the performance of NMT models. To our knowledge, the dynamic fusion method proposed by Weng et al. [3] is the only prior work that utilizes BERT’s multi-layer representations for NMT. However, their method conditions the weights for combining the multi-layer representations on the instance-specific representations of the encoder layers. Consequently, it cannot be applied to the decoder layers, because the ground-truth decoder representations are not available at inference time. Hence, how to make the most of BERT’s intermediate layers for both the NMT encoder and decoder requires further exploration.
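
To make the contrast concrete, below is a minimal sketch of instance-independent fusion weights in PyTorch: one learnable scalar per BERT layer, shared by all instances and turned into mixing weights with a softmax. The exact parameterization of BERT-JAM's fusion module is given in Section 2; this is only an assumed illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLayerFusion(nn.Module):
    """Compose BERT's multi-layer outputs with weights shared across all instances."""
    def __init__(self, num_bert_layers: int):
        super().__init__()
        # One learnable scalar per BERT layer, independent of the input instance.
        self.layer_logits = nn.Parameter(torch.zeros(num_bert_layers))

    def forward(self, bert_layer_outputs):
        # bert_layer_outputs: list of (batch, src_len, hidden) tensors, one per BERT layer.
        stacked = torch.stack(bert_layer_outputs, dim=0)   # (L, batch, src_len, hidden)
        weights = F.softmax(self.layer_logits, dim=0)      # (L,), same for every instance
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return fused                                       # (batch, src_len, hidden)
```

Because the weights do not depend on any decoder-side representation, a module of this form can be attached to decoder layers at inference time as well.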

Secondly, how to let the NMT encoder and decoder utilize the BERT representation efficiently remains underexplored. Various methods have been proposed, e.g., transforming the encoder representation by adding the BERT representation to it [3], or using distillation to transfer knowledge from BERT to the NMT model [4]. The recently proposed BERT-fused model [5] employs attention mechanisms to fuse BERT representations with the encoder/decoder layers and achieves state-of-the-art results. Specifically, it introduces a BERT-encoder cross-attention module into each encoder layer and averages its output with that of the self-attention module, as illustrated in Fig. 1(a). However, we find that averaging the two outputs cannot fully exploit the BERT representation when more attention should be paid to it.
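
The averaging scheme being critiqued can be written schematically as follows (residual connections, layer normalization, and feed-forward sub-layers are omitted, and the shared model dimension is an assumption): the self-attention output and the BERT-encoder attention output are always mixed with a fixed 1:1 weight, no matter how informative the BERT representation is for a given position.

```python
import torch.nn as nn

class AveragedBertFusedLayer(nn.Module):
    """Schematic BERT-fused-style encoder layer that averages two attention outputs."""
    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.bert_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x, bert_repr):
        h_self, _ = self.self_attn(x, x, x)                   # attend over the encoder's own states
        h_bert, _ = self.bert_attn(x, bert_repr, bert_repr)   # attend over the BERT representation
        return 0.5 * (h_self + h_bert)                        # fixed 1:1 average of the two streams
```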

Thirdly, previous works keep the BERT parameters fixed during training to avoid catastrophic forgetting [4], [6]. Catastrophic forgetting refers to the problem that the knowledge acquired by BERT during pre-training is gradually lost during fine-tuning. Although catastrophic forgetting is a common problem that accompanies fine-tuning, it is exacerbated in BERT-enhanced NMT because NMT models take many more iterations to converge. However, giving up fine-tuning BERT because of the forgetting problem prevents the model from being fully optimized.
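
In implementation terms, keeping BERT fixed and later releasing it amounts to toggling gradient tracking on its parameters. The sketch below uses a stand-in encoder for BERT; which components are unfrozen in which phase of our strategy is specified in Section 2, so the phase comments here are only illustrative.

```python
import torch.nn as nn

# Stand-in for the pre-trained BERT encoder (hypothetical placeholder).
bert = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=12)

# Early phases: freeze BERT so its pre-trained knowledge is not overwritten
# while the randomly initialized NMT components are still far from converged.
for p in bert.parameters():
    p.requires_grad_(False)

# ... train the newly added components ...

# Later phase: unfreeze BERT so the whole model benefits from fine-tuning.
for p in bert.parameters():
    p.requires_grad_(True)
```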

To fill this research gap, we propose a BERT-enhanced NMT model named BERT-JAM, which stands for BERT-fused Joint-Attention Model. We seek to maximize the utilization of BERT for NMT in three ways. Firstly, we incorporate into each encoder/decoder layer a fusion module that composes BERT’s multi-layer representations into a fused representation. In contrast to the dynamic fusion method [3], we let the weights for combining the multi-layer representations be shared among different instances. Not only can our fusion method be applied to both the encoder and the decoder, but it also allows each encoder/decoder layer to have a consistent preference for certain BERT layers. Secondly, we propose a joint-attention module for integrating the fused BERT representation with the encoder and decoder layers, as shown in Fig. 1(b). Unlike the BERT-fused model, which uses separate attention modules (i.e., a cross-attention module and a self-attention module) whose outputs are averaged, the joint-attention module simultaneously performs the functions of the different attention modules and dynamically allocates attention between the BERT representation and the encoder/decoder representation. Thirdly, we train BERT-JAM with a three-phase optimization strategy that progressively unfreezes different parts of the model to overcome catastrophic forgetting. This strategy allows the model to benefit from the performance boost brought by fine-tuning BERT.
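
One way to picture such dynamic allocation, given here only as a schematic reading (the actual joint-attention module of BERT-JAM is defined in Section 2), is a single attention whose keys and values span both the layer's own states and the fused BERT representation, so the attention softmax divides its mass between the two sources instead of averaging two separate outputs. The shared model dimension is an assumption.

```python
import torch
import torch.nn as nn

class JointAttentionSketch(nn.Module):
    """Schematic joint attention over the layer's own states and the fused BERT representation."""
    def __init__(self, d_model: int = 512, nhead: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x, fused_bert):
        # Keys/values cover both sources; the softmax decides how much attention
        # each query position pays to the encoder/decoder states versus BERT.
        kv = torch.cat([x, fused_bert], dim=1)   # (batch, x_len + bert_len, d_model)
        out, _ = self.attn(x, kv, kv)
        return out
```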

We summarize the contributions of this paper as follows:

  • We propose a novel BERT-enhanced NMT model named BERT-JAM that is equipped with fusion modules and joint-attention modules. The fusion module composes BERT’s multi-layer representations into a fused representation and allows BERT-JAM to benefit from the linguistic features encoded by BERT’s intermediate layers. The joint-attention module integrates the functions of different attention modules and dynamically allocates attention between different representations.

  • We propose a three-phase optimization strategy for training BERT-JAM that overcomes the catastrophic forgetting problem during fine-tuning.

  • This is the first work to study how the performance of BERT-enhanced NMT models varies with the size of BERT. We find that, under constrained memory usage and latency bounds, deeper BERT models (more layers) are better at boosting translation performance than wider ones (larger hidden dimension).

  • We evaluate the proposed BERT-JAM model on several widely used translation tasks. Experimental results show that BERT-JAM achieves SOTA scores on multiple benchmarks, demonstrating the effectiveness of our method.

The rest of this paper is organized as follows. In Section 2, we detail our approach in terms of module description, model architecture, and optimization strategy. Section 3 introduces the experimental setups. In Section 4, we conduct comprehensive experiments with BERT-JAM and discuss the results. We give a review of related works in Section 5 and the conclusions are drawn in Section 6.

Section snippets

Approach

This section presents our proposed approach to boosting machine translation performance with BERT. We begin by introducing the background of BERT-enhanced NMT. Then, we introduce the two building blocks of our model, i.e., the joint-attention module and the fusion module. Next, we elaborate on the architecture of BERT-JAM. Finally, we describe the three-phase optimization strategy used to train the model.

Experimental setup

This section details the experimental setup in terms of data preparation, model configuration, training settings, and evaluation metrics. We implement our model on top of the Fairseq repository.1 The source code for our work is publicly available.2
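
Since translation quality is reported in BLEU and the reference list includes the sacrebleu paper by Post, a minimal scoring example with the sacrebleu package is shown below; whether the authors used this exact interface is an assumption, and the sentences are toy examples.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]    # system outputs (toy example)
references = [["the cat sat on the mat"]]  # one list of references per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```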

Experiments and results

This section introduces the experiments we conduct to evaluate our method. We first report the results on the benchmark translation tasks in both low-resource and high-resource scenarios. Then we work on the IWSLT’14 En-De task where we vary the size of BERT to study how it affects the translation performance. Finally, we conduct ablation studies to justify the design choices in our proposed model.

Neural machine translation

Neural machine translation (NMT), which aims to use artificial neural networks for translation between human languages, has drawn continuous research attention over the past decade. NMT was proposed in contrast to traditional statistical machine translation (SMT) such as phrase-based SMT. Instead of tuning many sub-components, which requires heavy engineering, NMT uses a simpler end-to-end model to generate translations word by word. The main difference that distinguishes an NMT model

Conclusions

In this work, we identify the research gap in maximizing the utilization of BERT for translation tasks and propose a BERT-enhanced NMT model called BERT-JAM to fill this gap in three ways. Firstly, we equip BERT-JAM with fusion modules for fusing BERT’s multi-layer representations, allowing the model to leverage the linguistic features encoded by BERT’s intermediate layers. Secondly, we propose a joint-attention module to allow the encoder/decoder layers to dynamically allocate attention between the BERT representation and their own representations. Thirdly, we train BERT-JAM with a three-phase optimization strategy that progressively unfreezes different components to overcome catastrophic forgetting during fine-tuning.

CRediT authorship contribution statement

Zhebin Zhang: Conceptualization, Methodology, Software, Writing - original draft. Sai Wu: Data curation, Investigation, Validation. Dawei Jiang: Writing - review & editing, Formal analysis, Visualization. Gang Chen: Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the Key Research and Development Program of Zhejiang Province of China (Grant No. 2020C01024); the Natural Science Foundation of Zhejiang Province of China (Grant No. LY18F020005); and the National Natural Science Foundation of China (Grant No. 61872315).


References (42)

  • J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language...
  • G. Jawahar, B. Sagot, D. Seddah, What Does BERT Learn about the Structure of Language?, in: Proceedings of the 57th...
  • R. Weng et al., Acquiring knowledge from pre-trained model to neural machine translation
  • J. Yang, M. Wang, H. Zhou, C. Zhao, W. Zhang, Y. Yu, L. Li, Towards Making the Most of BERT in Neural Machine...
  • J. Zhu, Y. Xia, L. Wu, D. He, T. Qin, W. Zhou, H. Li, T. Liu, Incorporating BERT into neural machine translation, in:...
  • I.J. Goodfellow, M. Mirza, D. Xiao, A. Courville, Y. Bengio, An Empirical Investigation of Catastrophic Forgetting in...
  • A. Vaswani et al., Attention is all you need
  • Y.N. Dauphin et al., Language Modeling with Gated Convolutional Networks (2017)
  • K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: 2016 IEEE Conference on Computer...
  • L.J. Ba, J.R. Kiros, G.E. Hinton, Layer...
  • J. Howard, S. Ruder, Universal Language Model Fine-tuning for Text Classification, in: Proceedings of the 56th Annual...
  • M. Cettolo et al., WIT3: Web Inventory of Transcribed and Translated Talks
  • P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C....
  • R. Sennrich et al., Neural machine translation of rare words with subword units
  • D.P. Kingma et al., Adam: A method for stochastic optimization
  • F. Wu et al., Pay less attention with lightweight and dynamic convolutions
  • K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in:...
  • M. Post, A Call for clarity in reporting BLEU scores, in: Proceedings of the 3rd Conference on Machine Translation:...
  • Y. Fan et al., Multi-branch Attentive Transformer (2020)
  • G. Zhao, X. Sun, J. Xu, Z. Zhang, L. Luo, MUSE: Parallel Multi-Scale Attention for Sequence to Sequence...
  • N. Iyer, V. Thejas, N. Kwatra, R. Ramjee, M. Sivathanu, Wide-minima Density Hypothesis and the Explore-Exploit Learning...

    Zhebin Zhang received his B.S. degree from Zhejiang University in 2016. Currently, he is pursuing the Ph.D. degree in computer science and technology at the same university. His research interests include natural language processing and machine learning, mainly focusing on machine translation.

    Sai Wu received his Ph.D. degree from National University of Singapore (NUS) in 2011 and is currently an associate professor at College of Computer Science, Zhejiang University. His research interests include distributed databases, cloud systems, indexing techniques and AI-powered databases. He has served as a Program Committee member for VLDB, ICDE, SIGMOD, KDD and CIKM.

    Dawei Jiang received his Ph.D. degree from Southeast University in 2008 and is currently a senior researcher at Zhejiang University. His research interests include distributed and parallel database systems, cloud data management, big data management and financial technology. He has published over 20 research articles in international journals and conference proceedings.

    Gang Chen received his Ph.D. degree from Zhejiang University in 1998. He is currently a professor at College of Computer Science, Zhejiang University. He has successfully led the investigation in research projects aimed at building China’s indigenous database management systems. His research interests range from relational database systems to largescale data management technologies. He is a member of ACM and a senior member of CCF.
