
Joint source–target encoding with pervasive attention

Machine Translation

Abstract

The pervasive attention model is a sequence-to-sequence model that addresses source–target interaction in encoder–decoder architectures by jointly encoding the two sequences with a two-dimensional convolutional neural network. We investigate different design choices for each building block of pervasive attention and study their impact on the predictive strength of the model, including the type of layer connectivity, network depth, filter size, and source aggregation mechanism. Machine translation experiments on the IWSLT’14 De\(\rightarrow\)En, IWSLT’15 En\(\rightarrow\)Vi, WMT’16 En\(\rightarrow\)Ro and WMT’15 De\(\rightarrow\)En datasets show results competitive with state-of-the-art encoder–decoder models, outperforming Transformer models on three of the four tested datasets.
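To illustrate the joint encoding, the following minimal PyTorch sketch builds the two-dimensional source–target grid, applies a single 2D convolution, and max-pools over the source axis to obtain one representation per target position. It is only an illustration of the idea under assumed dimensions: it omits the causal masking on the target axis, the residual or dense connectivity, and the deeper stacks and alternative aggregation mechanisms studied in the paper, and all names and sizes here are hypothetical rather than the authors' implementation.

    import torch
    import torch.nn as nn

    class PervasiveAttentionSketch(nn.Module):
        def __init__(self, src_vocab: int, tgt_vocab: int,
                     d_emb: int = 128, d_conv: int = 128, k: int = 3):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, d_emb)
            self.tgt_emb = nn.Embedding(tgt_vocab, d_emb)
            # One 2D convolution over the (target, source) grid of concatenated embeddings.
            self.conv = nn.Conv2d(2 * d_emb, d_conv, kernel_size=k, padding=k // 2)
            self.proj = nn.Linear(d_conv, tgt_vocab)

        def forward(self, src_tokens: torch.Tensor, tgt_tokens: torch.Tensor) -> torch.Tensor:
            s = self.src_emb(src_tokens)                  # (B, S, d_emb)
            t = self.tgt_emb(tgt_tokens)                  # (B, T, d_emb)
            B, S, _ = s.shape
            T = t.size(1)
            # Each grid cell (i, j) concatenates target embedding i with source embedding j.
            grid = torch.cat(
                [t.unsqueeze(2).expand(B, T, S, -1), s.unsqueeze(1).expand(B, T, S, -1)],
                dim=-1,
            )                                             # (B, T, S, 2*d_emb)
            h = self.conv(grid.permute(0, 3, 1, 2))       # (B, d_conv, T, S)
            # Aggregate over the source axis (max-pooling here) and predict the next target token.
            h, _ = h.max(dim=3)                           # (B, d_conv, T)
            return self.proj(h.transpose(1, 2))           # (B, T, tgt_vocab)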


Notes

  1. Assuming the maps \(\text{F}_n\) surrounded by residual connections preserve the variance of their inputs, this multiplier guarantees that the variance is unchanged after the summation (see the worked step after these notes).

  2. http://www.statmt.org/wmt16.

  3. http://www.statmt.org/wmt15.

  4. In a convolutional block, the first depth-wise convolution has \(k^2d + d\) parameters, the second point-wise convolution has \(d^2+d\) parameters, and the feed-forward network first maps to \({\mathbb {R}}^{4d}\) with \(4d^2+4d\) parameters and then maps back to \({\mathbb {R}}^d\) with \(4d^2 + d\) parameters (see the parameter-count sketch after these notes).

  5. Evaluated as the total number of generated tokens divided by the total number of reference tokens.
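To make note 1 concrete, the usual variance argument runs as follows; this is a standard derivation under the assumption that \(x\) and \(\text{F}_n(x)\) are uncorrelated with equal variance \(\sigma^2\), and the resulting multiplier is the conventional choice under these assumptions rather than a value quoted from the paper. Writing the scaled residual sum as \(y = \lambda\,(x + \text{F}_n(x))\),

\[
\operatorname{Var}(y) = \lambda^2\bigl(\operatorname{Var}(x) + \operatorname{Var}(\text{F}_n(x))\bigr) = 2\lambda^2\sigma^2,
\]

so requiring \(\operatorname{Var}(y) = \sigma^2\) gives \(\lambda = 1/\sqrt{2}\).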
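As a quick sanity check of the counts in note 4, the short Python sketch below adds them up for a given kernel size \(k\) and model dimension \(d\); the function name and the example values of \(k\) and \(d\) are illustrative choices made here, not taken from the paper.

    def conv_block_params(k: int, d: int) -> int:
        """Total parameters of one convolutional block as broken down in note 4."""
        depthwise = k * k * d + d      # depth-wise convolution: k^2 * d weights + d biases
        pointwise = d * d + d          # point-wise convolution: d^2 weights + d biases
        ffn_up = 4 * d * d + 4 * d     # feed-forward expansion to R^{4d}
        ffn_down = 4 * d * d + d       # feed-forward projection back to R^d
        return depthwise + pointwise + ffn_up + ffn_down

    # For example, k = 3 and d = 512 give 2,367,488 parameters per block.
    print(conv_block_params(3, 512))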


Author information


Corresponding author

Correspondence to Maha Elbayad.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Elbayad, M., Besacier, L. & Verbeek, J. Joint source–target encoding with pervasive attention. Machine Translation 35, 637–659 (2021). https://doi.org/10.1007/s10590-021-09289-7
