Abstract
The pervasive attention model is a sequence-to-sequence model that addresses the issue of source–target interaction in encoder–decoder models by jointly encoding the two sequences with a two-dimensional convolutional neural network. We investigate different design choices for each building block of Pervasive Attention and study their impact on the predictive strength of the model. These include the type of layer connectivity, the depth of the network, the filter sizes, and the source aggregation mechanism. Machine translation experiments on the IWSLT’14 De\(\rightarrow\)En, IWSLT’15 En\(\rightarrow\)Vi, WMT’16 En\(\rightarrow\)Ro and WMT’15 De\(\rightarrow\)En datasets show results competitive with state-of-the-art encoder–decoder models, outperforming Transformer models on three of the four tested datasets.
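To make the joint encoding concrete, the following is a minimal sketch of a pervasive-attention-style forward pass (PyTorch assumed; the dimensions, the single convolutional block, and the max-pooling aggregation are illustrative choices, not the exact architecture evaluated in the paper):

```python
import torch
import torch.nn as nn

B, S, T, d = 8, 20, 15, 128                           # batch, source len, target len, embed dim
vocab = 10000                                         # illustrative vocabulary size
src_emb = nn.Embedding(vocab, d)
tgt_emb = nn.Embedding(vocab, d)
conv = nn.Conv2d(2 * d, d, kernel_size=3, padding=1)  # a single 2D conv block, for brevity

src = torch.randint(0, vocab, (B, S))
tgt = torch.randint(0, vocab, (B, T))

# Build the joint grid: cell (i, j) concatenates the embedding of target
# token i with the embedding of source token j.
e_src = src_emb(src).unsqueeze(1).expand(B, T, S, d)  # (B, T, S, d)
e_tgt = tgt_emb(tgt).unsqueeze(2).expand(B, T, S, d)  # (B, T, S, d)
grid = torch.cat([e_tgt, e_src], dim=-1)              # (B, T, S, 2d)

# Convolve over the joint grid, then collapse the source axis so that each
# target position gets one feature vector; max-pooling is one of the
# aggregation mechanisms studied in the paper.
h = conv(grid.permute(0, 3, 1, 2))                    # (B, d, T, S)
h = h.max(dim=3).values                               # (B, d, T)
```

A full model stacks many such blocks and masks the convolutions along the target axis, so that position \(i\) only attends to target positions \(\le i\) during autoregressive decoding.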
Notes
Assuming the maps \(\text{F}_n\) surrounded by the residual connections maintain their inputs’ variance, this multiplier guarantees that the variance remains unchanged after the summation.
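As a worked check of this claim, assuming two independent summands of equal variance \(\sigma^2\) and the usual two-branch scaling \(\lambda = 1/\sqrt{2}\) (the specific multiplier is not restated in this note):
\[
\operatorname{Var}\!\big(\lambda\,(x + \text{F}_n(x))\big) = \lambda^2\big(\operatorname{Var}(x) + \operatorname{Var}(\text{F}_n(x))\big) = 2\lambda^2\sigma^2 = \sigma^2 .
\]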
In a convolutional block, the first depth-wise convolution has \(k^2d + d\) parameters, the second point-wise convolution has \(d^2+d\) parameters, and the feed-forward network first maps to \({\mathbb {R}}^{4d}\) with \(4d^2+4d\) parameters and then maps back to \({\mathbb {R}}^d\) with \(4d^2 + d\) parameters, for a total of \(k^2d + 9d^2 + 7d\) parameters per block.
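These counts can be checked mechanically; the following is a minimal sketch (PyTorch assumed; the layer construction is illustrative, not the authors’ code):

```python
import torch.nn as nn

def block_params(d: int, k: int) -> int:
    """Count the parameters of one convolutional block as described above."""
    depthwise = nn.Conv2d(d, d, kernel_size=k, groups=d, padding=k // 2)  # k^2*d + d
    pointwise = nn.Conv2d(d, d, kernel_size=1)                            # d^2 + d
    ffn = nn.Sequential(nn.Linear(d, 4 * d),                              # 4*d^2 + 4*d
                        nn.Linear(4 * d, d))                              # 4*d^2 + d
    total = sum(p.numel() for m in (depthwise, pointwise, ffn)
                for p in m.parameters())
    assert total == k ** 2 * d + 9 * d ** 2 + 7 * d                       # closed form
    return total

print(block_params(d=256, k=3))  # 593920
```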
Evaluated as the total number of generated tokens divided by the total number of reference tokens.
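For concreteness, a minimal sketch of this corpus-level length ratio (whitespace tokenization and the function name are illustrative):

```python
def length_ratio(hypotheses, references):
    """Corpus-level length ratio: total generated tokens / total reference tokens."""
    generated = sum(len(h.split()) for h in hypotheses)
    reference = sum(len(r.split()) for r in references)
    return generated / reference

# length_ratio(["a b c"], ["a b c d"]) -> 0.75
```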
References
Arivazhagan N, Cherry C, Macherey W, Chiu C-C, Yavuz S, Pang R, Li W, Raffel C (2019) Monotonic infinite lookback attention for simultaneous machine translation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp 1313–1323, Florence, Italy
Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. ArXiv preprint. arXiv:1607.06450
Bahar P, Brix C, Ney H (2018) Towards two-dimensional sequence to sequence model in neural machine translation. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp 3009–3015, Brussels, Belgium
Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, San Diego, CA, USA. 15pp
Cettolo M, Niehues J, Stüker S, Bentivogli L, Federico M (2014) Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In: Proceedings of the International Workshop on Spoken Language Translation, vol 57, Hanoi, Vietnam. 16pp
Chen T, Xu B, Zhang C, Guestrin C (2016) Training deep nets with sublinear memory cost. ArXiv preprint. arXiv:1604.06174
Chollet F (2017) Xception: deep learning with depthwise separable convolutions. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, pp 1800–1807
Dakwale P, Monz C (2017) Convolutional over recurrent encoder for neural machine translation. Prague Bull Math Linguist 108(1):37
Dauphin YN, Fan A, Auli M, Grangier D (2017) Language modeling with gated convolutional networks. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, pp 933–941, Sydney, NSW, Australia
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol 1 (Long and Short Papers), pp 4171–4186, Minneapolis, Minnesota
Elbayad M, Besacier L, Verbeek J (2018) Pervasive attention: 2D convolutional neural networks for sequence-to-sequence prediction. In: Proceedings of the 22nd Conference on Computational Natural Language Learning, pp 97–107, Brussels, Belgium
Elbayad M, Besacier L, Verbeek J (2020a) Efficient wait-k models for simultaneous machine translation. In: Proceedings of Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, pp 1461–1465
Elbayad M, Ustaszewski M, Esperança-Rodier E, Brunet-Manquat F, Verbeek J, Besacier L (2020b) Online versus offline NMT quality: An in-depth analysis on English-German and German-English. In: Proceedings of the 28th International Conference on Computational Linguistics, pp 5047–5058, Barcelona, Spain (Online)
Fonollosa JAR, Casas N, Costa-jussà MR (2019) Joint source–target self attention with locality constraints. ArXiv preprint. arXiv:1905.06596
Gehring J, Auli M, Grangier D, Yarats D, Dauphin YN (2017) Convolutional sequence to sequence learning. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, pp 1243–1252, Sydney, NSW, Australia
He H, Lin J (2016) Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 937–948, San Diego, California
He H, Boyd-Graber J, Daumé III H (2016a) Interpretese vs. translationese: The uniqueness of human strategies in simultaneous interpretation. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 971–976, San Diego, California
He K, Zhang X, Ren S, Sun J (2016b) Identity mappings in deep residual networks. In: Computer Vision—ECCV 2016—14th European Conference, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pp 630–645, Amsterdam, The Netherlands. Springer
He T, Tan X, Xia Y, He D, Qin T, Chen Z, Liu T (2018) Layer-wise coordination between encoder and decoder for neural machine translation. In: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS 2018, Montréal, Canada, pp 7955–7965
Hu B, Lu Z, Li H, Chen Q (2014) Convolutional neural network architectures for matching natural language sentences. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, pp 2042–2050
Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 4700–4708, Honolulu, HI, USA
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, volume 37 of JMLR Workshop and Conference Proceedings, pp 448–456, Lille, France
Kalchbrenner N, Danihelka I, Graves A (2016) Grid long short-term memory. In: 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings, San Juan, Puerto Rico. 15pp
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, San Diego, CA, USA. 15pp
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X: Papers, pp 79–86, Phuket, Thailand
Lee J, Mansimov E, Cho K (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp 1173–1182, Brussels, Belgium
Levy D, Wolf L (2017) Learning to align the source code to the compiled object code. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, volume 70 of Proceedings of Machine Learning Research, pp 2043–2051, Sydney, NSW, Australia
Luong M-T, Manning C (2015) Stanford neural machine translation systems for spoken language domains. In: Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign, pp 76–79, Da Nang, Vietnam
Ma M, Huang L, Xiong H, Zheng R, Liu K, Zheng B, Zhang C, He Z, Liu H, Li X, Wu H, Wang H (2019a) STACL: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp 3025–3036, Florence, Italy
Ma X, Zhou C, Li X, Neubig G, Hovy E (2019b) FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp 4282–4292, Hong Kong, China
Ma X, Pino JM, Cross J, Puzon L, Gu J (2020) Monotonic multihead attention. In: Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia. 11pp
Michel P, Levy O, Neubig G (2019) Are sixteen heads really better than one? In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS 2019, Vancouver, BC, Canada, pp 14014–14024
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp 807–814, Haifa, Israel
Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N, Grangier D, Auli M (2019) fairseq: A fast, extensible toolkit for sequence modeling. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pp 48–53, Minneapolis, Minnesota
Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318, Philadelphia, Pennsylvania, USA
Pleiss G, Chen D, Huang G, Li T, van der Maaten L, Weinberger KQ (2017) Memory-efficient implementation of DenseNets. ArXiv preprint. arXiv:1707.06990
Post M (2018) A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp 186–191, Brussels, Belgium
Raison M, Mazaré P-E, Das R, Bordes A (2018) Weaver: deep co-encoding of questions and documents for machine reading. ArXiv preprint. arXiv:1804.10490
Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 1715–1725, Berlin, Germany
Smith JR, Saint-Amand H, Plamada M, Koehn P, Callison-Burch C, Lopez A (2013) Dirt cheap web-scale parallel text from the Common Crawl. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 1374–1383, Sofia, Bulgaria
Srivastava RK, Greff K, Schmidhuber J (2015) Training very deep networks. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015. Montreal, Quebec, Canada, pp 2377–2385
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, pp 2818–2826
Tang G, Sennrich R, Nivre J (2018) An analysis of attention mechanisms: The case of word sense disambiguation in neural machine translation. In: Proceedings of the Third Conference on Machine Translation: Research Papers, pp 26–35, Brussels, Belgium
Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp 2214–2218, Istanbul, Turkey
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. Long Beach, CA, USA, pp 5998–6008
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, pp 3156–3164
Voita E, Talbot D, Moiseev F, Sennrich R, Titov I (2019) Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp 5797–5808, Florence, Italy
Wan S, Lan Y, Guo J, Xu J, Pang L, Cheng X (2016) A deep architecture for semantic matching with multiple positional sentence representations. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp 2835–2841, Phoenix, Arizona, USA
Wu Y, He K (2018) Group normalization. In: Computer Vision—ECCV 2018—15th European Conference, Proceedings, Part XIII, volume 11217 of Lecture Notes in Computer Science, pp 3–19, Munich, Germany. Springer
Wu X, Fan A, Baevski A, Dauphin YN, Auli M (2019) Pay less attention with lightweight and dynamic convolutions. In: Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA. 14pp
Cite this article
Elbayad, M., Besacier, L. & Verbeek, J. Joint source–target encoding with pervasive attention. Machine Translation 35, 637–659 (2021). https://doi.org/10.1007/s10590-021-09289-7