Abstract
The Transformer architecture has, since its conception, led to numerous breakthrough advancements in natural language processing. We investigate whether its success is primarily due to its capacity to learn generic language rules, or whether the architecture instead leverages memorized constructs without understanding their structure. We conduct a series of experiments in which we modify the training dataset to prevent the model from memorizing bigrams of words that are needed by the test data. We find that while such a model performs worse than its unrestricted counterpart, the results do not indicate that the Transformer's success is due solely to its memorization capacity. In a small qualitative analysis, we demonstrate that a human translator lacking the necessary terminological knowledge would likely struggle in a similar way.
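As an illustration of the dataset-modification idea described above, a minimal sketch of bigram-based filtering might look as follows. This is our own reconstruction under simplifying assumptions (whitespace tokenization, target-side bigrams only); the function names and the exact filtering criterion are hypothetical, not the authors' released code.

```python
# Hypothetical sketch of the bigram-filtering idea from the abstract:
# drop every training pair whose target side contains a word bigram
# that also occurs somewhere in the test data.

def word_bigrams(sentence):
    """Return the set of adjacent word pairs in a whitespace-tokenized sentence."""
    words = sentence.split()
    return {(a, b) for a, b in zip(words, words[1:])}

def filter_training_data(train_pairs, test_sentences):
    """Keep only training pairs sharing no target-side bigram with the test set."""
    test_bigrams = set()
    for sentence in test_sentences:
        test_bigrams |= word_bigrams(sentence)
    return [
        (src, tgt) for src, tgt in train_pairs
        if word_bigrams(tgt).isdisjoint(test_bigrams)
    ]

# Toy usage: the pair containing the test bigram "neural network" is removed.
train = [("ein Beispiel", "an example"), ("neuronales Netz", "a neural network")]
test = ["the neural network translates"]
print(filter_training_data(train, test))  # [('ein Beispiel', 'an example')]
```

A model trained on data filtered this way cannot simply reproduce test-set bigrams it has seen verbatim, so any remaining translation quality must come from more general patterns.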
Notes
- 1.
- 2. http://www.statmt.org/wmt13 to wmt20.
- 3. We use the common technique of subword units as described below, but we nevertheless decide to study the memorization effect on sentence syntax rather than on word formation.
- 4. We lemmatize and tag all our data using UDPipe [8]; a minimal usage sketch follows these notes.
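For illustration, lemmatization and tagging with the `ufal.udpipe` Python bindings (the UDPipe 1-style API; the cited UDPipe 2.0 prototype is typically accessed as a service, so this is an assumption about tooling) could look roughly like this. The model file name is hypothetical and must match whichever UD treebank model is actually used.

```python
# Minimal sketch: lemmatize and tag raw text with UDPipe via its
# official Python bindings (pip install ufal.udpipe). The model file
# name below is an assumption; any UD treebank model works the same way.
from ufal.udpipe import Model, Pipeline

model = Model.load("czech-pdt-ud-2.5.udpipe")  # hypothetical model path
if model is None:
    raise RuntimeError("cannot load UDPipe model")

# Tokenize the input and run the built-in tagger (no parsing); the
# CoNLL-U output contains a lemma and POS tags for every token.
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.NONE, "conllu")
print(pipeline.process("Transformery se učí překládat."))
```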
References
Brown, T., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota (Volume 1: Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423
WMT: ACL 2019 fourth conference on machine translation (WMT19), shared task: machine translation of news (2019). http://www.statmt.org/wmt19/translation-task.html
Kocmi, T., Popel, M., Bojar, O.: Announcing CzEng 2.0 parallel corpus with over 2 gigawords. arXiv preprint arXiv:2007.03006 (2020)
Popel, M., et al.: Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nat. Commun. 11(4381), 1–15 (2020). https://doi.org/10.1038/s41467-020-18073-9. https://www.nature.com/articles/s41467-020-18073-9
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019). https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Sennrich, R., Zhang, B.: Revisiting low-resource neural machine translation: a case study. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 211–221. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1021. https://aclanthology.org/P19-1021
Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, pp. 197–207. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/K18-2020. https://aclanthology.org/K18-2020
Varis, D., Bojar, O.: Sequence length is a domain: length-based overfitting in transformer models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.emnlp-main.650
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 6000–6010. Curran Associates, Inc. (2017). http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Acknowledgements
This work was supported by the grant 19-26934X (NEUREM3) of the Grant Agency of the Czech Republic.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Trebuňa, F., Szabová, K., Bojar, O. (2023). Searching for Reasons of Transformers' Success: Memorization vs Generalization. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol. 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40497-9
Online ISBN: 978-3-031-40498-6
eBook Packages: Computer Science, Computer Science (R0)