Abstract
The Transformer architecture has, since its conception, led to numerous breakthrough advancements in natural language processing. We investigate whether its success is primarily due to its capacity to learn generic language rules, or whether the architecture instead leverages memorized constructs without understanding their structure. We conduct a series of experiments in which we modify the training dataset to prevent the model from memorizing bigrams of words that are needed by the test data. We find that while such a model performs worse than its unrestricted counterpart, the results do not indicate that the Transformer's success is due solely to its memorization capacity. In a small qualitative analysis, we demonstrate that a human translator lacking the necessary terminological knowledge would likely struggle in a similar way.
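As an illustration of the dataset-modification idea described above, a minimal sketch of bigram-based filtering might look as follows. This is our own reconstruction under simplifying assumptions (whitespace tokenization, target-side bigrams only); the function names and the exact filtering criterion are hypothetical, not the authors' released code.

```python
# Hypothetical sketch of the bigram-filtering idea from the abstract:
# drop every training pair whose target side contains a word bigram
# that also occurs somewhere in the test data.

def word_bigrams(sentence):
    """Return the set of adjacent word pairs in a whitespace-tokenized sentence."""
    words = sentence.split()
    return {(a, b) for a, b in zip(words, words[1:])}

def filter_training_data(train_pairs, test_sentences):
    """Keep only training pairs sharing no target-side bigram with the test set."""
    test_bigrams = set()
    for sentence in test_sentences:
        test_bigrams |= word_bigrams(sentence)
    return [
        (src, tgt) for src, tgt in train_pairs
        if word_bigrams(tgt).isdisjoint(test_bigrams)
    ]

# Toy usage: the pair containing the test bigram "neural network" is removed.
train = [("ein Beispiel", "an example"), ("neuronales Netz", "a neural network")]
test = ["the neural network translates"]
print(filter_training_data(train, test))  # [('ein Beispiel', 'an example')]
```

A model trained on data filtered this way cannot simply reproduce test-set bigrams it has seen verbatim, so any remaining translation quality must come from more general patterns.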
Notes
- 1.
- 2. http://www.statmt.org/wmt13 to wmt20.
- 3. We use the common technique of subword units as described below, but we nevertheless decide to study the memorization effect on sentence syntax rather than on word formation.
- 4. We lemmatize and tag all our data using UDPipe [8]; a minimal usage sketch follows these notes.
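For illustration, lemmatization and tagging with the `ufal.udpipe` Python bindings (the UDPipe 1-style API; the cited UDPipe 2.0 prototype is typically accessed as a service, so this is an assumption about tooling) could look roughly like this. The model file name is hypothetical and must match whichever UD treebank model is actually used.

```python
# Minimal sketch: lemmatize and tag raw text with UDPipe via its
# official Python bindings (pip install ufal.udpipe). The model file
# name below is an assumption; any UD treebank model works the same way.
from ufal.udpipe import Model, Pipeline

model = Model.load("czech-pdt-ud-2.5.udpipe")  # hypothetical model path
if model is None:
    raise RuntimeError("cannot load UDPipe model")

# Tokenize the input and run the built-in tagger (no parsing); the
# CoNLL-U output contains a lemma and POS tags for every token.
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.NONE, "conllu")
print(pipeline.process("Transformery se učí překládat."))
```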
References
Brown, T., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota (Volume 1: Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423
WMT: ACL 2019 fourth conference on machine translation (WMT19), shared task: machine translation of news (2019). http://www.statmt.org/wmt19/translation-task.html
Kocmi, T., Popel, M., Bojar, O.: Announcing CzEng 2.0 parallel corpus with over 2 gigawords. arXiv preprint arXiv:2007.03006 (2020)
Popel, M., et al.: Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nat. Commun. 11(4381), 1–15 (2020). https://doi.org/10.1038/s41467-020-18073-9. https://www.nature.com/articles/s41467-020-18073-9
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019). https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Sennrich, R., Zhang, B.: Revisiting low-resource neural machine translation: a case study. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 211–221. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1021. https://aclanthology.org/P19-1021
Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, pp. 197–207. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/K18-2020. https://aclanthology.org/K18-2020
Varis, D., Bojar, O.: Sequence length is a domain: length-based overfitting in transformer models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.emnlp-main.650
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 6000–6010. Curran Associates, Inc. (2017). http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Acknowledgements
This work was supported by the grant 19-26934X (NEUREM3) of the Grant Agency of the Czech Republic.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Trebuňa, F., Szabová, K., Bojar, O. (2023). Searching for Reasons of Transformers' Success: Memorization vs Generalization. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol. 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40497-9
Online ISBN: 978-3-031-40498-6
eBook Packages: Computer Science, Computer Science (R0)