
Searching for Reasons of Transformers’ Success: Memorization vs Generalization

  • Conference paper

Text, Speech, and Dialogue (TSD 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14102)


Abstract

The Transformer architecture has, since its conception, led to numerous breakthrough advancements in natural language processing. We are interested in finding out whether its success is primarily due to its capacity to learn generic language rules, or whether the architecture leverages memorized constructs without understanding their structure. We conduct a series of experiments in which we modify the training dataset to prevent the model from memorizing bigrams of words that are needed by the test data. We find that while such a model performs worse than its unrestricted counterpart, the findings do not indicate that the Transformer’s success is due solely to its memorization capacity. In a small qualitative analysis, we demonstrate that a human translator lacking the necessary terminological knowledge would likely struggle in a similar way.
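The abstract does not spell out how the training data were restricted; purely as an illustrative sketch (not the authors’ actual pipeline; the function names and the naive whitespace tokenization are assumptions made here), removing training pairs whose target side shares a word bigram with the test data could look roughly like this:

```python
def bigrams(tokens):
    """Set of consecutive word pairs in a token list."""
    return set(zip(tokens, tokens[1:]))


def filter_training_pairs(train_pairs, test_targets):
    """Drop training pairs whose target side shares a bigram with the test targets.

    Hypothetical preprocessing: lowercased whitespace tokenization only;
    the paper's actual setup (lemmatization, subword units) is not reproduced here.
    """
    forbidden = set()
    for sentence in test_targets:
        forbidden |= bigrams(sentence.lower().split())
    return [
        (src, tgt)
        for src, tgt in train_pairs
        if bigrams(tgt.lower().split()).isdisjoint(forbidden)
    ]


if __name__ == "__main__":
    train = [
        ("Ahoj světe .", "Hello world ."),
        ("Neuronové sítě se učí .", "Neural networks learn ."),
    ]
    test = ["Neural networks generalize well ."]
    # The second pair shares the bigram ("neural", "networks") with the test set and is dropped.
    print(filter_training_pairs(train, test))
```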


Notes

  1. [1, 2, 6].

  2. http://www.statmt.org/wmt13 to wmt20.

  3. We use the common technique of subword units as described below, but we nevertheless decide to study the memorization effect on sentence syntax rather than on word formation.

  4. We lemmatize and tag all our data using UDPipe [8].
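As a minimal sketch of this lemmatization and tagging step, assuming the publicly documented UDPipe REST service at lindat.mff.cuni.cz (the paper does not say how UDPipe was invoked, and the model name below is an assumption rather than the authors’ setting), lemmas and POS tags could be obtained as follows:

```python
import requests

# Public UDPipe web service endpoint (LINDAT/CLARIAH-CZ).
UDPIPE_API = "https://lindat.mff.cuni.cz/services/udpipe/api/process"


def lemmatize_and_tag(text, model="czech"):
    """Return (form, lemma, UPOS) triples for `text` via the UDPipe REST API.

    The model name is an assumption; the service expects one of its
    published model names, listed by its /models endpoint.
    """
    response = requests.post(
        UDPIPE_API,
        data={"data": text, "model": model,
              "tokenizer": "", "tagger": "", "output": "conllu"},
    )
    response.raise_for_status()
    conllu = response.json()["result"]
    triples = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and CoNLL-U comments
        cols = line.split("\t")
        if cols[0].isdigit():  # skip multiword-token ranges such as "1-2"
            # CoNLL-U columns: ID FORM LEMMA UPOS ...
            triples.append((cols[1], cols[2], cols[3]))
    return triples


print(lemmatize_and_tag("Transformery se učí jazyková pravidla."))
```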

References

  1. Brown, T., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

  2. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota (Volume 1: Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423

  3. W Foundation: ACL 2019 fourth conference on machine translation (WMT19), shared task: machine translation of news. http://www.statmt.org/wmt19/translation-task.html

  4. Kocmi, T., Popel, M., Bojar, O.: Announcing CzEng 2.0 parallel corpus with over 2 gigawords. arXiv preprint arXiv:2007.03006 (2020)

  5. Popel, M., et al.: Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals. Nat. Commun. 11(4381), 1–15 (2020). https://doi.org/10.1038/s41467-020-18073-9. https://www.nature.com/articles/s41467-020-18073-9

  6. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners (2019). https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  7. Sennrich, R., Zhang, B.: Revisiting low-resource neural machine translation: a case study. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 211–221. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/P19-1021. https://aclanthology.org/P19-1021

  8. Straka, M.: UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, pp. 197–207. Association for Computational Linguistics (2018). https://doi.org/10.18653/v1/K18-2020. https://aclanthology.org/K18-2020

  9. Varis, D., Bojar, O.: Sequence length is a domain: length-based overfitting in transformer models. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.emnlp-main.650

  10. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 6000–6010. Curran Associates, Inc. (2017). http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Download references

Acknowledgements

This work was supported by the grant 19-26934X (NEUREM3) of the Grant Agency of the Czech Republic.

Author information

Corresponding author

Correspondence to Kristína Szabová.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Trebuňa, F., Szabová, K., Bojar, O. (2023). Searching for Reasons of Transformers’ Success: Memorization vs Generalization. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science (LNAI), vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_3

  • DOI: https://doi.org/10.1007/978-3-031-40498-6_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer Science (R0)
