Transformers: “The End of History” for Natural Language Processing?

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12977)

Abstract

Recent advances in neural architectures, such as the Transformer, coupled with the emergence of large-scale pre-trained models such as BERT, have revolutionized the field of Natural Language Processing (NLP), pushing the state of the art for a number of NLP tasks. A rich family of variations of these models has been proposed, such as RoBERTa, ALBERT, and XLNet, but fundamentally, they all remain limited in their ability to model certain kinds of information, and they cannot cope with certain information sources that pre-existing models handled with ease. Thus, here we aim to shed light on some important theoretical limitations of pre-trained BERT-style models that are inherent in the general Transformer architecture. First, we demonstrate in practice on two general types of tasks—segmentation and segment labeling—and on four datasets that these limitations are indeed harmful and that addressing them, even in some very simple and naïve ways, can yield sizable improvements over vanilla RoBERTa and XLNet models. Then, we offer a more general discussion on desiderata for future additions to the Transformer architecture that would increase its expressiveness, which we hope could help in the design of the next generation of deep NLP architectures.
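
The segmentation and segment-labeling experiments mentioned in the abstract are typically cast as token-level classification over subword sequences. As a purely illustrative sketch (not the authors' code), the snippet below shows what a vanilla RoBERTa baseline for segment labeling looks like with the Hugging Face transformers library (see note 5); the checkpoint name, the BIO-style tag set, and the example sentence are assumptions, and the classification head is randomly initialized until the model is fine-tuned on an actual task.

    # Illustrative sketch only: a vanilla RoBERTa token classifier for
    # BIO-style segment labeling. The tag set and checkpoint are assumptions;
    # the classification head is untrained until fine-tuned on a real dataset.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["O", "B-SEG", "I-SEG"]  # hypothetical tag set for one segment type
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForTokenClassification.from_pretrained(
        "roberta-base", num_labels=len(labels)
    )

    text = "The quick brown fox jumps over the lazy dog."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, num_labels)

    # Pick the highest-scoring tag for every subword token.
    predicted_tags = [labels[i] for i in logits.argmax(dim=-1)[0].tolist()]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    for token, tag in zip(tokens, predicted_tags):
        print(token, tag)

Note that this baseline scores each token's tag independently; one natural direction suggested by the cited CRF literature [17, 28] is to model dependencies between adjacent tags on top of such per-token scores.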


Notes

  1. A notable earlier promising attempt was ELMo [21], but it became largely outdated in less than a year.

  2. http://en.wikipedia.org/wiki/The_End_of_History_and_the_Last_Man.

  3. Some solutions have been proposed, such as Longformer [3], Performer [4], Linformer [33], Linear Transformer [15], and Big Bird [35].

  4. The official task webpage: http://propaganda.qcri.org/semeval2020-task11/.

  5. http://github.com/huggingface/transformers.

References

  1. Arkhipov, M., Trofimova, M., Kuratov, Y., Sorokin, A.: Tuning multilingual transformers for language-specific named entity recognition. In: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2019), pp. 89–93. Florence, Italy (2019)

  2. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017), pp. 546–555. Vancouver, Canada (2017)

  3. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint (2020)

  4. Choromanski, K., et al.: Rethinking attention with performers. In: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021) (2021)

  5. Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does BERT look at? An analysis of BERT’s attention. arXiv preprint (2019)

  6. Da San Martino, G., Barrón-Cedeño, A., Wachsmuth, H., Petrov, R., Nakov, P.: SemEval-2020 task 11: detection of propaganda techniques in news articles. In: Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval 2020), Barcelona, Spain (2020)

  7. Da San Martino, G., Yu, S., Barrón-Cedeño, A., Petrov, R., Nakov, P.: Fine-grained analysis of propaganda in news article. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pp. 5636–5646. Hong Kong, China (2019)

  8. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., Salakhutdinov, R.: Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pp. 2978–2988. Florence, Italy (2019)

  9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pp. 4171–4186. Minneapolis, MN, USA (2019)

  10. Durrani, N., Dalvi, F., Sajjad, H., Belinkov, Y., Nakov, P.: One size does not fit all: comparing NMT representations of different granularities. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pp. 1504–1516. Minneapolis, MN, USA (2019)

  11. Ettinger, A.: What BERT is not: lessons from a new suite of psycholinguistic diagnostics for language models. Trans. Assoc. Comput. Linguist. 8, 34–48 (2020)

  12. Goldberg, Y.: Assessing BERT’s syntactic abilities. arXiv preprint (2019)

  13. Jawahar, G., Sagot, B., Seddah, D.: What does BERT learn about the structure of language? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pp. 3651–3657. Florence, Italy (2019)

  14. Jin, D., Jin, Z., Zhou, J.T., Szolovits, P.: Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020), pp. 8018–8025 (2020)

  15. Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are RNNs: fast autoregressive transformers with linear attention. In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020), pp. 5156–5165 (2020)

  16. Kovaleva, O., Romanov, A., Rogers, A., Rumshisky, A.: Revealing the dark secrets of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pp. 4365–4374. Hong Kong, China (2019)

  17. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pp. 282–289. Williamstown, MA, USA (2001)

  18. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint (2019)

  19. Liu, N.F., Gardner, M., Belinkov, Y., Peters, M.E., Smith, N.A.: Linguistic knowledge and transferability of contextual representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), pp. 1073–1094. Minneapolis, MN, USA (2019)

  20. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint (2019)

  21. Peters, M., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pp. 2227–2237. New Orleans, LA, USA (2018)

  22. Peters, M.E., et al.: Knowledge enhanced contextual word representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pp. 43–54. Hong Kong, China (2019)

  23. Popel, M., Bojar, O.: Training tips for the transformer model. Prague Bull. Math. Linguist. 110(1), 43–70 (2018)

  24. Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)

  25. Ratinov, L.A., Roth, D.: Design challenges and misconceptions in named entity recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009), pp. 147–155. Boulder, CO, USA (2009)

  26. Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2020)

  27. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint (2019)

  28. Souza, F., Nogueira, R., Lotufo, R.: Portuguese named entity recognition using BERT-CRF. arXiv preprint (2019)

  29. Sun, L., et al.: Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT. arXiv preprint (2020)

  30. Tenney, I., et al.: What do you learn from context? Probing for sentence structure in contextualized word representations. arXiv preprint (2019)

  31. Vaswani, A., et al.: Attention is all you need. arXiv preprint (2017)

  32. Wallace, E., Wang, Y., Li, S., Singh, S., Gardner, M.: Do NLP models know numbers? Probing numeracy in embeddings. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pp. 5307–5315. Hong Kong, China (2019)

  33. Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: self-attention with linear complexity. arXiv preprint (2020)

  34. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 5753–5763 (2019)

  35. Zaheer, M., et al.: Big Bird: transformers for longer sequences. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS 2020) (2020)


Acknowledgments

Anton Chernyavskiy and Dmitry Ilvovsky performed this research within the framework of the HSE University Basic Research Program.

Preslav Nakov contributed as part of the Tanbih mega-project (http://tanbih.qcri.org/), which is developed at the Qatar Computing Research Institute, HBKU, and aims to limit the impact of “fake news,” propaganda, and media bias by making users aware of what they are reading.

Author information

Correspondence to Anton Chernyavskiy.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Chernyavskiy, A., Ilvovsky, D., Nakov, P. (2021). Transformers: “The End of History” for Natural Language Processing? In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds.) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol. 12977. Springer, Cham. https://doi.org/10.1007/978-3-030-86523-8_41


  • DOI: https://doi.org/10.1007/978-3-030-86523-8_41

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86522-1

  • Online ISBN: 978-3-030-86523-8

  • eBook Packages: Computer Science (R0)
