Transformers aftermath: current research and rising trends

Published: 22 March 2021

Abstract

Attention, particularly self-attention, is a standard in current NLP literature, but to achieve meaningful models, attention is not enough.
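
As context for the mechanism the abstract refers to, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention in the spirit of Vaswani et al.'s Transformer. The projection-matrix names and toy dimensions are illustrative assumptions, not details taken from the article.

import numpy as np

def scaled_dot_product_self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token embeddings; W_q/W_k/W_v: (d_model, d_k)
    # projection matrices (hypothetical names chosen for this sketch).
    Q = X @ W_q                      # queries
    K = X @ W_k                      # keys
    V = X @ W_v                      # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise token compatibilities
    # Softmax over the key dimension so each row of weights sums to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # attention-weighted mix of the values

# Toy usage: 4 tokens, 8-dimensional embeddings, one 8-dimensional head.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)

Each output row is a context vector: a mixture of the value vectors of all tokens, weighted by how strongly the corresponding query attends to each key.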


Published in

Communications of the ACM, Volume 64, Issue 4 (April 2021), 164 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/3458337

        Copyright © 2021 ACM


Publisher

Association for Computing Machinery, New York, NY, United States
