Abstract
Attention, particularly self-attention, has become a standard component in the current NLP literature, but attention alone is not enough to build meaningful models.
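For readers unfamiliar with the term, the self-attention referred to here is the scaled dot-product attention of "Attention is all you need" (Vaswani et al.). The sketch below is a minimal, illustrative single-head implementation in NumPy; the function name, toy dimensions, and random weights are assumptions made for this example and are not drawn from the article.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project inputs to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # pairwise similarities, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                          # each position is a weighted mix of all values

# toy usage (hypothetical shapes): 4 tokens, width 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)             # shape (4, 8)
```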