From static to dynamic word representations: a survey

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

In the history of natural language processing (NLP), the representation of words has always been a central research topic. In this survey, we provide a comprehensive typology of word representation models from a novel perspective: the development from static to dynamic embeddings effectively addresses the polysemy problem, which has long been a major challenge in the field. The survey then covers the main evaluation metrics and applications of these word embeddings, and further discusses the development of word embeddings from static to dynamic in cross-lingual scenarios. Finally, we point out open issues and directions for future work.
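
To make the static-to-dynamic contrast concrete, the sketch below (ours, not taken from the survey) compares the two behaviours on the polysemous word "bank": a dynamic model produces a different vector for each context, whereas a static lookup table assigns one vector per word type. It assumes the HuggingFace transformers package and the bert-base-uncased checkpoint, both chosen purely for illustration:

    import torch
    from transformers import AutoModel, AutoTokenizer

    # Any contextualized encoder would do; BERT is used here only as an example.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def dynamic_embedding(sentence, word):
        # The vector returned for `word` depends on the entire sentence.
        inputs = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        position = inputs.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
        return hidden[position]

    v1 = dynamic_embedding("he sat on the bank of the river", "bank")
    v2 = dynamic_embedding("she deposited the cheque at the bank", "bank")
    # The two "bank" vectors differ because their contexts differ; a static
    # lookup table would assign both occurrences the identical row vector.
    print(torch.cosine_similarity(v1, v2, dim=0).item())

A cosine similarity noticeably below 1.0 between the two occurrences is precisely the context sensitivity that lets dynamic embeddings separate the senses of a polysemous word.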

Notes

  1. These embeddings are contextualized, or dynamic, as opposed to the traditional static ones.

  2. https://wordnet.princeton.edu/.

  3. Please refer to [27] for a more detailed comparison and analysis of these distributional representation models.

  4. We will use \(\boldsymbol{C}(w)\) to denote the distributed embedding of word w in the rest of this paper (a toy lookup-table sketch of this notation follows these notes).

  5. Please refer to the paper [87] for detailed results.

  6. Please refer to the paper [90] for a detailed description of the pre-processing techniques.

  7. Such functional tokens are also used by GPT, but they are only introduced during fine-tuning.

  8. See https://github.com/thunlp/PLMpapers and https://github.com/cedrickchee/awesome-bert-nlp for the latest progress on dynamic word representations.

  9. Please refer to the paper [124] for implementation details.
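
As referenced in note 4, the following toy sketch (ours; vocabulary and embedding dimension are arbitrary illustrations) shows the static reading of \(\boldsymbol{C}(w)\): a single row lookup in an embedding matrix, independent of context:

    import numpy as np

    vocab = {"the": 0, "bank": 1, "river": 2}   # toy vocabulary
    dim = 4                                     # illustrative dimension
    C = np.random.randn(len(vocab), dim)        # embedding matrix, one row per word type

    def C_w(word):
        # C(w): the distributed embedding of word w; a context-free lookup.
        return C[vocab[word]]

    # Every occurrence of "bank" maps to the same vector, whatever the sentence.
    assert np.array_equal(C_w("bank"), C_w("bank"))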

References

  1. Almuhareb A (2006) Attributes in lexical acquisition. PhD thesis, University of Essex

  2. Artetxe M, Ruder S, Yogatama D (2019) On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856

  3. Bakarov A (2018) A survey of word embeddings evaluation methods. arXiv preprint arXiv:1801.09536

  4. Baroni M, Evert S, Lenci A (2008) Bridging the gap between semantic theory and computational simulations. In: Proc. of the esslli workshop on distributional lexical semantic. FOLLI, Hamburg

  5. Baroni M, Murphy B, Barbu E, Poesio M (2010) Strudel: a corpus-based semantic model based on properties and types. Cogn Sci 34:222–254

  6. Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Baltimore, Maryland, pp 238–247. https://doi.org/10.3115/v1/P14-1023. https://www.aclweb.org/anthology/P14-1023

  7. Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155

  8. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  9. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146

  10. Bowman SR, Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326

  11. Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479

  12. Bruni E, Tran NK, Baroni M (2014) Multimodal distributional semantics. J Artif Intell Res 49:1–47

  13. Chen D, Manning C (2014) A fast and accurate dependency parser using neural networks. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 740–750 https://doi.org/10.3115/v1/D14-1082. https://www.aclweb.org/anthology/D14-1082

  14. Chen X, Cardie C (2018) Unsupervised multilingual word embeddings. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, Belgium, pp 261–270. https://doi.org/10.18653/v1/D18-1024. https://www.aclweb.org/anthology/D18-1024

  15. Chen X, Liu Z, Sun M (2014) A unified model for word sense representation and disambiguation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1025–1035. https://doi.org/10.3115/v1/D14-1110. https://www.aclweb.org/anthology/D14-1110

  16. Cho K, van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder–decoder approaches. In: Proceedings of SSST-8, eighth workshop on syntax, semantics and structure in statistical translation. Association for Computational Linguistics, Doha, Qatar, pp 103–111. https://doi.org/10.3115/v1/W14-4012. https://www.aclweb.org/anthology/W14-4012

  17. Clark K, Luong MT, Manning CD, Le QV (2018) Semi-supervised sequence modeling with cross-view training. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, Belgium, pp 1914–1925. https://doi.org/10.18653/v1/D18-1217. https://www.aclweb.org/anthology/D18-1217

  18. Clark K, Khandelwal U, Levy O, Manning CD (2019a) What does BERT look at? An analysis of BERT’s attention. In: Proceedings of the 2019 ACL Workshop BlackboxNLP: analyzing and interpreting neural networks for NLP. Association for Computational Linguistics, Florence, Italy, pp 276–286. https://doi.org/10.18653/v1/W19-4828. https://www.aclweb.org/anthology/W19-4828

  19. Clark K, Luong MT, Khandelwal U, Manning CD, Le QV (2019b) BAM! born-again multi-task networks for natural language understanding. In: Proc. of ACL, pp 5931–5937

  20. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537

  21. Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116

  22. Cui Y, Che W, Liu T, Qin B, Yang Z, Wang S, Hu G (2019) Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101

  23. Dagan I, Pereira F, Lee L (1994) Similarity-based estimation of word cooccurrence probabilities. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Las Cruces, New Mexico, USA, pp 272–278. https://doi.org/10.3115/981732.981770

  24. Dai Z, Yang Z, Yang Y, Cohen WW, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860

  25. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407

  26. Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  27. Dinu G, Lapata M (2010) Measuring distributional similarity in context. In: Proc. of EMNLP, pp 1162–1172

  28. Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1(3):211–218

  29. Faruqui M, Dyer C (2014) Improving vector space word representations using multilingual correlation. In: Proceedings of the 14th conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Gothenburg, Sweden, pp 462–471. https://doi.org/10.3115/v1/E14-1049. https://www.aclweb.org/anthology/E14-1049

  30. Fei-Fei L (2006) Knowledge transfer in learning to recognize visual objects classes. In: International Conference on Development and Learning. Department of Psychological and Brain Sciences, Indiana University, pp 1–8

  31. Fink M (2005) Object classification from a single example utilizing class relevance metrics. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems. MIT Press, pp 449–456. http://papers.nips.cc/paper/2576-object-classification-from-a-single-example-utilizing-class-relevance-metrics.pdf

  32. Finkelstein L, Gabrilovich E, Matias Y, Rivlin E, Solan Z, Wolfman G, Ruppin E (2001) Placing search in context: the concept revisited. ACM Trans Inf Syst. https://doi.org/10.1145/503104.503110

  33. Firth JR (1957) A synopsis of linguistic theory 1930–1955. In: Studies in linguistic analysis (special volume of the Philological Society), vol 1952–1959. The Philological Society, Oxford, pp 1–32. https://www.bibsonomy.org/bibtex/25e3d6c72cdd123a638f71886d78f3c1e/brightbyte

  34. Gao B, Bian J, Liu TY (2014) WordRep: a benchmark for research on learning word representations. arXiv preprint arXiv:1407.1640

  35. Gerz D, Vulić I, Hill F, Reichart R, Korhonen A (2016) SimVerb-3500: A large-scale evaluation set of verb similarity. In: Proceedings of the 2016 conference on empirical methods in natural language processing. Association for Computational Linguistics, Austin, Texas, pp 2173–2182. https://doi.org/10.18653/v1/D16-1235. https://www.aclweb.org/anthology/D16-1235

  36. Ghannay S, Favre B, Estève Y, Camelin N (2016) Word embedding evaluation and combination. In: Proceedings of the tenth international conference on language resources and evaluation (LREC'16). Portorož, Slovenia, pp 300–305. https://www.aclweb.org/anthology/L16-1046

  37. Gladkova A, Drozd A (2016) Intrinsic evaluations of word embeddings: what can we do better? In: Proceedings of the 1st workshop on evaluating vector-space representations for NLP. Association for Computational Linguistics, Berlin, Germany, pp 36–42. https://doi.org/10.18653/v1/W16-2507. https://www.aclweb.org/anthology/W16-2507

  38. Gladkova A, Drozd A, Matsuoka S (2016) Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In: Proceedings of the NAACL student research workshop. Association for Computational Linguistics, San Diego, California, pp 8–15. https://doi.org/10.18653/v1/N16-2002. https://www.aclweb.org/anthology/N16-2002

  39. Golovanov S, Kurbanov R, Nikolenko S, Truskovskyi K, Tselousov A, Wolf T (2019) Large-scale transfer learning for natural language generation. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 6053–6058. https://doi.org/10.18653/v1/P19-1608. https://www.aclweb.org/anthology/P19-1608

  40. Greenberg C, Demberg V, Sayeed A (2015) Verb polysemy and frequency effects in thematic fit modeling. In: Proceedings of the 6th workshop on cognitive modeling and computational linguistics. Association for Computational Linguistics, Denver, Colorado, pp 48–57. https://doi.org/10.3115/v1/W15-1106. https://www.aclweb.org/anthology/W15-1106

  41. Guo J, Che W, Wang H, Liu T (2014) Learning sense-specific word embeddings by exploiting bilingual resources. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 497–507. https://www.aclweb.org/anthology/C14-1048

  42. Guo J, Che W, Yarowsky D, Wang H, Liu T (2015) Cross-lingual dependency parsing based on distributed representations. In: Proc. of ACL and IJCNLP, pp 1234–1244

  43. Guo J, Che W, Yarowsky D, Wang H, Liu T (2016a) A distributed representation-based framework for cross-lingual transfer parsing. J Artif Int Res 55(1):995–1023

  44. Guo J, Che W, Yarowsky D, Wang H, Liu T (2016b) A representation learning framework for multi-source transfer parsing. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, AAAI’16. AAAI Press, Phoenix, Arizona, pp 2734–2740.

  45. Hermann KM, Blunsom P (2014) Multilingual models for compositional distributed semantics. In: Proc. of ACL, pp 58–68

  46. Hewitt J, Manning CD (2019) A structural probe for finding syntax in word representations. In: Proc. of NAACL, pp 4129–4138, https://doi.org/10.18653/v1/N19-1419

  47. Heyman G, Verreet B, Vulić I, Moens MF (2019) Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: human language technologies, vol 1 (long and short papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 1890–1902. https://doi.org/10.18653/v1/N19-1188. https://www.aclweb.org/anthology/N19-1188

  48. Hou Y, Zhou Z, Liu Y, Wang N, Che W, Liu H, Liu T (2019) Few-shot sequence labeling with label dependency transfer. arXiv preprint arXiv:1906.08711

  49. Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics (vol 1: long papers), Association for Computational Linguistics, Melbourne, Australia, pp 328–339. https://doi.org/10.18653/v1/P18-1031. https://www.aclweb.org/anthology/P18-1031

  50. Huang E, Socher R, Manning C, Ng A (2012) Improving word representations via global context and multiple word prototypes. In: Proceedings of the 56th annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 873–882

  51. Huang F, Yates A (2009) Distributional representations for handling sparsity in supervised sequence-labeling. In: Proc. of ACL and IJCNLP, pp 495–503

  52. Iacobacci I, Pilehvar MT, Navigli R (2016) Embeddings for word sense disambiguation: an evaluation study. In: Proc. of ACL, pp 897–907

  53. Jarmasz M, Szpakowicz S (2003) Roget’s thesaurus and semantic similarity. In: Proc. of RANLP, pp 212–219

  54. Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529

  55. Joulin A, Grave E, Bojanowski P, Mikolov T (2017) Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics, vol 2, short papers. Association for Computational Linguistics, Valencia, Spain, pp 427–431. https://www.aclweb.org/anthology/E17-2068

  56. Klementiev A, Titov I, Bhattarai B (2012) Inducing crosslingual distributed representations of words. In: Proceedings of COLING 2012. The COLING 2012 Organizing Committee, Mumbai, India, pp 1459–1474. https://www.aclweb.org/anthology/C12-1089

  57. Kočiský T, Hermann KM, Blunsom P (2014) Learning bilingual word representations by marginalizing alignments. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, vol 2 short papers. Association for Computational Linguistics, Baltimore, Maryland, pp 224–229. https://doi.org/10.3115/v1/P14-2037. https://www.aclweb.org/anthology/P14-2037

  58. Lample G, Conneau A (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291

  59. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural architectures for named entity recognition. In: Proc. of NAACL, pp 260–270

  60. Lample G, Conneau A, Ranzato M, Denoyer L, Jégou H (2018) Word translation without parallel data. In: Proc. of ICLR

  61. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942

  62. Landauer TK, Dumais ST (1997) A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev 104(2):211–240

  63. Lazaridou A, Dinu G, Baroni M (2015) Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In: Proc. of ACL and IJCNLP, pp 270–280

  64. Liu NF, Gardner M, Belinkov Y, Peters ME, Smith NA (2019a) Linguistic knowledge and transferability of contextual representations. In: Proc. of NAACL, pp 1073–1094, https://doi.org/10.18653/v1/N19-1112

  65. Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, Wang P (2019b) K-BERT: enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606

  66. Liu X, He P, Chen W, Gao J (2019c) Multi-task deep neural networks for natural language understanding. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 4487–4496. https://doi.org/10.18653/v1/P19-1441. https://www.aclweb.org/anthology/P19-1441

  67. Lu A, Wang W, Bansal M, Gimpel K, Livescu K (2015) Deep multilingual correlation for improved word embeddings. In: Proceedings of the 2015 conference of the north american chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Denver, Colorado, pp 250–256. https://doi.org/10.3115/v1/N15-1028. https://www.aclweb.org/anthology/N15-1028

  68. Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Wallach H, Larochelle H, Beygelzimer A, d' Alché-Buc F, Fox E, Garnett R (eds) Advances in Neural Information Processing Systems. Curran Associates, Inc., pp 13–23. http://papers.nips.cc/paper/8297-vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks.pdf

  69. Luong T, Pham H, Manning CD (2015) Bilingual word representations with monolingual quality in mind. In: Proceedings of the 1st workshop on vector space modeling for natural language processing. Association for Computational Linguistics, Denver, Colorado, pp 151–159. https://doi.org/10.3115/v1/W15-1521. https://www.aclweb.org/anthology/W15-1521

  70. McCallum A, Freitag D, Pereira FCN (2000) Maximum entropy markov models for information extraction and segmentation. In: Proceedings of the seventeenth international conference on machine learning, ICML ’00. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 591–598

  71. McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural Information processing systems 30. Curran Associates, Inc., pp 6294–6305. http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf

  72. McRae K, Ferretti TR, Amyote L (1997) Thematic roles as verb-specific concepts. Lang Cogn Process 12(2–3):137–176

  73. Mikolov T, Karafiát M, Burget L, Cernocky J, Khudanpur S (2010) Recurrent neural network based language model. In: Kobayashi T, Hirose K, Nakamura S (eds) INTERSPEECH. ISCA, pp 1045–1048. https://www.bibsonomy.org/bibtex/2aee1e280d06e82474b17c4996aaea076/dblp

  74. Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. ICLR Workshop

  75. Mikolov T, Le QV, Sutskever I (2013b) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168

  76. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013c) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119

  77. Mikolov T, Yih Wt, Zweig G (2013d) Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 conference of the north American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Atlanta, Georgia, pp 746–751. https://www.aclweb.org/anthology/N13-1090

  78. Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41

  79. Mnih A, Hinton G (2007) Three new graphical models for statistical language modelling. In: Proceedings of the 24th international conference on machine learning, ICML ’07. Association for Computing Machinery, Corvalis, Oregon, USA, pp 641–648. https://doi.org/10.1145/1273496.1273577

  80. Mnih A, Hinton GE (2009) A scalable hierarchical distributed language model. Adv Neural Inf Process Syst 21:1081–1088

  81. Mulcaire P, Kasai J, Smith N (2019a) Polyglot contextual representations improve crosslingual transfer. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 3912–3918. https://doi.org/10.18653/v1/N19-1392. https://www.aclweb.org/anthology/N19-1392

  82. Mulcaire P, Kasai J, Smith NA (2019b) Low-resource parsing with crosslingual contextualized representations. In: Proceedings of the 23rd conference on computational natural language learning (CoNLL), Association for Computational Linguistics, Hong Kong, China, pp 304–315. https://doi.org/10.18653/v1/K19-1029. https://www.aclweb.org/anthology/K19-1029

  83. Neelakantan A, Shankar J, Passos A, McCallum A (2014) Efficient non-parametric estimation of multiple embeddings per word in vector space. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1059–1069. https://doi.org/10.3115/v1/D14-1113. https://www.aclweb.org/anthology/D14-1113

  84. Niven T, Kao H (2019) Probing neural network comprehension of natural language arguments. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 4658–4664. https://doi.org/10.18653/v1/P19-1459. https://www.aclweb.org/anthology/P19-1459

  85. Padó S, Lapata M (2007) Dependency-based construction of semantic space models. Comput Linguist 33(2):161–199

  86. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1532–1543. https://doi.org/10.3115/v1/D14-1162. https://www.aclweb.org/anthology/D14-1162

  87. Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: human language technologies, vol 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pp 2227–2237. https://doi.org/10.18653/v1/N18-1202. https://www.aclweb.org/anthology/N18-1202

  88. Peters ME, Neumann M, Logan R, Schwartz R, Joshi V, Singh S, Smith NA (2019) Knowledge enhanced contextual word representations. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 43–54. https://doi.org/10.18653/v1/D19-1005 https://www.aclweb.org/anthology/D19-1005

  89. Pires T, Schlinger E, Garrette D (2019) How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502

  90. Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

  91. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language models are unsupervised multitask learners. OpenAI Blog 1(8)

  92. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683

  93. Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) SQuAD: 100,000+ questions for machine comprehension of text. In: Proc. of EMNLP, pp 2383–2392

  94. Rajpurkar P, Jia R, Liang P (2018) Know what you don't know: unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822

  95. Reisinger J, Mooney RJ (2010) Multi-prototype vector-space models of word meaning. In: Proc. of HLT-NAACL, pp 109–117

  96. Ruder S, Vulić I, Søgaard A (2017) A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902

  97. Schnabel T, Labutov I, Mimno D, Joachims T (2015) Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 conference on empirical methods in natural language processing. Association for Computational Linguistics, Lisbon, Portugal, pp 298–307. https://doi.org/10.18653/v1/D15-1036. https://www.aclweb.org/anthology/D15-1036

  98. Schuster T, Ram O, Barzilay R, Globerson A (2019) Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 1599–1613. https://doi.org/10.18653/v1/N19-1162. https://www.aclweb.org/anthology/N19-1162

  99. Sharma P, Ding N, Goodman S, Soricut R (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proc. of ACL, pp 2556–2565, https://doi.org/10.18653/v1/P18-1238

  100. Smith SL, Turban DHP, Hamblin S, Hammerla NY (2017) Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In: Proc. of ICLR

  101. Snell J, Swersky K, Zemel R (2017) Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems, pp 4077–4087

  102. Song K, Tan X, Qin T, Lu J, Liu T (2019) MASS: masked sequence to sequence pre-training for language generation. In: Proc. of ICML, pp 5926–5936

  103. Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2019) VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530

  104. Sun C, Myers A, Vondrick C, Murphy K, Schmid C (2019a) VideoBERT: a joint model for video and language representation learning. arXiv preprint arXiv:1904.01766

  105. Sun Y, Wang S, Li Y, Feng S, Chen X, Zhang H, Tian X, Zhu D, Tian H, Wu H (2019b) ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223

  106. Sun Y, Wang S, Li Y, Feng S, Tian H, Wu H, Wang H (2019c) ERNIE 2.0: a continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412

  107. Tenney I, Das D, Pavlick E (2019) BERT rediscovers the classical NLP pipeline. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 4593–4601. https://doi.org/10.18653/v1/P19-1452

  108. Tian F, Dai H, Bian J, Gao B, Zhang R, Chen E, Liu TY (2014) A probabilistic model for learning multi-prototype word embeddings. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 151–160. https://www.aclweb.org/anthology/C14-1016

  109. Tsvetkov Y, Faruqui M, Ling W, Lample G, Dyer C (2015) Evaluation of word vector representations by subspace alignment. In: Proceedings of the 2015 conference on empirical methods in natural language processing. Association for Computational Linguistics, Lisbon, Portugal, pp 2049–2054. https://doi.org/10.18653/v1/D15-1243. https://www.aclweb.org/anthology/D15-1243

  110. Turian J, Ratinov LA, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Uppsala, Sweden, pp 384–394. https://www.aclweb.org/anthology/P10-1040

  111. Turney PD (2001a) Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: De Raedt L, Flach P (eds) Machine learning: ECML 2001. Springer, Berlin Heidelberg, pp 491–502

  112. Turney PD (2001b) Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: De Raedt L, Flach P (eds) Machine learning: ECML 2001. Springer, Berlin Heidelberg, pp 491–502

  113. Upadhyay S, Faruqui M, Dyer C, Roth D (2016) Cross-lingual models of word embeddings: an empirical comparison. In: Proceedings of the 54th annual meeting of the association for computational linguistics, vol 1: long papers. Association for Computational Linguistics, Berlin, Germany, pp 1661–1670. https://doi.org/10.18653/v1/P16-1157. https://www.aclweb.org/anthology/P16-1157

  114. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Lu, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., pp 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

  115. Vinyals O, Blundell C, Lillicrap T, Kavukcuoglu K, Wierstra D (2016) Matching networks for one shot learning. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., pp 3630–3638. http://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning.pdf

  116. Vulić I, Moens MF (2015) Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, vol 2 (short papers). Association for Computational Linguistics, Beijing, China, pp 719–725. https://doi.org/10.3115/v1/P15-2118. https://www.aclweb.org/anthology/P15-2118

  117. Wallace E, Feng S, Kandpal N, Gardner M, Singh S (2019) Universal adversarial triggers for attacking and analyzing NLP. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 2153–2162. https://doi.org/10.18653/v1/D19-1221. https://www.aclweb.org/anthology/D19-1221

  118. Wang A, Cho K (2019) BERT has a mouth, and it must speak: BERT as a Markov random field language model. In: Proc. of NeuralGen, pp 30–36, https://doi.org/10.18653/v1/W19-2304

  119. Wang P, Qian Y, Soong FK, He L, Zhao H (2015) Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint arXiv:1510.06168

  120. Wang Y, Che W, Guo J, Liu Y, Liu T (2019) Cross-lingual BERT transformation for zero-shot dependency parsing. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 5721–5727. https://doi.org/10.18653/v1/D19-1575. https://www.aclweb.org/anthology/D19-1575

  121. Williams A, Nangia N, Bowman SR (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426

  122. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144

  123. Xing C, Wang D, Liu C, Lin Y (2015) Normalized word embedding and orthogonal transform for bilingual word translation. In: Proceedings of the 2015 conference of the north American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Denver, Colorado, pp 1006–1011. https://doi.org/10.3115/v1/N15-1104. https://www.aclweb.org/anthology/N15-1104

  124. Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237

  125. Liu Y (2019) Sentence-level language analysis with contextualized word embeddings. PhD thesis, Harbin Institute of Technology

  126. Zhang H, Gong Y, Yan Y, Duan N, Xu J, Wang J, Gong M, Zhou M (2019a) Pretraining-based natural language generation for text summarization. arXiv preprint arXiv:1902.09243

  127. Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y (2019b) BERTScore: evaluating text generation with BERT. arXiv preprint arXiv:1904.09675

  128. Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q (2019c) ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 1441–1451. https://doi.org/10.18653/v1/P19-1139. https://www.aclweb.org/anthology/P19-1139

  129. Zhou J, Xu W (2015) End-to-end learning of semantic role labeling using recurrent neural networks. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, vol 1 (long papers). Association for Computational Linguistics, Beijing, China, pp 1127–1137. https://doi.org/10.3115/v1/P15-1109. https://www.aclweb.org/anthology/P15-1109

  130. Zou WY, Socher R, Cer D, Manning CD (2013) Bilingual word embeddings for phrase-based machine translation. In: Proceedings of the 2013 conference on empirical methods in natural language processing. Association for Computational Linguistics, Seattle, Washington, USA, pp 1393–1398. https://www.aclweb.org/anthology/D13-1141

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011, and 61772153.

Author information

Corresponding author

Correspondence to Wanxiang Che.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, Y., Hou, Y., Che, W. et al. From static to dynamic word representations: a survey. Int. J. Mach. Learn. & Cyber. 11, 1611–1630 (2020). https://doi.org/10.1007/s13042-020-01069-8
