Abstract
Throughout the history of natural language processing (NLP), the representation of words has been a central research topic. In this survey, we provide a comprehensive typology of word representation models from a novel perspective: the development from static to dynamic embeddings can effectively address the polysemy problem, which has long been a great challenge in this field. The survey then covers the main evaluation metrics and applications of these word embeddings. We further discuss the development of word embeddings from static to dynamic in cross-lingual scenarios. Finally, we point out some open issues and directions for future work.
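To make the abstract's central contrast concrete, the sketch below (illustrative only, with toy two-dimensional vectors and hypothetical helper names) shows why a static embedding cannot separate the senses of a polysemous word such as "bank", while a dynamic (contextualized) embedding, here crudely faked by mixing in neighbouring word vectors, produces sense-dependent representations:

```python
# Toy lookup table: a static model stores exactly one vector per word type.
STATIC = {"bank": [0.1, 0.9], "river": [0.8, 0.2], "money": [0.0, 1.0]}

def static_embed(word, sentence):
    # The sentence argument is ignored: "bank" gets the same vector
    # whether it means a financial institution or a riverside.
    return STATIC[word]

def dynamic_embed(word, sentence):
    # A stand-in for a contextual encoder: average the target vector
    # with the vectors of its in-vocabulary neighbours, so the output
    # depends on the surrounding words.
    neighbours = [STATIC[w] for w in sentence if w != word and w in STATIC]
    ctx = [sum(dim) / len(neighbours) for dim in zip(*neighbours)]
    return [(a + b) / 2 for a, b in zip(STATIC[word], ctx)]

s1 = ["deposit", "money", "in", "the", "bank"]   # financial sense
s2 = ["the", "river", "bank", "was", "muddy"]    # riverside sense

assert static_embed("bank", s1) == static_embed("bank", s2)    # one vector for both senses
assert dynamic_embed("bank", s1) != dynamic_embed("bank", s2)  # vectors differ by context
```

Real dynamic models (ELMo, BERT, GPT) compute the context-dependent vector with deep neural encoders rather than neighbour averaging, but the interface is the same: the representation of a token is a function of the whole input sequence, not of the word type alone.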
Notes
These embeddings are contextualized or dynamic as opposed to the traditional ones.
Please refer to [27] for more detailed comparison and analysis of these distributional representation models.
We will use \({\varvec{C}}(w)\) to denote the distributed embedding of word w in the rest of this paper.
Please refer to the paper [87] for detailed results.
Please refer to the paper [90] for a detailed description of the pre-processing techniques.
Such functional tokens are also used by GPT, but are only introduced during fine-tuning.
See https://github.com/thunlp/PLMpapers and https://github.com/cedrickchee/awesome-bert-nlp for the latest progress on dynamic word representations.
Please refer to the paper [124] for implementation details.
References
Almuhareb A (2006) Attributes in lexical acquisition. PhD thesis, University of Essex
Artetxe M, Ruder S, Yogatama D (2019) On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856
Bakarov A (2018) A survey of word embeddings evaluation methods. arXiv preprint arXiv:1801.09536
Baroni M, Evert S, Lenci A (2008) Bridging the gap between semantic theory and computational simulations. In: Proc. of the ESSLLI workshop on distributional lexical semantics. FOLLI, Hamburg
Baroni M, Murphy B, Barbu E, Poesio M (2010) Strudel: a corpus-based semantic model based on properties and types. Cogn Sci 34:222–254
Baroni M, Dinu G, Kruszewski G (2014) Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Baltimore, Maryland, pp 238–247. https://doi.org/10.3115/v1/P14-1023. https://www.aclweb.org/anthology/P14-1023
Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
Bowman SR, Angeli G, Potts C, Manning CD (2015) A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326
Brown PF, deSouza PV, Mercer RL, Pietra VJD, Lai JC (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–479
Bruni E, Tran NK, Baroni M (2014) Multimodal distributional semantics. J Artif Int Res 49:1–47
Chen D, Manning C (2014) A fast and accurate dependency parser using neural networks. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 740–750 https://doi.org/10.3115/v1/D14-1082. https://www.aclweb.org/anthology/D14-1082
Chen X, Cardie C (2018) Unsupervised multilingual word embeddings. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, Belgium, pp 261–270. https://doi.org/10.18653/v1/D18-1024. https://www.aclweb.org/anthology/D18-1024
Chen X, Liu Z, Sun M (2014) A unified model for word sense representation and disambiguation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1025–1035. https://doi.org/10.3115/v1/D14-1110. https://www.aclweb.org/anthology/D14-1110
Cho K, van Merriënboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder–decoder approaches. In: Proceedings of SSST-8, eighth workshop on syntax, semantics and structure in statistical translation. Association for Computational Linguistics, Doha, Qatar, pp 103–111.https://doi.org/10.3115/v1/W14-4012. https://www.aclweb.org/anthology/W14-4012
Clark K, Luong MT, Manning CD, Le QV (2018) Semi-supervised sequence modeling with cross-view training. In: Proceedings of the 2018 conference on empirical methods in natural language processing. Association for Computational Linguistics, Brussels, Belgium, pp 1914–1925. https://doi.org/10.18653/v1/D18-1217. https://www.aclweb.org/anthology/D18-1217
Clark K, Khandelwal U, Levy O, Manning CD (2019a) What does BERT look at? An analysis of BERT’s attention. In: Proceedings of the 2019 ACL Workshop BlackboxNLP: analyzing and interpreting neural networks for NLP. Association for Computational Linguistics, Florence, Italy, pp 276–286. https://doi.org/10.18653/v1/W19-4828. https://www.aclweb.org/anthology/W19-4828
Clark K, Luong MT, Khandelwal U, Manning CD, Le QV (2019b) BAM! born-again multi-task networks for natural language understanding. In: Proc. of ACL, pp 5931–5937
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave E, Ott M, Zettlemoyer L, Stoyanov V (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116
Cui Y, Che W, Liu T, Qin B, Yang Z, Wang S, Hu G (2019) Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101
Dagan I, Pereira F, Lee L (1994) Similarity-based estimation of word cooccurrence probabilities. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Las Cruces, New Mexico, USA, pp 272–278. https://doi.org/10.3115/981732.981770
Dai Z, Yang Z, Yang Y, Cohen WW, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Dinu G, Lapata M (2010) Measuring distributional similarity in context. In: Proc. of EMNLP, pp 1162–1172
Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika 1(3):211–218
Faruqui M, Dyer C (2014) Improving vector space word representations using multilingual correlation. In: Proceedings of the 14th conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Gothenburg, Sweden, pp 462–471. https://doi.org/10.3115/v1/E14-1049. https://www.aclweb.org/anthology/E14-1049
Fei-Fei L (2006) Knowledge transfer in learning to recognize visual objects classes. In: International Conference on Development and Learning. Department of Psychological and Brain Sciences, Indiana University, pp 1–8
Fink M (2005) Object classification from a single example utilizing class relevance metrics. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems. MIT Press, pp 449–456. http://papers.nips.cc/paper/2576-object-classification-from-a-single-example-utilizing-class-relevance-metrics.pdf
Finkelstein L, Gabrilovich E, Matias Y, Rivlin E, Solan Z, Wolfman G, Ruppin E (2001) Placing search in context: the concept revisited. ACM Trans Inf Syst. https://doi.org/10.1145/503104.503110
Firth JR (1957) A synopsis of linguistic theory 1930–1955. In: Studies in linguistic analysis (special volume of the Philological Society), vol 1952–1959. The Philological Society, Oxford, pp 1–32. https://www.bibsonomy.org/bibtex/25e3d6c72cdd123a638f71886d78f3c1e/brightbyte
Gao B, Bian J, Liu TY (2014) WordRep: a benchmark for research on learning word representations. arXiv preprint arXiv:1407.1640
Gerz D, Vulić I, Hill F, Reichart R, Korhonen A (2016) SimVerb-3500: A large-scale evaluation set of verb similarity. In: Proceedings of the 2016 conference on empirical methods in natural language processing. Association for Computational Linguistics, Austin, Texas, pp 2173–2182. https://doi.org/10.18653/v1/D16-1235. https://www.aclweb.org/anthology/D16-1235
Ghannay S, Favre B, Estève Y, Camelin N (2016) Word embedding evaluation and combination. In: Proceedings of the tenth international conference on language resources and evaluation (LREC'16). Portorož, Slovenia, pp 300–305 https://www.aclweb.org/anthology/L16-1046
Gladkova A, Drozd A (2016) Intrinsic evaluations of word embeddings: what can we do better? In: Proceedings of the 1st workshop on evaluating vector-space representations for NLP. Association for Computational Linguistics, Berlin, Germany, pp 36–42. https://doi.org/10.18653/v1/W16-2507. https://www.aclweb.org/anthology/W16-2507
Gladkova A, Drozd A, Matsuoka S (2016) Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In: Proceedings of the NAACL student research workshop. Association for Computational Linguistics, San Diego, California, pp 8–15. https://doi.org/10.18653/v1/N16-2002. https://www.aclweb.org/anthology/N16-2002
Golovanov S, Kurbanov R, Nikolenko S, Truskovskyi K, Tselousov A, Wolf T (2019) Large-scale transfer learning for natural language generation. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 6053–6058. https://doi.org/10.18653/v1/P19-1608. https://www.aclweb.org/anthology/P19-1608
Greenberg C, Demberg V, Sayeed A (2015) Verb polysemy and frequency effects in thematic fit modeling. In: Proceedings of the 6th workshop on cognitive modeling and computational linguistics. Association for Computational Linguistics, Denver, Colorado, pp 48–57. https://doi.org/10.3115/v1/W15-1106. https://www.aclweb.org/anthology/W15-1106
Guo J, Che W, Wang H, Liu T (2014) Learning sense-specific word embeddings by exploiting bilingual resources. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 497–507. https://www.aclweb.org/anthology/C14-1048
Guo J, Che W, Yarowsky D, Wang H, Liu T (2015) Cross-lingual dependency parsing based on distributed representations. In: Proc. of ACL and IJCNLP, pp 1234–1244
Guo J, Che W, Yarowsky D, Wang H, Liu T (2016a) A distributed representation-based framework for cross-lingual transfer parsing. J Artif Int Res 55(1):995–1023
Guo J, Che W, Yarowsky D, Wang H, Liu T (2016b) A representation learning framework for multi-source transfer parsing. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, AAAI’16. AAAI Press, Phoenix, Arizona, pp 2734–2740.
Hermann KM, Blunsom P (2014) Multilingual models for compositional distributed semantics. In: Proc. of ACL, pp 58–68
Hewitt J, Manning CD (2019) A structural probe for finding syntax in word representations. In: Proc. of NAACL, pp 4129–4138, https://doi.org/10.18653/v1/N19-1419
Heyman G, Verreet B, Vulić I, Moens MF (2019) Learning unsupervised multilingual word embeddings with incremental multilingual hubs. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: human language technologies, vol 1 (long and short papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 1890–1902. https://doi.org/10.18653/v1/N19-1188. https://www.aclweb.org/anthology/N19-1188
Hou Y, Zhou Z, Liu Y, Wang N, Che W, Liu H, Liu T (2019) Few-shot sequence labeling with label dependency transfer. arXiv preprint arXiv:1906.08711
Howard J, Ruder S (2018) Universal language model fine-tuning for text classification. In: Proceedings of the 56th annual meeting of the association for computational linguistics (vol 1: long papers), Association for Computational Linguistics, Melbourne, Australia, pp 328–339. https://doi.org/10.18653/v1/P18-1031. https://www.aclweb.org/anthology/P18-1031
Huang E, Socher R, Manning C, Ng A (2012) Improving word representations via global context and multiple word prototypes. In: Proceedings of the 56th annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 873–882
Huang F, Yates A (2009) Distributional representations for handling sparsity in supervised sequence-labeling. In: Proc. of ACL and IJCNLP, pp 495–503
Iacobacci I, Pilehvar MT, Navigli R (2016) Embeddings for word sense disambiguation: an evaluation study. In: Proc. of ACL, pp 897–907
Jarmasz M, Szpakowicz S (2003) Roget’s thesaurus and semantic similarity. In: Proc. of RANLP, pp 212–219
Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529
Joulin A, Grave E, Bojanowski P, Mikolov T (2017) Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics, vol 2, short papers. Association for Computational Linguistics, Valencia, Spain, pp 427–431. https://www.aclweb.org/anthology/E17-2068
Klementiev A, Titov I, Bhattarai B (2012) Inducing crosslingual distributed representations of words. In: Proceedings of COLING 2012. The COLING 2012 Organizing Committee, Mumbai, India, pp 1459–1474. https://www.aclweb.org/anthology/C12-1089
Kočiský T, Hermann KM, Blunsom P (2014) Learning bilingual word representations by marginalizing alignments. In: Proceedings of the 52nd annual meeting of the association for computational linguistics, vol 2 short papers. Association for Computational Linguistics, Baltimore, Maryland, pp 224–229. https://doi.org/10.3115/v1/P14-2037. https://www.aclweb.org/anthology/P14-2037
Lample G, Conneau A (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291
Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C (2016) Neural architectures for named entity recognition. In: Proc. of NAACL, pp 260–270
Lample G, Conneau A, Ranzato M, Denoyer L, Jégou H (2018) Word translation without parallel data. In: Proc. of ICLR
Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942
Landauer TK, Dumais ST (1997) A solution to Plato's problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev 104(2):211–240
Lazaridou A, Dinu G, Baroni M (2015) Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In: Proc. of ACL and IJCNLP, pp 270–280
Liu NF, Gardner M, Belinkov Y, Peters ME, Smith NA (2019a) Linguistic knowledge and transferability of contextual representations. In: Proc. of NAACL, pp 1073–1094, https://doi.org/10.18653/v1/N19-1112
Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, Wang P (2019b) K-BERT: enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606
Liu X, He P, Chen W, Gao J (2019c) Multi-task deep neural networks for natural language understanding. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 4487–4496. https://doi.org/10.18653/v1/P19-1441. https://www.aclweb.org/anthology/P19-1441
Lu A, Wang W, Bansal M, Gimpel K, Livescu K (2015) Deep multilingual correlation for improved word embeddings. In: Proceedings of the 2015 conference of the north american chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Denver, Colorado, pp 250–256. https://doi.org/10.3115/v1/N15-1028. https://www.aclweb.org/anthology/N15-1028
Lu J, Batra D, Parikh D, Lee S (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., pp 13–23. http://papers.nips.cc/paper/8297-vilbert-pretraining-task-agnostic-visiolinguistic-representations-for-vision-and-language-tasks.pdf
Luong T, Pham H, Manning CD (2015) Bilingual word representations with monolingual quality in mind. In: Proceedings of the 1st workshop on vector space modeling for natural language processing. Association for Computational Linguistics, Denver, Colorado, pp 151–159. https://doi.org/10.3115/v1/W15-1521. https://www.aclweb.org/anthology/W15-1521
McCallum A, Freitag D, Pereira FCN (2000) Maximum entropy markov models for information extraction and segmentation. In: Proceedings of the seventeenth international conference on machine learning, ICML ’00. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 591–598
McCann B, Bradbury J, Xiong C, Socher R (2017) Learned in translation: contextualized word vectors. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural Information processing systems 30. Curran Associates, Inc., pp 6294–6305. http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.pdf
McRae K, Ferretti TR, Amyote L (1997) Thematic roles as verb-specific concepts. Lang Cogn Process 12(2–3):137–176
Mikolov T, Karafiát M, Burget L, Cernocky J, Khudanpur S (2010) Recurrent neural network based language model. In: Kobayashi T, Hirose K, Nakamura S (eds) INTERSPEECH. ISCA, pp 1045–1048. https://www.bibsonomy.org/bibtex/2aee1e280d06e82474b17c4996aaea076/dblp
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. ICLR Workshop
Mikolov T, Le QV, Sutskever I (2013b) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013c) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Mikolov T, Yih Wt, Zweig G (2013d) Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 conference of the north American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Atlanta, Georgia, pp 746–751. https://www.aclweb.org/anthology/N13-1090
Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41
Mnih A, Hinton G (2007) Three new graphical models for statistical language modelling. In: Proceedings of the 24th international conference on machine learning, ICML ’07. Association for Computing Machinery, Corvalis, Oregon, USA, pp 641–648. https://doi.org/10.1145/1273496.1273577
Mnih A, Hinton GE (2009) A scalable hierarchical distributed language model. Adv Neural Inf Process Syst 21:1081–1088
Mulcaire P, Kasai J, Smith N (2019a) Polyglot contextual representations improve crosslingual transfer. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 3912–3918. https://doi.org/10.18653/v1/N19-1392. https://www.aclweb.org/anthology/N19-1392
Mulcaire P, Kasai J, Smith NA (2019b) Low-resource parsing with crosslingual contextualized representations. In: Proceedings of the 23rd conference on computational natural language learning (CoNLL), Association for Computational Linguistics, Hong Kong, China, pp 304–315. https://doi.org/10.18653/v1/K19-1029. https://www.aclweb.org/anthology/K19-1029
Neelakantan A, Shankar J, Passos A, McCallum A (2014) Efficient non-parametric estimation of multiple embeddings per word in vector space. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1059–1069. https://doi.org/10.3115/v1/D14-1113. https://www.aclweb.org/anthology/D14-1113
Niven T, Kao H (2019) Probing neural network comprehension of natural language arguments. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 4658–4664. https://doi.org/10.18653/v1/P19-1459. https://www.aclweb.org/anthology/P19-1459
Padó S, Lapata M (2007) Dependency-based construction of semantic space models. Comput Linguist 33(2):161–199
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1532–1543. https://doi.org/10.3115/v1/D14-1162. https://www.aclweb.org/anthology/D14-1162
Peters M, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the north American chapter of the association for computational linguistics: human language technologies, vol 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pp 2227–2237. https://doi.org/10.18653/v1/N18-1202. https://www.aclweb.org/anthology/N18-1202
Peters ME, Neumann M, Logan R, Schwartz R, Joshi V, Singh S, Smith NA (2019) Knowledge enhanced contextual word representations. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 43–54. https://doi.org/10.18653/v1/D19-1005 https://www.aclweb.org/anthology/D19-1005
Pires T, Schlinger E, Garrette D (2019) How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502
Radford A, Narasimhan K, Salimans T, Sutskever I (2018) Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language models are unsupervised multitask learners. OpenAI Blog 1(8)
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683
Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) SQuAD: 100,000+ questions for machine comprehension of text. In: Proc. of EMNLP, pp 2383–2392
Rajpurkar P, Jia R, Liang P (2018) Know what you don't know: unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822
Reisinger J, Mooney RJ (2010) Multi-prototype vector-space models of word meaning. In: Proc. of HLT-NAACL, pp 109–117
Ruder S, Vulic I, Søgaard A (2017) A survey of cross-lingual embedding models. arXiv preprint arXiv:1706.04902
Schnabel T, Labutov I, Mimno D, Joachims T (2015) Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 conference on empirical methods in natural language processing. Association for Computational Linguistics, Lisbon, Portugal, pp 298–307. https://doi.org/10.18653/v1/D15-1036. https://www.aclweb.org/anthology/D15-1036
Schuster T, Ram O, Barzilay R, Globerson A (2019) Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In: Proceedings of the 2019 conference of the north American chapter of the association for computational linguistics: human language technologies, vol 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 1599–1613. https://doi.org/10.18653/v1/N19-1162. https://www.aclweb.org/anthology/N19-1162
Sharma P, Ding N, Goodman S, Soricut R (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proc. of ACL, pp 2556–2565, https://doi.org/10.18653/v1/P18-1238
Smith SL, Turban DHP, Hamblin S, Hammerla NY (2017) Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In: Proc. of ICLR
Snell J, Swersky K, Zemel R (2017) Prototypical networks for few-shot learning. In: Advances in Neural Information Processing Systems, pp 4077–4087
Song K, Tan X, Qin T, Lu J, Liu T (2019) MASS: masked sequence to sequence pre-training for language generation. In: Proc. of ICML, pp 5926–5936
Su W, Zhu X, Cao Y, Li B, Lu L, Wei F, Dai J (2019) VL-BERT: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530
Sun C, Myers A, Vondrick C, Murphy K, Schmid C (2019a) VideoBERT: a joint model for video and language representation learning. arXiv preprint arXiv:1904.01766
Sun Y, Wang S, Li Y, Feng S, Chen X, Zhang H, Tian X, Zhu D, Tian H, Wu H (2019b) ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223
Sun Y, Wang S, Li Y, Feng S, Tian H, Wu H, Wang H (2019c) ERNIE 2.0: a continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412
Tenney I, Das D, Pavlick E (2019) BERT rediscovers the classical NLP pipeline. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 4593–4601. https://doi.org/10.18653/v1/P19-1452
Tian F, Dai H, Bian J, Gao B, Zhang R, Chen E, Liu TY (2014) A probabilistic model for learning multi-prototype word embeddings. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers. Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 151–160. https://www.aclweb.org/anthology/C14-1016
Tsvetkov Y, Faruqui M, Ling W, Lample G, Dyer C (2015) Evaluation of word vector representations by subspace alignment. In: Proceedings of the 2015 conference on empirical methods in natural language processing. Association for Computational Linguistics, Lisbon, Portugal, pp 2049–2054. https://doi.org/10.18653/v1/D15-1243. https://www.aclweb.org/anthology/D15-1243
Turian J, Ratinov LA, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Uppsala, Sweden, pp 384–394. https://www.aclweb.org/anthology/P10-1040
Turney PD (2001a) Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: De Raedt L, Flach P (eds) Machine learning: ECML 2001. Springer, Berlin Heidelberg, pp 491–502
Turney PD (2001b) Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: De Raedt L, Flach P (eds) Machine learning: ECML 2001. Springer, Berlin Heidelberg, pp 491–502
Upadhyay S, Faruqui M, Dyer C, Roth D (2016) Cross-lingual models of word embeddings: an empirical comparison. In: Proceedings of the 54th annual meeting of the association for computational linguistics, vol 1: long papers. Association for Computational Linguistics, Berlin, Germany, pp 1661–1670. https://doi.org/10.18653/v1/P16-1157. https://www.aclweb.org/anthology/P16-1157
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Lu, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., pp 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Vinyals O, Blundell C, Lillicrap T, Kavukcuoglu K, Wierstra D (2016) Matching networks for one shot learning. In: Lee DD, Sugiyama M, Luxburg UV, Guyon I, Garnett R (eds) Advances in neural information processing systems. Curran Associates, Inc., pp 3630–3638. http://papers.nips.cc/paper/6385-matching-networks-for-one-shot-learning.pdf
Vulić I, Moens MF (2015) Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, vol 2 (short papers). Association for Computational Linguistics, Beijing, China, pp 719–725. https://doi.org/10.3115/v1/P15-2118. https://www.aclweb.org/anthology/P15-2118
Wallace E, Feng S, Kandpal N, Gardner M, Singh S (2019) Universal adversarial triggers for attacking and analyzing NLP. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 2153–2162. https://doi.org/10.18653/v1/D19-1221. https://www.aclweb.org/anthology/D19-1221
Wang A, Cho K (2019) BERT has a mouth, and it must speak: BERT as a Markov random field language model. In: Proc. of NeuralGen, pp 30–36, https://doi.org/10.18653/v1/W19-2304
Wang P, Qian Y, Soong FK, He L, Zhao H (2015) Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. arXiv preprint arXiv:1510.06168
Wang Y, Che W, Guo J, Liu Y, Liu T (2019) Cross-lingual BERT transformation for zero-shot dependency parsing. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 5721–5727. https://doi.org/10.18653/v1/D19-1575. https://www.aclweb.org/anthology/D19-1575
Williams A, Nangia N, Bowman SR (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144
Xing C, Wang D, Liu C, Lin Y (2015) Normalized word embedding and orthogonal transform for bilingual word translation. In: Proceedings of the 2015 conference of the north American chapter of the association for computational linguistics: human language technologies. Association for Computational Linguistics, Denver, Colorado, pp 1006–1011. https://doi.org/10.3115/v1/N15-1104. https://www.aclweb.org/anthology/N15-1104
Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov R, Le QV (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237
Liu Y (2019) Sentence-level language analysis with contextualized word embeddings. PhD thesis, Harbin Institute of Technology
Zhang H, Gong Y, Yan Y, Duan N, Xu J, Wang J, Gong M, Zhou M (2019a) Pretraining-based natural language generation for text summarization. arXiv preprint arXiv:1902.09243
Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y (2019b) BERTScore: evaluating text generation with BERT. arXiv preprint arXiv:1904.09675
Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q (2019c) ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 1441–1451. https://doi.org/10.18653/v1/P19-1139. https://www.aclweb.org/anthology/P19-1139
Zhou J, Xu W (2015) End-to-end learning of semantic role labeling using recurrent neural networks. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing, vol 1 (long papers). Association for Computational Linguistics, Beijing, China, pp 1127–1137. https://doi.org/10.3115/v1/P15-1109. https://www.aclweb.org/anthology/P15-1109
Zou WY, Socher R, Cer D, Manning CD (2013) Bilingual word embeddings for phrase-based machine translation. In: Proceedings of the 2013 conference on empirical methods in natural language processing. Association for Computational Linguistics, Seattle, Washington, USA, pp 1393–1398. https://www.aclweb.org/anthology/D13-1141
Funding
This work was supported by the National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011, and 61772153.
Cite this article
Wang, Y., Hou, Y., Che, W. et al. From static to dynamic word representations: a survey. Int. J. Mach. Learn. & Cyber. 11, 1611–1630 (2020). https://doi.org/10.1007/s13042-020-01069-8