Abstract
Cross-lingual word embeddings serve as fundamental components of many Web-based applications. However, current models learn cross-lingual word embeddings by projecting two pre-trained monolingual embeddings produced by well-known models such as word2vec. This procedure cannot discriminate between crucial properties of words such as homonymy and polysemy. In this paper, we propose a novel framework for learning better cross-lingual word embeddings with latent topics. In this framework, we first incorporate latent topical representations into the Skip-Gram model to learn high-quality monolingual word embeddings. We then use supervised and unsupervised methods to train cross-lingual word embeddings with topical information. We evaluate our framework on cross-lingual Web search tasks using the CLEF test collections. The results show that our framework outperforms previous state-of-the-art methods for generating cross-lingual word embeddings.
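To make the projection step mentioned in the abstract concrete, the following is a minimal sketch of one common supervised projection technique, orthogonal Procrustes alignment, applied to toy data. It is not taken from the paper: the function name `learn_orthogonal_map`, the random "embeddings", and the choice of an orthogonal map are all assumptions made for illustration, and the topical information the framework adds is not modelled here.

```python
import numpy as np


def learn_orthogonal_map(src, tgt):
    """Return the orthogonal W minimising ||src @ W - tgt||_F (Procrustes).

    src, tgt: arrays of shape (n_pairs, dim) holding the embeddings of
    word pairs from a seed dictionary (hypothetical toy data below).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Toy stand-ins for two monolingual embedding spaces (assumption: the
# target space is an exact rotation of the source space).
rng = np.random.default_rng(0)
dim, n_pairs = 4, 50
src = rng.normal(size=(n_pairs, dim))
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt = src @ true_rotation

w = learn_orthogonal_map(src, tgt)
mapped = src @ w  # source vectors projected into the target space
```

In this idealised setting the learned map recovers the rotation, so `mapped` coincides with `tgt` up to floating-point error. Real embedding spaces are only approximately related by such a map.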
Notes
- 1.
- 2.
- 3. English and Dutch are Germanic languages; Italian and French are Romance languages.
- 4.
Acknowledgement
This work was supported by the National Natural Science Foundation of China under Project No. 61876062.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Peng, X., Zhou, D. (2020). A Framework for Learning Cross-Lingual Word Embedding with Topics. In: Wang, X., Zhang, R., Lee, YK., Sun, L., Moon, YS. (eds) Web and Big Data. APWeb-WAIM 2020. Lecture Notes in Computer Science, vol 12318. Springer, Cham. https://doi.org/10.1007/978-3-030-60290-1_22
DOI: https://doi.org/10.1007/978-3-030-60290-1_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60289-5
Online ISBN: 978-3-030-60290-1
eBook Packages: Computer Science, Computer Science (R0)