
Inducing Bilingual Word Representations for Non-isomorphic Spaces by an Unsupervised Way

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12815)


Abstract

Bilingual word representations (BWRs) play a key role in many natural language processing (NLP) tasks, especially cross-lingual applications such as machine translation and cross-lingual information retrieval. Most existing methods learn BWRs with offline unsupervised mappings, which rely on the isomorphic assumption that word representations follow similar distributions across languages. Several authors have questioned this assumption, arguing that the representation spaces of many language pairs are non-isomorphic. In this paper, we adopt a novel unsupervised method that trains BWRs jointly. We first use a dynamic programming algorithm to detect continuous bilingual segments, and then use the extracted bilingual data together with monolingual corpora to train BWRs jointly. Experiments on a real-world dataset show that our approach outperforms several baselines. (By unsupervised, we mean that no cross-lingual resources such as parallel text or bilingual lexicons are used directly.)
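To make the segment-detection step concrete, the sketch below shows one plausible dynamic-programming scheme for finding a continuous bilingual segment pair: a longest-common-run recurrence over word-pair similarities. This is an illustrative reconstruction, not the paper's exact algorithm; the `toy_sim` table, the `threshold` parameter, and the function name `detect_segment` are all hypothetical stand-ins for cross-lingual embedding similarity.

```python
def detect_segment(src, tgt, sim, threshold=0.5):
    """Return the longest contiguous (src_span, tgt_span) whose aligned
    word pairs all exceed `threshold`, via a longest-common-run DP.

    dp[i][j] is the length of the longest run of similar word pairs
    ending at src[i-1] / tgt[j-1]; it extends dp[i-1][j-1] when the
    current pair is similar enough, and resets to 0 otherwise.
    """
    n, m = len(src), len(tgt)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best_len, best_end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if sim.get((src[i - 1], tgt[j - 1]), 0.0) > threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_end = dp[i][j], (i, j)
    i, j = best_end
    return src[i - best_len:i], tgt[j - best_len:j]

# Toy cross-lingual similarity scores (hypothetical values standing in
# for cosine similarities between mapped word embeddings).
toy_sim = {
    ("the", "le"): 0.8, ("red", "rouge"): 0.9, ("cat", "chat"): 0.85,
}
src = ["the", "red", "cat", "sleeps"]
tgt = ["le", "rouge", "chat", "dort"]
print(detect_segment(src, tgt, toy_sim))
# → (['the', 'red', 'cat'], ['le', 'rouge', 'chat'])
```

In this toy run, the three consecutive high-similarity pairs form the detected bilingual segment; such extracted segments would then supplement monolingual corpora for joint training.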

Supported by Northwestern Polytechnical University and Zhejiang University.


Notes

  1. By etymologically close languages we mean closely related pairs such as English-French; distant languages are etymologically different pairs such as English-Chinese.

  2. https://code.google.com/p/word2vec/.

  3. Most related work extracts parallel sentences to improve machine translation systems, so the recall of the extracted parallel data also matters there. Our method only aims to obtain some (rather than all) good-quality parallel data (words or phrases); parallel sentences are not necessary.

  4. In this paper, we define a phrase as containing at least three words. We also test how the number of words affects the results in the experimental section.

  5. https://github.com/alex-berard/multivec.

  6. https://github.com/alex-berard/multivec.

  7. https://github.com/attardi/wikiextractor.

  8. https://github.com/facebookresearch/MUSE.


Acknowledgments

This work is supported by the National Natural Science Foundation of China (61906158), the Project of Science and Technology Research in Henan Province (212102210075).

Author information


Corresponding author

Correspondence to Chenggang Mi.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhu, S., Mi, C., Zhang, L. (2021). Inducing Bilingual Word Representations for Non-isomorphic Spaces by an Unsupervised Way. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science, vol. 12815. Springer, Cham. https://doi.org/10.1007/978-3-030-82136-4_37

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-82136-4_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82135-7

  • Online ISBN: 978-3-030-82136-4

  • eBook Packages: Computer Science (R0)
