
Inducing Bilingual Word Representations for Non-isomorphic Spaces by an Unsupervised Way

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12815)


Abstract

Bilingual word representations (BWRs) play a key role in many natural language processing (NLP) tasks, especially cross-lingual applications such as machine translation and cross-lingual information retrieval. Most existing methods learn BWRs with offline unsupervised mappings, which rely on the isomorphic assumption that word representations follow similar distributions across languages. Several authors have questioned this assumption, arguing that the representation spaces of many language pairs are non-isomorphic. In this paper, we adopt a novel unsupervised method that trains BWRs jointly. We first use a dynamic programming algorithm to detect continuous bilingual segments, and then use the extracted bilingual data together with monolingual corpora to train BWRs jointly. Experiments on a real-world dataset show that our approach outperforms several baselines. (By unsupervised, we mean that no cross-lingual resources such as parallel text or bilingual lexicons are used directly.)
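To make the segment-detection step concrete, the sketch below shows one plausible dynamic-programming scheme for finding a continuous bilingual segment pair: a longest-common-run recurrence over word-pair similarities. This is an illustrative reconstruction, not the paper's exact algorithm; the `toy_sim` table, the `threshold` parameter, and the function name `detect_segment` are all hypothetical stand-ins for cross-lingual embedding similarity.

```python
def detect_segment(src, tgt, sim, threshold=0.5):
    """Return the longest contiguous (src_span, tgt_span) whose aligned
    word pairs all exceed `threshold`, via a longest-common-run DP.

    dp[i][j] is the length of the longest run of similar word pairs
    ending at src[i-1] / tgt[j-1]; it extends dp[i-1][j-1] when the
    current pair is similar enough, and resets to 0 otherwise.
    """
    n, m = len(src), len(tgt)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best_len, best_end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if sim.get((src[i - 1], tgt[j - 1]), 0.0) > threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_end = dp[i][j], (i, j)
    i, j = best_end
    return src[i - best_len:i], tgt[j - best_len:j]

# Toy cross-lingual similarity scores (hypothetical values standing in
# for cosine similarities between mapped word embeddings).
toy_sim = {
    ("the", "le"): 0.8, ("red", "rouge"): 0.9, ("cat", "chat"): 0.85,
}
src = ["the", "red", "cat", "sleeps"]
tgt = ["le", "rouge", "chat", "dort"]
print(detect_segment(src, tgt, toy_sim))
# → (['the', 'red', 'cat'], ['le', 'rouge', 'chat'])
```

In this toy run, the three consecutive high-similarity pairs form the detected bilingual segment; such extracted segments would then supplement monolingual corpora for joint training.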

Supported by Northwestern Polytechnical University and Zhejiang University.


Notes

  1. By etymologically close languages we mean closely related pairs such as English-French; distant languages are etymologically different pairs such as English-Chinese.

  2. https://code.google.com/p/word2vec/.

  3. Most related work extracts parallel sentences to improve machine translation systems, so the recall of the extracted parallel data also matters there. Our method only aims to obtain some (rather than all) good-quality parallel data (words or phrases); parallel sentences are not necessary.

  4. In this paper, we define a phrase as containing at least three words. We also test how the number of words affects the results in the experimental section.

  5. https://github.com/alex-berard/multivec.

  6. https://github.com/alex-berard/multivec.

  7. https://github.com/attardi/wikiextractor.

  8. https://github.com/facebookresearch/MUSE.


Acknowledgments

This work is supported by the National Natural Science Foundation of China (61906158), the Project of Science and Technology Research in Henan Province (212102210075).

Author information


Corresponding author

Correspondence to Chenggang Mi.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhu, S., Mi, C., Zhang, L. (2021). Inducing Bilingual Word Representations for Non-isomorphic Spaces by an Unsupervised Way. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science, vol. 12815. Springer, Cham. https://doi.org/10.1007/978-3-030-82136-4_37

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-82136-4_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-82135-7

  • Online ISBN: 978-3-030-82136-4

  • eBook Packages: Computer Science (R0)
