Learning Bilingual Lexicon for Low-Resource Language Pairs

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10619)

Abstract

Learning a bilingual lexicon from monolingual data is a novel idea in natural language processing that can benefit many low-resource language pairs. In this paper, we present an approach for obtaining a bilingual lexicon from monolingual data. Our method requires only a small seed bilingual lexicon: we use Canonical Correlation Analysis (CCA) to construct a shared latent space that explains how the two monolingual embedding spaces are linked. Experimental results show that a bilingual lexicon of considerable precision and size can be learned from Chinese-Uyghur and Chinese-Kazakh monolingual data.
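The idea in the abstract can be illustrated with a minimal sketch. It assumes classical CCA computed with NumPy (the paper's notes point to the probabilistic formulation of Bach and Jordan, which is not reproduced here), row-aligned seed embeddings, and cosine nearest-neighbour retrieval in the shared space. The function names `fit_cca`, `project`, and `induce_lexicon`, the regularisation constant, and the retrieval step are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-6):
    """Fit classical CCA on seed pairs: row i of X translates to row i of Y."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularised covariance and cross-covariance estimates.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    # Whiten both sides with Cholesky factors: M = Lx^{-1} Cxy Ly^{-T}.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy)
    M = np.linalg.solve(Ly, M.T).T
    # Singular vectors of M give the canonical directions.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U[:, :k])      # source projection, shape (dx, k)
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])   # target projection, shape (dy, k)
    return mx, my, A, B

def project(E, mean, W):
    """Map embeddings into the shared space and length-normalise the rows."""
    Z = (E - mean) @ W
    return Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)

def induce_lexicon(src_emb, tgt_emb, model):
    """For each source word, return the index and cosine score of the best target word."""
    mx, my, A, B = model
    sims = project(src_emb, mx, A) @ project(tgt_emb, my, B).T
    return sims.argmax(axis=1), sims.max(axis=1)
```

In this sketch the seed lexicon is used only to fit the two projection matrices; every other word is then translated by looking up its nearest neighbour in the shared space, and the returned cosine scores could be thresholded to trade lexicon size against precision.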

Notes

  1. Word2vec: https://code.google.com/p/word2vec/.

  2. The derivation is omitted here; see Bach and Jordan (2005) for details.

  3. Scrapy: https://pypi.python.org/pypi/Scrapy/1.4.0.

  4. OpenCC: https://pypi.python.org/pypi/opencc-python/.

  5. Jieba: https://pypi.python.org/pypi/jieba/.

  6. BilBOWA: https://github.com/gouwsmeister/bilbowa.

References

  • Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31, 477–504 (2005)

  • Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR Workshop (2013a)

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013b)

  • Mikolov, T., Sutskever, I.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (2013)

  • Cao, H., Zhao, T., Zhang, S.: A distribution-based model to learn bilingual word embeddings. In: Proceedings of COLING (2016)

  • Bach, F.R., Jordan, M.I.: A probabilistic interpretation of canonical correlation analysis (2005)

  • Vulić, I., Moens, M.-F.: A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013)

  • Gouws, S., Bengio, Y., Corrado, G.: BilBOWA: fast bilingual distributed representations without word alignments. In: JMLR (2015)

  • Wushouer, M., Ishida, T., Lin, D.: Bilingual dictionary induction as an optimization problem. In: International Conference on Language Resources & Evaluation (2014)

  • Zhang, M., Peng, H., Liu, Y.: Bilingual lexicon induction from non-parallel data with minimal supervision. In: AAAI (2017)

  • Haghighi, A., Liang, P., Berg-Kirkpatrick, T.: Learning bilingual lexicons from monolingual corpora. In: ACL (2008)

  • Shi, T., Liu, Z., Liu, Y.: Learning cross-lingual word embeddings via matrix co-factorization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (2015)

  • Vulić, I., Kiela, D., Clark, S.: Multi-modal representations for improved bilingual lexicon learning. In: ACL (2016)

  • Vulić, I., Korhonen, A.: On the role of seed lexicons in learning bilingual word embeddings. In: ACL (2016)

  • Vulić, I., Moens, M.-F.: Probabilistic models of cross-lingual semantic similarity in context based on latent cross-lingual concepts induced from comparable data. In: EMNLP (2014)

  • Gouws, S., Søgaard, A.: Simple task-specific bilingual word embeddings. In: The 2015 Annual Conference of the North American Chapter of the ACL (2015)

  • Liu, X., Duh, K., Matsumoto, Y.: Topic models + word alignment = a flexible framework for extracting bilingual dictionary from comparable corpus (2013)

Acknowledgments

This work is supported by the Xinjiang Fund (Grant No. 2015KL031), the West Light Foundation of the Chinese Academy of Sciences (No. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (No. 2016A03007-3), and the Natural Science Foundation of Xinjiang (No. 2015211B034).

Corresponding author

Correspondence to YaTing Yang.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Zhu, S., Li, X., Yang, Y., Wang, L., Mi, C. (2018). Learning Bilingual Lexicon for Low-Resource Language Pairs. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2017. Lecture Notes in Computer Science, vol 10619. Springer, Cham. https://doi.org/10.1007/978-3-319-73618-1_66

  • DOI: https://doi.org/10.1007/978-3-319-73618-1_66

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73617-4

  • Online ISBN: 978-3-319-73618-1

  • eBook Packages: Computer Science, Computer Science (R0)
