Learning Bilingual Lexicon for Low-Resource Language Pairs

Zhu, ShaoLin; Li, Xiao; Yang, YaTing; Wang, Lei; Mi, ChengGang

doi:10.1007/978-3-319-73618-1_66

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10619)

Included in the following conference series:

National CCF Conference on Natural Language Processing and Chinese Computing

Abstract

Learning bilingual lexicons from monolingual data is a novel idea in natural language processing that can benefit many low-resource language pairs. In this paper, we present an approach for obtaining a bilingual lexicon from monolingual data. Our method requires only a small seed bilingual lexicon and uses Canonical Correlation Analysis (CCA) to construct a shared latent space that explains how the two monolingual embeddings are linked. Experimental results show that a bilingual lexicon of considerable precision and size can be learned from Chinese-Uyghur and Chinese-Kazakh monolingual data.
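The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a standard CCA solved by whitening plus SVD, cosine nearest-neighbour retrieval in the shared space, and synthetic embeddings in place of real Chinese, Uyghur, or Kazakh vectors. All function and variable names (`fit_cca`, `project`, `translate`, `n_seed`) are hypothetical.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return (V / np.sqrt(w)) @ V.T

def fit_cca(X, Y, k, reg=1e-8):
    """Fit CCA on seed pairs: row i of X and row i of Y are translations."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular vectors of the whitened cross-covariance give the
    # canonical directions; singular values are the correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return {"mx": mx, "my": my, "A": Kx @ U[:, :k], "B": Ky @ Vt.T[:, :k], "corr": s[:k]}

def project(E, mean, P):
    """Map embeddings into the shared space and L2-normalise the rows."""
    Z = (E - mean) @ P
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

def translate(src, tgt, model):
    """For each source vector, return the index of the closest target vector."""
    S = project(src, model["mx"], model["A"])
    T = project(tgt, model["my"], model["B"])
    return np.argmax(S @ T.T, axis=1)

# Synthetic stand-in for two monolingual embedding spaces (assumption):
# each "language" is a noisy random rotation of one shared latent space.
rng = np.random.default_rng(0)
n_words, n_seed, dim = 300, 200, 8
latent = rng.normal(size=(n_words, dim))
rot_x = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
rot_y = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
X = latent @ rot_x + 0.01 * rng.normal(size=(n_words, dim))
Y = latent @ rot_y + 0.01 * rng.normal(size=(n_words, dim))

# Fit on the seed lexicon only, then induce translations for unseen words.
model = fit_cca(X[:n_seed], Y[:n_seed], k=dim)
pred = translate(X[n_seed:], Y, model)
accuracy = float(np.mean(pred == np.arange(n_seed, n_words)))
print(f"held-out precision@1: {accuracy:.2f}")
```

In practice `X` and `Y` would be word vectors trained separately on each monolingual corpus, and the seed rows would come from the small bilingual lexicon. The number of retained components `k` and the regulariser `reg` are tuning choices in this sketch, not values taken from the paper.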

Notes

1. Word2vec: https://code.google.com/p/word2vec/.
2. We omit the derivation; for details, see Bach and Jordan (2005).
3. Scrapy: https://pypi.python.org/pypi/Scrapy/1.4.0.
4. OpenCC: https://pypi.python.org/pypi/opencc-python/.
5. Jieba: https://pypi.python.org/pypi/jieba/.
6. BilBOWA: https://github.com/gouwsmeister/bilbowa.

References

Munteanu, D.S., Marcu, D.: Improving machine translation performance by exploiting non-parallel corpora. Comput. Linguist. 31, 477–504 (2005)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR Workshop (2013a)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013b)
Mikolov, T., Sutskever, I.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (2013)
Cao, H., Zhao, T., Zhang, S.: A distribution-based model to learn bilingual word embeddings. In: Proceedings of COLING (2016)
Bach, F.R., Jordan, M.I.: A probabilistic interpretation of canonical correlation analysis (2005)
Vulić, I., Moens, M.-F.: A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013)
Gouws, S., Bengio, Y., Corrado, G.: BilBOWA: fast bilingual distributed representations without word alignments. In: JMLR (2015)
Wushouer, M., Ishida, T., Lin, D.: Bilingual dictionary induction as an optimization problem. In: International Conference on Language Resources & Evaluation (2014)
Zhang, M., Peng, H., Liu, Y.: Bilingual lexicon induction from non-parallel data with minimal supervision. In: AAAI (2017)
Haghighi, A., Liang, P., Berg-Kirkpatrick, T.: Learning bilingual lexicons from monolingual corpora. In: ACL (2008)
Shi, T., Liu, Z., Liu, Y.: Learning cross-lingual word embeddings via matrix co-factorization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (2015)
Vulić, I., Kiela, D., Clark, S.: Multi-modal representations for improved bilingual lexicon learning. In: ACL (2016)
Vulić, I., Korhonen, A.: On the role of seed lexicons in learning bilingual word embeddings. In: ACL (2016)
Vulić, I., Moens, M.-F.: Probabilistic models of cross-lingual semantic similarity in context based on latent cross-lingual concepts induced from comparable data. In: EMNLP (2014)
Gouws, S., Søgaard, A.: Simple task-specific bilingual word embeddings. In: The 2015 Annual Conference of the North American Chapter of the ACL (2015)
Liu, X., Duh, K., Matsumoto, Y.: Topic models + word alignment = a flexible framework for extracting bilingual dictionary from comparable corpus (2013)

Acknowledgments

This work is supported by the Xinjiang Fund under Grant No. 2015KL031, the West Light Foundation of the Chinese Academy of Sciences (No. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (No. 2016A03007-3), and the Natural Science Foundation of Xinjiang (No. 2015211B034).

Author information

Authors and Affiliations

The Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, China
ShaoLin Zhu, Xiao Li, YaTing Yang, Lei Wang & ChengGang Mi
Key Laboratory of Speech Language Information Processing of Xinjiang, Urumqi, China
ShaoLin Zhu, Xiao Li, YaTing Yang, Lei Wang & ChengGang Mi
University of Chinese Academy of Sciences, Beijing, China
ShaoLin Zhu

Corresponding author

Correspondence to YaTing Yang.

Editor information

Editors and Affiliations

Fudan University, Shanghai, China
Xuanjing Huang
Singapore Management University, Singapore, Singapore
Jing Jiang
Peking University, Beijing, China
Dongyan Zhao
Peking University, Beijing, China
Yansong Feng
Soochow University, Suzhou, China
Yu Hong

About this paper

Cite this paper

Zhu, S., Li, X., Yang, Y., Wang, L., Mi, C. (2018). Learning Bilingual Lexicon for Low-Resource Language Pairs. In: Huang, X., Jiang, J., Zhao, D., Feng, Y., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2017. Lecture Notes in Computer Science, vol 10619. Springer, Cham. https://doi.org/10.1007/978-3-319-73618-1_66

DOI: https://doi.org/10.1007/978-3-319-73618-1_66
Published: 05 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73617-4
Online ISBN: 978-3-319-73618-1
eBook Packages: Computer Science, Computer Science (R0)
