Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding

Zhu, ShaoLin; Li, Xiao; Yang, YaTing; Wang, Lei; Mi, ChengGang

doi:10.1007/978-3-319-69005-6_37

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10565)

Abstract

Obtaining bilingual parallel data from multilingual websites is a long-standing research problem, and solving it is especially beneficial for resource-scarce languages. In this paper, we present an approach for obtaining parallel data based on word embeddings; our model relies only on a small bilingual lexicon. Our approach benefits from recent advances in continuous word representations, which capture more contextual information than traditional methods. Our experiments show that sizable, high-precision parallel Uyghur-Chinese data can be obtained even when bilingual lexicon resources are limited.
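The abstract does not spell out the model, so the following is only a minimal sketch of one generic way that monolingual word embeddings and a small seed lexicon can be combined to score candidate sentence pairs. It is not the authors' method. Each word is described by its cosine similarity to the seed-lexicon words of its own language, which gives both languages a shared representation with one coordinate per lexicon entry. The tokens (`u_book`, `c_book`, and so on), the 2-D vectors, and the function names are all hypothetical placeholders; real embeddings would come from a tool such as Word2vec.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sentence_profile(tokens, emb, anchors):
    """Average, over known tokens, of each token's similarity to every anchor word."""
    rows = [[cosine(emb[t], emb[a]) for a in anchors] for t in tokens if t in emb]
    if not rows:
        return [0.0] * len(anchors)
    return [sum(col) / len(rows) for col in zip(*rows)]

def pair_score(src_tokens, tgt_tokens, src_emb, tgt_emb, lexicon):
    """Score a candidate sentence pair by comparing anchor-similarity profiles."""
    # Keep only lexicon entries that have an embedding on both sides.
    usable = [(s, t) for s, t in lexicon if s in src_emb and t in tgt_emb]
    src_anchors = [s for s, _ in usable]
    tgt_anchors = [t for _, t in usable]
    return cosine(sentence_profile(src_tokens, src_emb, src_anchors),
                  sentence_profile(tgt_tokens, tgt_emb, tgt_anchors))

# Toy data (assumed): two independently trained 2-D embedding spaces.
SRC_EMB = {"u_book": [1.0, 0.0], "u_water": [0.0, 1.0], "u_read": [0.9, 0.1]}
TGT_EMB = {"c_book": [0.0, 1.0], "c_water": [1.0, 0.0],
           "c_read": [0.1, 0.9], "c_drink": [0.9, 0.1]}
LEXICON = [("u_book", "c_book"), ("u_water", "c_water")]  # small seed lexicon

good = pair_score(["u_read", "u_book"], ["c_read", "c_book"], SRC_EMB, TGT_EMB, LEXICON)
bad = pair_score(["u_read", "u_book"], ["c_drink", "c_water"], SRC_EMB, TGT_EMB, LEXICON)
```

In this toy setup `good` comes out higher than `bad`, so a threshold on the score could serve as a filter for candidate pairs. With real data, the threshold and the size of the seed lexicon would need to be tuned empirically.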

Notes

1. Bitextor: https://sourceforge.net/projects/bitextor.
2. ILSP_FC: http://nlp.ilsp.gr/redmine/projects/ilsp-fc.
3. Scrapy: https://pypi.python.org/pypi/Scrapy/1.4.0.
4. Word2vec: https://code.google.com/p/word2vec/.
5. Scrapy: https://pypi.python.org/pypi/Scrapy/1.4.0.
6. OpenCC: https://pypi.python.org/pypi/opencc-python/.
7. Jieba: https://pypi.python.org/pypi/jieba/.

Acknowledgments

This work is supported by the Xinjiang Fund (Grant No. 2015KL031), the West Light Foundation of the Chinese Academy of Sciences (No. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (No. 2016A03007-3), and the Natural Science Foundation of Xinjiang (No. 2015211B034).

Author information

Authors and Affiliations

The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi, China
ShaoLin Zhu, Xiao Li, YaTing Yang, Lei Wang & ChengGang Mi
Key Laboratory of Speech Language Information Processing of Xinjiang, Urumqi, China
ShaoLin Zhu, Xiao Li, YaTing Yang, Lei Wang & ChengGang Mi
University of Chinese Academy of Sciences, Beijing, China
ShaoLin Zhu

Corresponding author

Correspondence to YaTing Yang.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Maosong Sun
Beijing University of Posts and Telecommunications, Beijing, China
Xiaojie Wang
Peking University, Beijing, China
Baobao Chang
Soochow University, Suzhou, China
Deyi Xiong

Cite this paper

Zhu, S., Li, X., Yang, Y., Wang, L., Mi, C. (2017). Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding. In: Sun, M., Wang, X., Chang, B., Xiong, D. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2017, NLP-NABD 2017. Lecture Notes in Computer Science, vol 10565. Springer, Cham. https://doi.org/10.1007/978-3-319-69005-6_37

DOI: https://doi.org/10.1007/978-3-319-69005-6_37
Published: 07 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69004-9
Online ISBN: 978-3-319-69005-6
eBook Packages: Computer Science (R0)
