Abstract
Obtaining bilingual parallel data from multilingual websites is a long-standing research problem, and solving it is especially beneficial for resource-scarce languages. In this paper, we present an approach for obtaining parallel data based on word embeddings; our model relies only on a small bilingual lexicon. The approach benefits from recent advances in continuous word representations, which capture more contextual information than traditional methods. Our experiments show that sizable, high-precision parallel Uyghur-Chinese data can be obtained even when bilingual lexicon resources are scarce.
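The abstract does not spell out the scoring procedure, so the following is only an illustrative sketch of one way a small bilingual lexicon and word embeddings could be combined to score a candidate sentence pair. It is not the authors' model. It assumes that source words are mapped to target words through the lexicon, that each sentence is represented by the average of its target-side word vectors, and that the pair is scored by cosine similarity. The function names, the toy tokens, and the three-dimensional vectors are all hypothetical.

```python
import math


def average_vector(words, embeddings):
    """Average the vectors of the words that have an embedding; None if none do."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_score(src_words, tgt_words, lexicon, tgt_embeddings):
    """Score a candidate sentence pair (assumed scheme, not the paper's)."""
    # Map source words into the target language through the small lexicon;
    # source words missing from the lexicon are simply skipped.
    translated = [lexicon[w] for w in src_words if w in lexicon]
    src_vec = average_vector(translated, tgt_embeddings)
    tgt_vec = average_vector(tgt_words, tgt_embeddings)
    if src_vec is None or tgt_vec is None:
        return 0.0
    return cosine(src_vec, tgt_vec)


# Toy target-side embeddings and a two-entry lexicon (hypothetical values).
tgt_embeddings = {
    "书": [0.9, 0.1, 0.0],
    "读": [0.7, 0.3, 0.1],
    "水": [0.0, 0.2, 0.9],
    "喝": [0.1, 0.3, 0.8],
}
lexicon = {"kitab": "书", "su": "水"}

good = pair_score(["kitab"], ["读", "书"], lexicon, tgt_embeddings)
bad = pair_score(["kitab"], ["喝", "水"], lexicon, tgt_embeddings)
print(good > bad)
```

In a setup like this, candidate pairs whose score exceeds a tuned threshold would be kept as parallel. Because words absent from the lexicon are skipped, the quality of the target-side embeddings carries most of the weight when the lexicon is small.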
Notes
- 1. Bitextor: https://sourceforge.net/projects/bitextor.
- 4. Word2vec: https://code.google.com/p/word2vec/.
Acknowledgments
This work is supported by the Xinjiang Fund (Grant No. 2015KL031), the West Light Foundation of the Chinese Academy of Sciences (No. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (No. 2016A03007-3), and the Natural Science Foundation of Xinjiang (No. 2015211B034).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhu, S., Li, X., Yang, Y., Wang, L., Mi, C. (2017). Harvest Uyghur-Chinese Aligned-Sentences Bitexts from Multilingual Sites Based on Word Embedding. In: Sun, M., Wang, X., Chang, B., Xiong, D. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2017, CCL 2017. Lecture Notes in Computer Science, vol 10565. Springer, Cham. https://doi.org/10.1007/978-3-319-69005-6_37
DOI: https://doi.org/10.1007/978-3-319-69005-6_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69004-9
Online ISBN: 978-3-319-69005-6
eBook Packages: Computer Science, Computer Science (R0)