Query Expansion for Mining Translation Knowledge from Comparable Data

Xiang, Lu; Zhou, Yu; Hao, Jie; Zhang, Dakun

doi:10.1007/978-3-319-12277-9_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8801)

Abstract

When mining parallel text from comparable corpora, we confront a vast search space, since parallel sentences or sub-sentential fragments can be scattered throughout the source and target corpora. To reduce the search space, most previous approaches have used heuristics to mine comparable documents first. However, such heuristics are available only in a few cases. We take a different direction and adopt the cross-language information retrieval (CLIR) framework to find translation candidates directly at the sentence level in a comparable corpus. In addition, to improve the retrieval results, we propose two simple but effective query expansion methods. Experimental results show that our query expansion methods improve recall significantly and yield high-quality candidate sentence pairs. Our methods therefore prepare the ground for the subsequent extraction of both parallel sentences and fragments.
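The abstract does not spell out the two query expansion methods, so the following is only a minimal sketch of the general idea of sentence-level CLIR with an expanded query, not the authors' implementation. It assumes a toy bilingual lexicon (`LEXICON`), a toy French-English sentence set, and a simple IDF-overlap ranking; the expansion shown (keeping the top-k dictionary translations per word instead of only the first) is an illustrative stand-in, and every name in it is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: source word -> candidate translations, best first.
LEXICON = {
    "chat": ["cat", "kitty"],
    "noir": ["black", "dark"],
    "dort": ["sleeps", "naps"],
}

def build_query(source_tokens, top_k):
    """Translate each source token into its top_k lexicon entries.

    top_k=1 gives a plain translated query; top_k>1 is the assumed
    expansion step (more translation alternatives per source word).
    """
    query = []
    for tok in source_tokens:
        query.extend(LEXICON.get(tok, [])[:top_k])
    return query

def idf_table(target_sentences):
    """Smoothed inverse document frequency over the target-side sentences."""
    n = len(target_sentences)
    df = Counter(w for sent in target_sentences for w in set(sent))
    return {w: math.log((n + 1) / (c + 1)) + 1.0 for w, c in df.items()}

def retrieve(query, target_sentences, n_best=1):
    """Rank target sentences by the summed IDF of the query terms they contain."""
    idf = idf_table(target_sentences)
    scored = []
    for idx, sent in enumerate(target_sentences):
        words = set(sent)
        score = sum(idf[w] for w in set(query) if w in words)
        if score > 0:
            scored.append((score, idx))
    # Highest score first; ties broken by sentence position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:n_best]]

# Toy target-side corpus; sentence 0 is the intended translation.
TARGET = [
    "the dark kitty naps".split(),
    "stock prices fell sharply".split(),
    "a dog sleeps outside".split(),
]
SOURCE = ["chat", "noir", "dort"]

baseline = retrieve(build_query(SOURCE, top_k=1), TARGET)
expanded = retrieve(build_query(SOURCE, top_k=2), TARGET)
print("baseline:", baseline, "expanded:", expanded)
```

In this toy setup the unexpanded query shares no words with the intended translation and retrieves only a partial match, whereas the expanded query ranks the intended sentence first. That is the kind of recall gain the abstract attributes to query expansion, though the paper's actual methods and scoring may differ.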

Author information

Authors and Affiliations

NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
Lu Xiang & Yu Zhou
Toshiba (China) R&D Center, China
Jie Hao & Dakun Zhang

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Haidian District, 100084, Beijing, China
Maosong Sun & Yang Liu
Chinese Academy of Sciences, Institute of Automation, 100190, Beijing, China
Jun Zhao

About this paper

Cite this paper

Xiang, L., Zhou, Y., Hao, J., Zhang, D. (2014). Query Expansion for Mining Translation Knowledge from Comparable Data. In: Sun, M., Liu, Y., Zhao, J. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2014, NLP-NABD 2014. Lecture Notes in Computer Science, vol 8801. Springer, Cham. https://doi.org/10.1007/978-3-319-12277-9_18

DOI: https://doi.org/10.1007/978-3-319-12277-9_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12276-2
Online ISBN: 978-3-319-12277-9
eBook Packages: Computer Science, Computer Science (R0)
