Mining Parallel Documents Using Low Bandwidth and High Precision CLIR from the Heterogeneous Web

Shi, Simon; Fung, Pascale

doi:10.1007/978-3-642-20128-8_2


Abstract

We propose a content-based method of mining bilingual parallel documents from websites that are not necessarily structurally related to each other. There are two existing approaches for automatically mining parallel documents from the web. Structure-based methods work only for parallel websites, and most content-based methods either require large-scale computational facilities and network bandwidth or are not applicable to the heterogeneous web. We propose a novel content-based method that uses cross-lingual information retrieval (CLIR) with query feedback and verification, supplemented with structural information, to mine parallel resources from the entire web using search engine APIs. The method goes beyond structural information to find parallel documents on non-parallel websites. We obtained very high mining precision, and the extracted parallel sentences improved SMT performance: the resulting BLEU score is higher and comparable to that obtained with high-quality manually translated parallel sentences, which illustrates the excellent quality of the mined parallel material.
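The abstract describes the pipeline only at a high level, so the following is a minimal sketch of one way a CLIR loop with query feedback and verification could be organised, not the authors' implementation. Everything named here is an assumption made for illustration: the function names, the frequency-based keyword picker, the toy in-memory `index` that stands in for a search engine API, the bilingual `lexicon`, and the overlap threshold.

```python
from collections import Counter


def top_keywords(tokens, k=3):
    """Pick the k most frequent tokens (hypothetical stand-in for keyword extraction)."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def translate_query(keywords, lexicon):
    """Translate keywords with a bilingual lexicon, skipping unknown words."""
    return [lexicon[w] for w in keywords if w in lexicon]


def search(query_terms, index):
    """Toy search engine: return ids of documents containing every query term."""
    return [doc_id for doc_id, toks in index.items()
            if all(term in toks for term in query_terms)]


def overlap_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of translatable source words whose translation occurs in the target."""
    translatable = [w for w in set(src_tokens) if w in lexicon]
    if not translatable:
        return 0.0
    target_vocab = set(tgt_tokens)
    hits = sum(1 for w in translatable if lexicon[w] in target_vocab)
    return hits / len(translatable)


def mine_parallel(src_tokens, index, lexicon, k=3, threshold=0.6):
    """Return the id of the best verified candidate, or None if none passes."""
    keywords = top_keywords(src_tokens, k)
    candidates = []
    # Query feedback (assumed form): relax the query by dropping the
    # lowest-ranked keyword until the search returns at least one document.
    while keywords and not candidates:
        query = translate_query(keywords, lexicon)
        candidates = search(query, index) if query else []
        if not candidates:
            keywords = keywords[:-1]
    if not candidates:
        return None
    # Verification (assumed form): keep the candidate with the highest
    # lexicon overlap, and accept it only if it clears the threshold.
    best_score, best_id = max(
        (overlap_score(src_tokens, index[c], lexicon), c) for c in candidates
    )
    return best_id if best_score >= threshold else None
```

In a real setting the `search` function would wrap a search engine API call and the verification step would use a stronger similarity measure; the sketch only shows how retrieval, feedback, and verification fit together while fetching few documents per source page.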


Notes

1. We knew the web was big... on the Official Google Blog. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.
2. Source: http://cn.reuters.com/article/CNTechNews/idCNCHINA-3233720101027 on May 10, 2011.
3. http://www.elias.cn/En/ExtMainText/
4. LDC Catalog Number: LDC2002L27.


Acknowledgments

This project is partially funded by a subcontract from BBN, under the DARPA GALE project.

Author information

Authors and Affiliations

Human Language Technology Center, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Simon Shi & Pascale Fung


Corresponding author

Correspondence to Simon Shi.

Editor information

Editors and Affiliations

Centre for Translation Studies, University of Leeds, Leeds, United Kingdom
Serge Sharoff
University of Mainz, Mainz, Germany
Reinhard Rapp
Université de Paris-Sud LIMSI-CNRS, Orsay, France
Pierre Zweigenbaum
Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Pascale Fung


About this chapter

Cite this chapter

Shi, S., Fung, P. (2013). Mining Parallel Documents Using Low Bandwidth and High Precision CLIR from the Heterogeneous Web. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_2


DOI: https://doi.org/10.1007/978-3-642-20128-8_2
Published: 14 December 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)
