Abstract
We propose a content-based method for mining bilingual parallel documents from websites that are not necessarily structurally related to each other. There are two existing approaches to automatically mining parallel documents from the web: structure-based and content-based. Structure-based methods work only for parallel websites, and most content-based methods either require large-scale computational facilities and network bandwidth or are not applicable to the heterogeneous web. Our method uses cross-lingual information retrieval (CLIR) with query feedback and verification, supplemented with structural information, to mine parallel resources from the entire web using search engine APIs. The method goes beyond structural information to find parallel documents on non-parallel websites. We obtained very high mining precision, and the extracted parallel sentences improved SMT performance, yielding a higher BLEU score that is comparable to that obtained with high-quality manually translated parallel sentences, which illustrates the excellent quality of the mined parallel material.
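The abstract names three ingredients: a cross-lingual query, query feedback, and verification of candidates. The following is a minimal Python sketch of how such a loop could fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the toy `LEXICON`, the in-memory `CORPUS` that stands in for a search engine API, the rule of dropping the last keyword when a query fails, and the dictionary-overlap score with its 0.6 threshold.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy bilingual lexicon (English -> Spanish), for illustration only.
LEXICON: Dict[str, List[str]] = {
    "bank": ["banco"],
    "raised": ["subio"],
    "interest": ["interes"],
    "rates": ["tasas"],
    "profit": ["beneficio"],
}

# Hypothetical in-memory "web": URL -> tokenized target-language document.
CORPUS: Dict[str, List[str]] = {
    "u1": ["el", "banco", "subio", "las", "tasas", "de", "interes"],
    "u2": ["banco", "de", "alimentos"],
}

SearchFn = Callable[[List[str]], List[Tuple[str, List[str]]]]


def search(query: List[str]) -> List[Tuple[str, List[str]]]:
    """Stand-in for a search engine API: return documents containing every query word."""
    return [(url, doc) for url, doc in CORPUS.items() if all(w in doc for w in query)]


def top_keywords(tokens: List[str], k: int) -> List[str]:
    """Pick the k most frequent source words that the lexicon can translate."""
    counts = Counter(t for t in tokens if t in LEXICON)
    return [w for w, _ in counts.most_common(k)]


def overlap_score(src_tokens: List[str], tgt_tokens: List[str]) -> float:
    """Fraction of translatable source words with a translation present in the target."""
    translatable = {t for t in src_tokens if t in LEXICON}
    if not translatable:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for w in translatable if any(tr in tgt for tr in LEXICON[w]))
    return hits / len(translatable)


def mine_parallel(src_tokens: List[str], search_fn: SearchFn,
                  k: int = 5, threshold: float = 0.6) -> Optional[str]:
    """Query in the target language, relax the query on failure, verify candidates."""
    query = [LEXICON[w][0] for w in top_keywords(src_tokens, k)]
    while query:
        scored = [(overlap_score(src_tokens, doc), url) for url, doc in search_fn(query)]
        if scored:
            best_score, best_url = max(scored)
            if best_score >= threshold:  # verification step (assumed criterion)
                return best_url
        query = query[:-1]  # feedback step (assumed): drop the last keyword and retry
    return None


SRC = ["the", "bank", "raised", "interest", "rates", "bank", "profit"]
result = mine_parallel(SRC, search)
```

In this toy run the first five-word query matches nothing because no document contains "beneficio"; after one relaxation the query retrieves `u1`, which passes verification with an overlap of 4 out of 5 translatable words. Issuing only a handful of short queries per source document is what would keep bandwidth low in a design like this, since no site-wide crawl is needed.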
Notes
- 1.
We knew the web was big... on the Official Google Blog. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.
- 2.
Source: http://cn.reuters.com/article/CNTechNews/idCNCHINA-3233720101027 on May 10, 2011.
- 3.
- 4.
LDC Catalog Number: LDC2002L27.
Acknowledgments
This project is partially funded by a subcontract from BBN, under the DARPA GALE project.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Shi, S., Fung, P. (2013). Mining Parallel Documents Using Low Bandwidth and High Precision CLIR from the Heterogeneous Web. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_2
DOI: https://doi.org/10.1007/978-3-642-20128-8_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)