Abstract
Most methods for building parallel corpora start from large-scale bilingual websites, a resource that is not available for many language pairs. In this paper we present a novel method for mining parallel documents between English and non-popular languages when the two versions are located on different websites. Our method is motivated by the observation that many news articles in non-popular languages are translated from popular English news websites. Given a news article in a non-popular language, the method uses search engines to find its original English version on another website. Experiments with English-Vietnamese show that our method can provide bilingual document pairs in the science domain with a precision of around 90%. Because its starting point is only a set of monolingual news articles, our method is more flexible and scalable than traditional approaches that collect parallel texts from multilingual websites. Furthermore, the method can be applied to mine parallel documents for non-popular language pairs with scarce resources.
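The abstract does not spell out how the English original is located, so the following is only a minimal sketch of the general idea under stated assumptions: query terms are built from a tiny hypothetical bilingual lexicon plus tokens that tend to survive translation unchanged (numbers and capitalized ASCII names), and an in-memory dictionary of candidate pages stands in for search engine results. The names `TOY_LEXICON`, `build_query`, `score`, `best_match`, the threshold, and the example URLs are all illustrative and are not taken from the paper.

```python
import re
import unicodedata

# Hypothetical toy Vietnamese -> English lexicon (assumption, illustration only).
TOY_LEXICON = {
    unicodedata.normalize("NFC", vi): en
    for vi, en in {"nước": "water", "băng": "ice", "tàu": "spacecraft"}.items()
}

def words(text):
    """Split NFC-normalized text into word tokens, keeping original case."""
    return re.findall(r"\w+", unicodedata.normalize("NFC", text))

def build_query(source_text, lexicon):
    """Collect English query terms: lexicon translations, plus numbers and
    capitalized ASCII tokens, which often appear unchanged in both versions.
    Sentence-initial ASCII words may slip in as noise in this crude heuristic."""
    terms = set()
    for raw in words(source_text):
        low = raw.lower()
        if low in lexicon:
            terms.add(lexicon[low])
        elif raw.isdigit() or (raw.isascii() and raw[0].isupper()):
            terms.add(low)
    return terms

def score(query_terms, candidate_text):
    """Fraction of query terms that occur in the candidate document."""
    if not query_terms:
        return 0.0
    candidate_tokens = {w.lower() for w in words(candidate_text)}
    return len(query_terms & candidate_tokens) / len(query_terms)

def best_match(source_text, candidates, lexicon, threshold=0.5):
    """Return the URL of the best-scoring candidate, or None if no candidate
    reaches the (arbitrarily chosen) threshold."""
    query = build_query(source_text, lexicon)
    scored = [(score(query, text), url) for url, text in candidates.items()]
    top_score, top_url = max(scored, default=(0.0, None))
    return top_url if top_score >= threshold else None

# Hypothetical input article and candidate pages (stand-ins for search results).
SOURCE = "Tàu Phoenix của NASA tìm thấy băng nước năm 2008"
CANDIDATES = {
    "https://example.com/phoenix-ice": "NASA's Phoenix spacecraft found water ice in 2008",
    "https://example.com/shuttle": "NASA delays shuttle launch",
}

print(best_match(SOURCE, CANDIDATES, TOY_LEXICON))
```

A real pipeline would replace `CANDIDATES` with pages returned by a search engine and would need a much stronger verification step than term overlap before accepting a document pair.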
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Khanh, P.N., Bao, H.T. (2010). Mining Parallel Documents across Web Sites. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_52
DOI: https://doi.org/10.1007/978-3-642-17187-1_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17186-4
Online ISBN: 978-3-642-17187-1
eBook Packages: Computer Science, Computer Science (R0)