Mining Parallel Documents across Web Sites

Khanh, Pham Ngoc; Bao, Ho Tu

doi:10.1007/978-3-642-17187-1_52


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6458)

Included in the following conference series:

Asia Information Retrieval Symposium


Abstract

Most methods for building parallel corpora start from large-scale bilingual websites, a resource that is not available for many language pairs. In this paper we present a novel method for mining parallel documents between English and non-popular languages when the two versions are located on different websites. Our method is motivated by the observation that many news articles in non-popular languages are translated from popular English news websites. Given a news article in a non-popular language, the method searches for its original English version on another website using search engines. Experiments with English-Vietnamese show that our method can provide bilingual document pairs in the science domain with precision around 90%. Because its starting point is only a set of monolingual news articles, our method is more flexible and scalable than traditional approaches that collect parallel texts from multilingual websites. Furthermore, the method can be applied to mine parallel documents between pairs of non-popular languages with scarce resources.
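The abstract does not spell out how the search is carried out, so the following is only a minimal sketch of the general idea (translate key terms, query a search engine, keep the best-matching English page), not the authors' algorithm. Everything named here is an assumption made for illustration: the toy `lexicon`, the `search` callable (a stand-in for a real search engine client), the term-overlap score, and the `threshold` value.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_query(source_text: str, lexicon: Dict[str, List[str]],
                max_terms: int = 5) -> List[str]:
    """Translate source tokens with a bilingual lexicon (hypothetical) and
    keep the most frequent English terms as the query."""
    counts: Dict[str, int] = {}
    for token in tokenize(source_text):
        for english in lexicon.get(token, []):
            counts[english] = counts.get(english, 0) + 1
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return ranked[:max_terms]


def overlap_score(query_terms: List[str], candidate_text: str) -> float:
    """Fraction of query terms that occur in the candidate document."""
    if not query_terms:
        return 0.0
    candidate_tokens = set(tokenize(candidate_text))
    hits = sum(1 for term in query_terms if term in candidate_tokens)
    return hits / len(query_terms)


def find_english_original(
    source_text: str,
    lexicon: Dict[str, List[str]],
    search: Callable[[str], Iterable[Tuple[str, str]]],
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> Optional[Tuple[str, float]]:
    """Return (url, score) of the best candidate above the threshold,
    or None if no candidate qualifies."""
    terms = build_query(source_text, lexicon)
    best: Optional[Tuple[str, float]] = None
    # `search` yields (url, page_text) pairs for a query string.
    for url, text in search(" ".join(terms)):
        score = overlap_score(terms, text)
        if score >= threshold and (best is None or score > best[1]):
            best = (url, score)
    return best
```

In use, `search` would wrap whichever search engine API is available and return the fetched page text for each hit. A production system would likely need stronger similarity measures than term overlap, which is used here only to keep the sketch short.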



Author information

Authors and Affiliations

Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa 923-1292, Japan
Pham Ngoc Khanh & Ho Tu Bao


Editor information

Editors and Affiliations

Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan R.O.C.
Pu-Jen Cheng
School of Computing, National University of Singapore (NUS), Computing 1, 13 Computing Drive, 117417, Singapore
Min-Yen Kan
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
Wai Lam
School of Computing, Computing 1, National University of Singapore (NUS), 13 Computing Drive, 117417, Singapore
Preslav Nakov


About this paper

Cite this paper

Khanh, P.N., Bao, H.T. (2010). Mining Parallel Documents across Web Sites. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_52


DOI: https://doi.org/10.1007/978-3-642-17187-1_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17186-4
Online ISBN: 978-3-642-17187-1
eBook Packages: Computer Science, Computer Science (R0)
