Mining Parallel Documents across Web Sites

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6458)

Abstract

Most methods for building parallel corpora start from large-scale bilingual websites, which are not available for many language pairs. In this paper we present a novel method to mine parallel documents between English and less popular languages when the two versions are located on different websites. Our method is motivated by the observation that many news articles in less popular languages are translated from popular English news websites. Given a news article in such a language, the method searches for its original English version on another website using search engines. Experiments with English-Vietnamese show that our method can provide bilingual document pairs in the science domain with a precision of around 90%. Because its starting point is only a set of monolingual news articles, our method is more flexible and scalable than traditional approaches that collect parallel texts from multilingual websites. Furthermore, the method can be applied to mine parallel documents for other language pairs with scarce resources.
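The abstract does not spell out how the search queries are built or how candidate pages are judged, so the following is only a minimal sketch of the general idea under stated assumptions: frequent source-language terms are translated with a small bilingual dictionary to form an English query, and candidate English pages (which a search engine would return) are ranked by how many query terms they contain. The function names `build_query` and `rank_candidates`, the toy dictionary, and the overlap score are all hypothetical and are not taken from the paper.

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Lowercase, NFC-normalize, and split text into word tokens."""
    return re.findall(r"\w+", unicodedata.normalize("NFC", text).lower())


def build_query(doc, dictionary, max_terms=5):
    """Translate the most frequent dictionary-covered terms of `doc`.

    `dictionary` is a hypothetical source-to-English word map; a real
    system would need a much larger lexicon and multi-word handling.
    """
    lookup = {unicodedata.normalize("NFC", k).lower(): v
              for k, v in dictionary.items()}
    counts = Counter(t for t in tokenize(doc) if t in lookup)
    return [lookup[t] for t, _ in counts.most_common(max_terms)]


def rank_candidates(query_terms, candidates):
    """Rank candidate pages by the fraction of query terms they contain.

    `candidates` maps a URL to page text; in practice these would come
    from a search engine queried with `query_terms` (assumption).
    """
    query = {t.lower() for t in query_terms}
    scored = []
    for url, text in candidates.items():
        overlap = len(query & set(tokenize(text)))
        scored.append((overlap / len(query) if query else 0.0, url))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    doc = "Nước trên sao Hỏa: tàu thăm dò tìm thấy nước"
    dictionary = {"nước": "water", "tàu": "spacecraft"}
    candidates = {
        "a": "Spacecraft finds water on Mars",
        "b": "Stock markets fall",
    }
    query = build_query(doc, dictionary)
    print(query)
    print(rank_candidates(query, candidates))
```

A simple overlap score is used here only to keep the sketch short; deciding whether a candidate is truly a translation of the source article would require a stronger similarity measure than this.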

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Khanh, P.N., Bao, H.T. (2010). Mining Parallel Documents across Web Sites. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_52

  • DOI: https://doi.org/10.1007/978-3-642-17187-1_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-17186-4

  • Online ISBN: 978-3-642-17187-1

  • eBook Packages: Computer Science, Computer Science (R0)