Effects of Aligned Corpus Quality and Size in Corpus-Based CLIR

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4956)

Abstract

Aligned corpora are widely used resources in cross-language information retrieval (CLIR) systems. The three properties of translation corpora that most strongly affect the performance of a corpus-based CLIR system are (1) topical nearness to the translated queries, (2) the quality of the alignments, and (3) the size of the corpus. In this paper, the effects of these factors are studied and evaluated. Topics from two different domains (news and genomics) are translated with corpora of varying alignment quality, ranging from a clean parallel corpus to noisier comparable corpora. The sizes of the corpora are also varied. The results show that of the three properties, topical nearness is the most crucial factor, outweighing both of the others. This indicates that noisy comparable corpora should be used as complementary resources when parallel corpora are not available for the domain in question.

Editor information

Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, Ryen W. White

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Talvensaari, T. (2008). Effects of Aligned Corpus Quality and Size in Corpus-Based CLIR. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol 4956. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78646-7_13

  • DOI: https://doi.org/10.1007/978-3-540-78646-7_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78645-0

  • Online ISBN: 978-3-540-78646-7

  • eBook Packages: Computer Science, Computer Science (R0)
