N-Grams for Translation and Retrieval in CL-SDR

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3237)

Abstract

We report on our first attempt at cross-language spoken document retrieval. Having no prior experience with monolingual speech retrieval, we applied the same general approach we use for bilingual retrieval, which is typified by overlapping character n-grams for tokenization and a statistical language model for retrieval. We adopted an innovative approach to coping with out-of-vocabulary words and with misspelled or mistranscribed words: direct translation of individual n-grams was the sole mechanism for translating source-language queries into target-language terms. Though this approach shows promise, especially for non-speech retrieval, our performance appears to lag behind that of other teams participating in this novel evaluation.
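As a rough illustration of the two ingredients named in the abstract, the following Python sketch tokenizes text into overlapping character n-grams and ranks documents with a linearly smoothed unigram language model. It is a minimal sketch under stated assumptions, not the authors' system: the abstract does not give the n-gram length, the smoothing method, or any parameter values, so `n=4`, `alpha=0.5`, the padding and lowercasing steps, and the function names `char_ngrams` and `lm_score` are all hypothetical choices made for this example. N-gram translation is not covered here.

```python
import math
from collections import Counter


def char_ngrams(text, n=4):
    """Return overlapping character n-grams of `text`.

    Assumed preprocessing (not from the paper): lowercase, collapse
    whitespace, and pad with one space on each side so that word
    boundaries are visible to the n-grams.
    """
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def lm_score(query, doc, collection, n=4, alpha=0.5):
    """Log-probability of the query n-grams under a smoothed document model.

    Each query n-gram g contributes
        log(alpha * P(g | doc) + (1 - alpha) * P(g | collection)).
    N-grams that never occur in the collection are skipped, which is a
    simplifying assumption of this sketch.
    """
    doc_counts = Counter(char_ngrams(doc, n))
    coll_counts = Counter()
    for text in collection:
        coll_counts.update(char_ngrams(text, n))
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())

    score = 0.0
    for gram in char_ngrams(query, n):
        if coll_counts[gram] == 0:
            continue  # unseen everywhere: carries no ranking evidence
        p_doc = doc_counts[gram] / doc_len
        p_coll = coll_counts[gram] / coll_len
        score += math.log(alpha * p_doc + (1 - alpha) * p_coll)
    return score


if __name__ == "__main__":
    docs = ["spoken document retrieval", "statistical machine translation"]
    print(char_ngrams("spoken"))
    for d in docs:
        print(d, lm_score("spoken documents", d, docs))
```

Because matching happens on n-grams rather than whole words, the query "spoken documents" still overlaps heavily with "spoken document retrieval" even though "documents" never appears as a word. The same property is what makes n-grams attractive for misspelled or mistranscribed terms.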




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

McNamee, P., Mayfield, J. (2004). N-Grams for Translation and Retrieval in CL-SDR. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds) Comparative Evaluation of Multilingual Information Access Systems. CLEF 2003. Lecture Notes in Computer Science, vol 3237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30222-3_63

  • DOI: https://doi.org/10.1007/978-3-540-30222-3_63

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24017-4

  • Online ISBN: 978-3-540-30222-3

  • eBook Packages: Springer Book Archive
