A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages

  • Conference paper
Multilingual and Multimodal Information Access Evaluation (CLEF 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6360)

Abstract

We present StaLe, a dictionary- and corpus-independent statistical lemmatizer that addresses the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any inflected word form. StaLe can be applied with little effort to languages lacking linguistic resources. We evaluate StaLe both in lemmatization tasks alone and as a component in an IR system, using several datasets and query types in four high resource languages. StaLe is competitive, reaching 88–108% of the gold standard performance of a commercial lemmatizer in IR experiments. In addition to its competitive performance, it is compact, efficient and fast to apply to new languages.

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Loponen, A., Järvelin, K. (2010). A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_3

  • DOI: https://doi.org/10.1007/978-3-642-15998-5_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15997-8

  • Online ISBN: 978-3-642-15998-5

  • eBook Packages: Computer Science, Computer Science (R0)
