Improving the Automatic Retrieval of Text Documents

  • Conference paper
Advances in Cross-Language Information Retrieval (CLEF 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2785)

Abstract

This paper reports on a statistical stemming algorithm based on link analysis. A word is viewed as a prefix (the stem) followed by a suffix, and the key idea is that interlinked prefixes and suffixes form a community of substrings. Discovering these communities therefore amounts to searching for the word splits that yield the best stems. The algorithm was used in our participation in the CLEF 2002 Italian monolingual task. The experimental results show that stemming improves text retrieval effectiveness. They also show that the effectiveness of our algorithm is comparable to that of an algorithm based on a priori linguistic knowledge.
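To make the idea of interlinked prefixes and suffixes concrete, the following is a minimal sketch and not the authors' implementation. It assumes a HITS-style mutual reinforcement (Kleinberg's hubs and authorities) over a bipartite graph whose links are all possible prefix/suffix splits of the vocabulary; the abstract only says "link analysis", so this choice is an assumption. The function names `split_graph`, `hits_scores`, and `best_split`, the `min_prefix` parameter, and the product-of-scores ranking of splits are all hypothetical.

```python
from collections import defaultdict


def split_graph(words, min_prefix=1):
    """Link every prefix to every suffix it co-occurs with in some word."""
    links = defaultdict(set)  # prefix -> suffixes
    back = defaultdict(set)   # suffix -> prefixes
    for word in words:
        for i in range(min_prefix, len(word)):
            prefix, suffix = word[:i], word[i:]
            links[prefix].add(suffix)
            back[suffix].add(prefix)
    return links, back


def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits_scores(links, back, iterations=50):
    """HITS-style iteration: good prefixes point to good suffixes and vice versa."""
    prefix_score = {p: 1.0 for p in links}
    suffix_score = {s: 1.0 for s in back}
    for _ in range(iterations):
        # A suffix is reinforced by the prefixes that link to it.
        suffix_score = _normalize(
            {s: sum(prefix_score[p] for p in back[s]) for s in back}
        )
        # A prefix is reinforced by the suffixes it links to.
        prefix_score = _normalize(
            {p: sum(suffix_score[s] for s in links[p]) for p in links}
        )
    return prefix_score, suffix_score


def best_split(word, prefix_score, suffix_score, min_prefix=1):
    """Pick the split whose prefix and suffix scores have the largest product.

    The product is an illustrative ranking rule, assumed for this sketch.
    """
    candidates = [
        (prefix_score.get(word[:i], 0.0) * suffix_score.get(word[i:], 0.0),
         word[:i], word[i:])
        for i in range(min_prefix, len(word))
    ]
    if not candidates:
        return word, ""  # too short to split
    _, prefix, suffix = max(candidates)
    return prefix, suffix


if __name__ == "__main__":
    vocabulary = ["parlare", "parlato", "parlavo", "cantare", "cantato", "cantavo"]
    links, back = split_graph(vocabulary)
    p_score, s_score = hits_scores(links, back)
    for w in vocabulary:
        print(w, best_split(w, p_score, s_score))
```

On a toy vocabulary this small the chosen splits are not necessarily linguistically meaningful. A realistic run would build the graph from the full word list of a collection, and would probably need a larger `min_prefix` or a different ranking rule to avoid favoring very short stems.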

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Agosti, M., Bacchin, M., Ferro, N., Melucci, M. (2003). Improving the Automatic Retrieval of Text Documents. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds) Advances in Cross-Language Information Retrieval. CLEF 2002. Lecture Notes in Computer Science, vol 2785. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45237-9_23

  • DOI: https://doi.org/10.1007/978-3-540-45237-9_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40830-7

  • Online ISBN: 978-3-540-45237-9

  • eBook Packages: Springer Book Archive
