Abstract
This paper reports on a statistical stemming algorithm based on link analysis. Treating each word as the concatenation of a prefix (the stem) and a suffix, the key idea is that prefixes and suffixes, linked to one another through the words they form, make up a community of sub-strings. Discovering these communities therefore amounts to searching for the word splits that yield the best stems. We used the algorithm in our participation in the CLEF 2002 Italian monolingual task. The experimental results show that stemming improves text retrieval effectiveness, and that the effectiveness of our algorithm is comparable to that of an algorithm based on a-priori linguistic knowledge.
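The idea of scoring prefixes and suffixes through their mutual links can be illustrated with a minimal sketch. The abstract does not specify the scoring procedure, so the code below assumes a HITS-style mutual reinforcement over a bipartite prefix-suffix graph; the function names (`candidate_splits`, `score_substrings`, `stem`), the iteration count, and the product-of-scores criterion are all illustrative assumptions, not the method reported in the paper.

```python
from collections import defaultdict
from math import sqrt


def candidate_splits(word):
    """Return every (prefix, suffix) pair with a non-empty prefix and suffix."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def score_substrings(words, iterations=50):
    """Score prefixes and suffixes by mutual reinforcement (assumed, HITS-like).

    A prefix scores highly when it links to high-scoring suffixes, and vice versa.
    """
    # Each candidate split of each word is one prefix -> suffix link.
    links = set()
    for word in words:
        links.update(candidate_splits(word))
    if not links:
        return {}, {}

    prefix_score = {p: 1.0 for p, _ in links}
    suffix_score = {s: 1.0 for _, s in links}

    for _ in range(iterations):
        # Update suffix scores from the prefixes that link to them.
        acc = defaultdict(float)
        for p, s in links:
            acc[s] += prefix_score[p]
        norm = sqrt(sum(v * v for v in acc.values()))
        suffix_score = {s: v / norm for s, v in acc.items()}

        # Update prefix scores from the suffixes they link to.
        acc = defaultdict(float)
        for p, s in links:
            acc[p] += suffix_score[s]
        norm = sqrt(sum(v * v for v in acc.values()))
        prefix_score = {p: v / norm for p, v in acc.items()}

    return prefix_score, suffix_score


def stem(word, prefix_score, suffix_score):
    """Pick the split whose prefix and suffix scores have the largest product."""
    splits = candidate_splits(word)
    if not splits:
        return word  # one-letter words cannot be split
    best_prefix, _ = max(
        splits,
        key=lambda ps: prefix_score.get(ps[0], 0.0) * suffix_score.get(ps[1], 0.0),
    )
    return best_prefix


if __name__ == "__main__":
    vocabulary = ["parlare", "parlato", "parlando",
                  "mangiare", "mangiato", "mangiando"]
    p_scores, s_scores = score_substrings(vocabulary)
    for w in vocabulary:
        print(w, "->", stem(w, p_scores, s_scores))
```

With a vocabulary this small the chosen splits are not expected to be linguistically meaningful; the sketch only shows how the links between sub-strings, rather than any language-specific rule, drive the choice of the split.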
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Agosti, M., Bacchin, M., Ferro, N., Melucci, M. (2003). Improving the Automatic Retrieval of Text Documents. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds) Advances in Cross-Language Information Retrieval. CLEF 2002. Lecture Notes in Computer Science, vol 2785. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45237-9_23
DOI: https://doi.org/10.1007/978-3-540-45237-9_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40830-7
Online ISBN: 978-3-540-45237-9
eBook Packages: Springer Book Archive