Improving the Automatic Retrieval of Text Documents

Agosti, Maristella; Bacchin, Michela; Ferro, Nicola; Melucci, Massimo

doi:10.1007/978-3-540-45237-9_23

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2785)

Included in the following conference series:

Workshop of the Cross-Language Evaluation Forum for European Languages

Abstract

This paper reports on a statistical stemming algorithm based on link analysis. Considering that a word is formed by a prefix (stem) and a suffix, the key idea is that the interlinked prefixes and suffixes form a community of sub-strings. Thus, discovering these communities means searching for the best word splits that give the best word stems. The algorithm has been used in our participation in the CLEF 2002 Italian monolingual task. The experimental results show that stemming improves text retrieval effectiveness. They also show that the effectiveness level of our algorithm is comparable to that of an algorithm based on a-priori linguistic knowledge.
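The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: every split of a word is treated as a link from a prefix to a suffix, and a HITS-style mutual reinforcement (in the spirit of Kleinberg's hubs and authorities) scores prefixes and suffixes against each other. The function names, the iteration count, and the choice of the product of the two scores as the split criterion are illustrative assumptions, not the authors' published method. The sketch also makes no attempt to correct for biases that plain link analysis may introduce, such as a preference for short substrings, so it should not be expected to return linguistically accurate stems.

```python
from math import sqrt


def candidate_splits(word):
    """Yield every (prefix, suffix) split with both parts non-empty."""
    for i in range(1, len(word)):
        yield word[:i], word[i:]


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {key: v / norm for key, v in scores.items()}


def score_substrings(words, iterations=20):
    """Assumed HITS-style scoring on the prefix-suffix bipartite graph.

    A link (prefix, suffix) exists when prefix + suffix is a word in the
    vocabulary. Suffix scores are sums of the scores of linked prefixes,
    and prefix scores are sums of the scores of linked suffixes.
    """
    links = sorted({s for w in set(words) for s in candidate_splits(w)})
    prefix_score = {p: 1.0 for p, _ in links}
    suffix_score = {s: 1.0 for _, s in links}
    for _ in range(iterations):
        new_suffix = dict.fromkeys(suffix_score, 0.0)
        for p, s in links:
            new_suffix[s] += prefix_score[p]
        suffix_score = normalize(new_suffix)

        new_prefix = dict.fromkeys(prefix_score, 0.0)
        for p, s in links:
            new_prefix[p] += suffix_score[s]
        prefix_score = normalize(new_prefix)
    return prefix_score, suffix_score


def stem(word, prefix_score, suffix_score):
    """Return the prefix of the highest-scoring split (assumed criterion)."""
    splits = list(candidate_splits(word))
    if not splits:
        return word  # one-letter words cannot be split
    best_prefix, _ = max(
        splits,
        key=lambda ps: prefix_score.get(ps[0], 0.0) * suffix_score.get(ps[1], 0.0),
    )
    return best_prefix


if __name__ == "__main__":
    # Toy Italian vocabulary, chosen only for illustration.
    vocabulary = ["parlare", "parlato", "parlando",
                  "mangiare", "mangiato", "mangiando"]
    prefixes, suffixes = score_substrings(vocabulary)
    for w in vocabulary:
        print(w, "->", stem(w, prefixes, suffixes))
```

In a retrieval setting, the vocabulary would come from the indexed collection, and the resulting stems would replace the full word forms at indexing and query time.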

Author information

Authors and Affiliations

Department of Information Engineering, University of Padua, Via Gradenigo, 6/a, 35031, Padova, Italy
Maristella Agosti, Michela Bacchin, Nicola Ferro & Massimo Melucci

Editor information

Editors and Affiliations

Istituto di Scienza e Tecnologie dell’Informazione, Consiglio Nazionale delle Ricerche (ISTI-CNR), Via G. Moruzzi 1, 56124, Pisa, Italy
Carol Peters
Eurospider Information Technology AG, Schaffhauserstr. 18, 8006, Zürich, Switzerland
Martin Braschler
Universidad Nacional de Educación a Distancia, Lenguajes y Sistemas Informáticos, Ciudad Universitaria, 28040, Madrid, Spain
Julio Gonzalo
Informationszentrum Sozialwissenschaften, Arbeitsgemeinschaft Sozialwissenschaftlicher Institute e.V. (IZ), Lennéstr. 30, 53113, Bonn, Germany
Michael Kluck

Cite this paper

Agosti, M., Bacchin, M., Ferro, N., Melucci, M. (2003). Improving the Automatic Retrieval of Text Documents. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds) Advances in Cross-Language Information Retrieval. CLEF 2002. Lecture Notes in Computer Science, vol 2785. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45237-9_23

DOI: https://doi.org/10.1007/978-3-540-45237-9_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40830-7
Online ISBN: 978-3-540-45237-9
eBook Packages: Springer Book Archive