poster

Don't have a stemmer?: be un+concern+ed

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Pages 813 - 814

https://doi.org/10.1145/1390334.1390518

Published: 20 July 2008

Abstract

The choice of indexing terms used to represent documents crucially determines how effective subsequent retrieval will be. IR systems commonly use rule-based stemmers to normalize surface word forms to combat the problem of not finding documents that contain words related to query terms by inflectional or derivational morphology. But such stemmers are not available in all languages. In this paper we explore the effectiveness of unsupervised morphological segmentation as an alternative to stemming using test sets in thirteen European languages. We find that unsupervised segmentation is significantly better than unnormalized words, in several cases by more than 20%. However, rule-based stemming, if available, is better in low complexity languages. We also compare these methods to the use of character n-grams, finding that on average n-grams yield the best performance.
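To make the kinds of indexing terms compared in the abstract concrete, here is a minimal Python sketch of two of them: overlapping character n-grams and morpheme segments. It is an illustration only, not the authors' implementation. The function names, the n-gram length of 4, and the hard-coded segmentation table are all assumptions; the abstract does not state which n-gram length or which segmenter was used.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a single word.

    Words of length n or shorter are kept whole so they still yield a term.
    The default n=4 is an assumption made for illustration.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_terms(text, n=4):
    """Lowercase the text, split on whitespace, emit word-internal n-grams."""
    terms = []
    for word in text.lower().split():
        terms.extend(char_ngrams(word, n))
    return terms


def segment_terms(text, segmentations):
    """Replace each word by its morpheme segments when one is known.

    `segmentations` maps a word to a list of segments; unknown words are
    indexed unchanged.
    """
    terms = []
    for word in text.lower().split():
        terms.extend(segmentations.get(word, [word]))
    return terms


# Hypothetical table standing in for the output of a learned segmenter.
TOY_SEGMENTATIONS = {"unconcerned": ["un", "concern", "ed"]}

print(char_ngrams("unconcerned"))
print(segment_terms("be unconcerned", TOY_SEGMENTATIONS))
```

With either representation, the query word "concern" shares index terms with the document word "unconcerned" (all of its 4-grams, or the segment "concern"), which is the matching that unnormalized words would miss.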

References

[1] M. Creutz and K. Lagus. Unsupervised discovery of morphemes. In ACL-02 Workshop on Morphological and Phonological Learning, pages 21--30, 2002.

[2] G. M. Di Nunzio, N. Ferro, T. Mandl, and C. Peters. CLEF 2007: Ad hoc track overview. In CLEF 2007 Working Notes, 2007.

[3] D. Harman. How effective is stemming? JASIS, 42(1):7--15, 1991.

[4] D. A. Hull. Stemming algorithms: A case study for detailed evaluation. JASIS, 47(1):70--84, 1996.

[5] R. Krovetz. Viewing morphology as an inference process. In ACM SIGIR 1993, pages 191--202, 1993.

[6] M. Kurimo, M. Creutz, and V. Turunen. Overview of Morpho Challenge in CLEF 2007. In Working Notes of the CLEF 2007 Workshop, 2007.

[7] P. McNamee and J. Mayfield. Character n-gram tokenization for European language text retrieval. Information Retrieval, 7(1-2):73--97, 2004.

[8] M. F. Porter. An algorithm for suffix stripping. Program, 14:130--137, 1980.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
    2. Search engine architectures and scalability
      1. Search engine indexing

Published In

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

July 2008

934 pages

ISBN:9781605581644

DOI:10.1145/1390334

General Chairs: Tat-Seng Chua (National University of Singapore), Mun-Kew Leong (National Library Board, Singapore)

Program Chairs: Syung Hyon Myaeng (Information and Communications University, Korea), Douglas W. Oard (University of Maryland, College Park, USA), Fabrizio Sebastiani (Consiglio Nazionale delle Ricerche, Italy)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Conference

SIGIR '08: The 31st Annual International ACM SIGIR Conference

July 20 - 24, 2008

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Article Metrics

Total Citations: 6
Total Downloads: 462
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 1

Reflects downloads up to 28 Feb 2025

Cited By

Boyer C, Dolamic L, Falquet G (2015). Language Independent Tokenization vs. Stemming in Automated Detection of Health Websites’ HONcode Conformity: An Evaluation. Procedia Computer Science, 64, pages 224-231. Online publication date: 2015. https://doi.org/10.1016/j.procs.2015.08.484

Akinyemi J, Clarke C (2012). Fast and effective soft links. Software: Practice and Experience, 43:5, pages 577-593. Online publication date: 11-Apr-2012. https://doi.org/10.1002/spe.2122

Spiegler S, Monson C, Joshi A, Huang C, Jurafsky D (2010). EMMA. Proceedings of the 23rd International Conference on Computational Linguistics, pages 1029-1037. Online publication date: 23-Aug-2010. https://dl.acm.org/doi/10.5555/1873781.1873897

Chin S, DeCook R, Street W, Eichmann D, Feng D, Callan J, Hovy E, Paşca M (2010). Query-based text normalization selection models for enhanced retrieval accuracy. Proceedings of the NAACL HLT 2010 Workshop on Semantic Search, pages 19-26. Online publication date: 5-Jun-2010. https://dl.acm.org/doi/10.5555/1867767.1867770

McNamee P (2009). JHU Ad Hoc Experiments at CLEF 2008. Evaluating Systems for Multilingual and Multimodal Information Access, pages 170-177. Online publication date: 2009. https://doi.org/10.1007/978-3-642-04447-2_21

McNamee P (2008). JHU ad hoc experiments at CLEF 2008. Proceedings of the 9th Cross-language evaluation forum conference on Evaluating systems for multilingual and multimodal information access, pages 170-177. Online publication date: 17-Sep-2008. https://dl.acm.org/doi/10.5555/1813809.1813836
