Abstract
A full-text information retrieval system has to deal with various phenomena of string equivalence: case-insensitive matching, morphological inflection, derivation, synonymy, and hyponymy or hyperonymy. Technically, these can be handled either at indexing time, by reducing equivalent strings to a common form, or at query-processing time, by enriching the query with the whole set of equivalent forms. We argue that the latter approach allows for greater flexibility and easier maintenance, while being more affordable than is usually assumed. Our proposal is to enrich the query only with those forms that actually appear in the document base. Our experiments with a thesaurus-based information retrieval system showed only an insignificant increase in query size on average with a 200-megabyte document base, even for the highly inflective Spanish language.
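The idea of enriching a query only with forms that occur in the document base can be illustrated with a minimal sketch. It is not the system described in the paper: the function names, the toy documents, and the `EQUIVALENTS` table (standing in for a morphological dictionary or thesaurus) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each literal word form to the set of document ids containing it.

    No stemming or normalization beyond lowercasing is done at indexing time.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def enrich_query(term, equivalents, index):
    """Expand a query term, keeping only forms that occur in the index."""
    candidates = {term} | equivalents.get(term, set())
    return sorted(form for form in candidates if form in index)


def search(term, equivalents, index):
    """Return ids of documents containing any attested equivalent form."""
    result = set()
    for form in enrich_query(term, equivalents, index):
        result |= index[form]
    return result


# Hypothetical toy data, not taken from the paper.
DOCS = {
    1: "El niño canta",
    2: "Los niños cantaban ayer",
    3: "Ella lee",
}
# Hypothetical equivalence table: a lemma mapped to some inflected forms.
EQUIVALENTS = {
    "cantar": {"canta", "cantaba", "cantaban", "cantamos", "cantó"},
}

INDEX = build_index(DOCS)
expanded = enrich_query("cantar", EQUIVALENTS, INDEX)
hits = search("cantar", EQUIVALENTS, INDEX)
```

In this toy setting the lemma has five listed forms plus itself, but only the two that occur in the documents end up in the enriched query, which is the effect the abstract reports as keeping query growth small. Because the index stores literal forms, the equivalence table can be changed without re-indexing.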
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gelbukh, A.F. (2000). Lazy Query Enrichment: A Method for Indexing Large Specialized Document Bases with Morphology and Concept Hierarchy. In: Ibrahim, M., Küng, J., Revell, N. (eds) Database and Expert Systems Applications. DEXA 2000. Lecture Notes in Computer Science, vol 1873. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44469-6_49
DOI: https://doi.org/10.1007/3-540-44469-6_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67978-3
Online ISBN: 978-3-540-44469-5
eBook Packages: Springer Book Archive