Abstract
Document retrieval on natural languages with a rich morphology, particularly in terms of derivation and (single-word) composition, suffers from serious performance degradation under the direct query-term-to-text-word matching paradigm that underlies the vast majority of current search engines. We propose an alternative approach in which morphologically complex word forms, whether they appear in the query or in the documents, are segmented into relevant subwords (such as stems, named entities, acronyms), and these subwords are subsequently submitted to the matching procedure. We evaluate our approach with the AltaVista™ Search Engine on a large medical document collection.
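The idea of matching on subwords rather than on full word forms can be illustrated with a minimal sketch. It is not the segmentation procedure evaluated in the paper: the toy lexicon `SUBWORDS`, the longest-prefix-first strategy with backtracking, and the helper names `segment`, `index_terms`, and `matches` are all assumptions made for illustration.

```python
from typing import List, Optional, Set

# Hypothetical toy subword lexicon (an assumption, not the paper's resource).
SUBWORDS: Set[str] = {"gastr", "o", "enter", "itis", "hepat", "nephr"}


def segment(word: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split `word` into lexicon entries, trying longer prefixes first.

    Returns None when no complete segmentation exists.
    """
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


def index_terms(text: str, lexicon: Set[str]) -> Set[str]:
    """Map a text to its set of subwords; unsegmentable tokens are kept whole."""
    terms: Set[str] = set()
    for raw in text.split():
        token = raw.strip(".,;:()").lower()
        if not token:
            continue
        parts = segment(token, lexicon)
        terms.update(parts if parts else [token])
    return terms


def matches(query: str, document: str, lexicon: Set[str]) -> bool:
    """True if every subword of the query also occurs in the document."""
    return index_terms(query, lexicon) <= index_terms(document, lexicon)


if __name__ == "__main__":
    print(segment("gastroenteritis", SUBWORDS))
    print(matches("enteritis", "Acute gastroenteritis", SUBWORDS))
```

Under these assumptions, the query "enteritis" matches a document containing only the compound "gastroenteritis", a match that direct query-term-to-text-word comparison would miss. A realistic system would need a much larger lexicon and a principled way of choosing among competing segmentations.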
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hahn, U., Honeck, M., Schulz, S. (2001). A Search Engine for Morphologically Complex Languages. In: Hoffmann, F., Hand, D.J., Adams, N., Fisher, D., Guimaraes, G. (eds) Advances in Intelligent Data Analysis. IDA 2001. Lecture Notes in Computer Science, vol 2189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44816-0_8
DOI: https://doi.org/10.1007/3-540-44816-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42581-6
Online ISBN: 978-3-540-44816-7
eBook Packages: Springer Book Archive