Abstract
Identifying words that share the same basic meaning (stemming) has important applications in Information Retrieval, above all in constructing word frequency lists. The usual morphology-based approaches (including the Porter stemmers) rely on language-dependent linguistic resources or knowledge, which causes problems when working with multilingual data and multi-thematic document collections. We propose several empirical formulae with easy-to-adjust parameters and show how to construct such formulae for a given language using the inductive method of model self-organization. This method considers a set of models (formulae) of a given class and selects the best ones using training and test samples. We describe the method and give detailed examples for French, Italian, Portuguese, and Spanish. The formulae are evaluated on real domain-oriented document collections. Our approach can be easily applied to other European languages.
Work done under partial support of the Mexican Government (CONACyT and CGEPI-IPN).
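The selection step outlined in the abstract (fit each candidate formula on a training sample, then keep the one that performs best on a separate test sample) can be illustrated with a minimal sketch. The abstract does not state the formulae themselves, so everything specific below is an assumption made for illustration: the model class (two words count as similar when the share of letters outside their common prefix does not exceed a polynomial in the prefix length), the coefficient grid, the function names, and the toy Spanish word pairs with their labels.

```python
from itertools import product


def common_prefix_len(a, b):
    """Length of the common initial part of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def features(a, b):
    """Return (y, n/s): prefix length and share of letters outside the prefix."""
    y = common_prefix_len(a, b)
    s = len(a) + len(b)   # total number of letters in both words
    n = s - 2 * y         # letters not covered by the common prefix
    return y, n / s


def make_model(coeffs):
    """Hypothetical formula: similar iff n/s <= c0 + c1*y + c2*y**2 + ..."""
    def similar(w1, w2):
        y, ratio = features(w1, w2)
        threshold = sum(c * y ** k for k, c in enumerate(coeffs))
        return ratio <= threshold
    return similar


def error_rate(model, sample):
    """Fraction of labelled pairs (w1, w2, is_similar) the model gets wrong."""
    wrong = sum(model(a, b) != label for a, b, label in sample)
    return wrong / len(sample)


def fit(degree, train, grid):
    """Grid-search the coefficients of one candidate on the training sample."""
    return min(product(grid, repeat=degree + 1),
               key=lambda c: error_rate(make_model(c), train))


def select(train, test, max_degree, grid):
    """Fit every candidate on train, keep the one with the lowest test error.

    Ties are broken in favour of the lower polynomial degree.
    """
    candidates = []
    for d in range(max_degree + 1):
        coeffs = fit(d, train, grid)
        candidates.append((error_rate(make_model(coeffs), test), d, coeffs))
    return min(candidates)


# Toy, hand-labelled pairs for illustration only (not data from the paper).
train = [("libro", "libros", True), ("cantar", "cantamos", True),
         ("casa", "cama", False), ("perro", "pero", False)]
test = [("hablar", "hablamos", True), ("gato", "gatos", True),
        ("mano", "malo", False), ("sol", "sal", False)]
grid = [-0.1, 0.0, 0.1, 0.2, 0.3, 0.4]

best = select(train, test, max_degree=2, grid=grid)
print("test error, degree, coefficients:", best)
```

Scoring the candidates on pairs that were not used for fitting is what keeps the procedure from simply preferring the most flexible formula. A real application would need far larger labelled samples for each language than the handful of pairs used here.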
References
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
Cramer, H.: Mathematical Methods of Statistics. Cambridge (1946)
Gelbukh, A.: Exact and approximate prefix search under access locality requirements for morphological analysis and spelling correction. Computación y Sistemas 6(3), 167–182 (2003)
Gelbukh, A., Sidorov, G.: Zipf and Heaps Laws’ Coefficients Depend on Language. In: Gelbukh, A. (ed.) CICLing 2001. LNCS, vol. 2004, pp. 332–335. Springer, Heidelberg (2001)
Gelbukh, A., Sidorov, G.: Morphological Analysis of Inflective Languages through Generation. Procesamiento de Lenguaje Natural (29), 105–112 (2002)
Gelbukh, A., Sidorov, G.: Approach to construction of automatic morphological analysis systems for inflective languages with little effort. In: Gelbukh, A. (ed.) CICLing 2003. LNCS, vol. 2588, pp. 215–220. Springer, Heidelberg (2003)
Ivahnenko, A.: Manual on typical algorithms of modeling. Tehnika Publ., Kiev (1980) (in Russian)
Makagonov, P., Alexandrov, M.: Constructing empirical formulas for testing word similarity by the inductive method of model self-organization. In: Ranchhod, Mamede (eds.) Advances in Natural Language Processing. LNCS (LNAI), vol. 2379, pp. 239–247. Springer, Heidelberg (2002)
Porter, M.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alexandrov, M., Blanco, X., Makagonov, P. (2004). Testing Word Similarity: Language Independent Approach with Examples from Romance. In: Meziane, F., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2004. Lecture Notes in Computer Science, vol 3136. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27779-8_20
DOI: https://doi.org/10.1007/978-3-540-27779-8_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22564-5
Online ISBN: 978-3-540-27779-8
eBook Packages: Springer Book Archive