An Ensemble of Classifiers Methodology for Stemming in Inflectional Languages: Using the Example of Latvian

Eger, Steffen; Sējāne, Ineta

doi:10.3233/978-1-60750-641-6-217

Abstract

In this paper, we present a stemming methodology based on both a hand-crafted rule-based system and data-driven machine learning approaches. The rule-based system models phenomena of Latvian, a highly inflectional language, in a linguistically sound and consistent way. While the hand-crafted stemmer can be used on its own, it may also serve as a supplier of training data for our statistical modeling. This modeling relies on two assumptions that are quite natural in the context of stemming and many other NLP applications, such as grapheme-to-phoneme conversion and lemmatization: that the output sequence is not longer than the input sequence, and that the orderings of input and output sequence characters are ‘similar’. Under these conditions, we train several machine learning algorithms and show that very good results for stemming in Latvian can be obtained by combining them via bootstrapping and ensemble-of-classifiers methods.

IOS Press Copyright 2024
