Abstract
Automatically creating syntactic and semantic word categorizations is difficult for highly inflecting languages because of severe data sparsity. Moreover, studying colloquial language resources requires fully corpus-based tools. We present a completely automated approach for producing word categorizations for morphologically rich languages. A Self-Organizing Map (SOM) clusters words based on the morphological properties of their context words. These properties are extracted with Morfessor, an automated morphological segmentation algorithm. Our experiments on a colloquial Finnish corpus of stories told by young children show that using unsupervised morphs as features yields clearly better clusterings than using whole context words as features.
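The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy segmentation table `TOY_SEGMENTATION` stands in for Morfessor output, and the one-word context window, row normalisation, 2x2 map size, and learning schedule are all assumptions chosen for brevity. The function names `context_morph_features`, `train_som`, and `best_matching_units` are hypothetical.

```python
from collections import Counter, defaultdict

import numpy as np

# Hypothetical segmentations standing in for Morfessor output (assumption).
TOY_SEGMENTATION = {
    "talossa": ["talo", "ssa"],
    "autossa": ["auto", "ssa"],
    "juoksi": ["juoks", "i"],
}


def segment(word):
    """Return the morphs of a word; unknown words are kept whole."""
    return TOY_SEGMENTATION.get(word, [word])


def context_morph_features(sentences, segment_fn):
    """Describe each word by the morphs of its left and right neighbours."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if i > 0:
                for morph in segment_fn(sentence[i - 1]):
                    counts[word][("L", morph)] += 1
            if i + 1 < len(sentence):
                for morph in segment_fn(sentence[i + 1]):
                    counts[word][("R", morph)] += 1
    words = sorted(counts)
    features = sorted({f for c in counts.values() for f in c})
    column = {f: j for j, f in enumerate(features)}
    matrix = np.zeros((len(words), len(features)))
    for row, word in enumerate(words):
        for feature, n in counts[word].items():
            matrix[row, column[feature]] = n
    # Scale each row to unit length so frequent words do not dominate.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return words, features, matrix / np.where(norms == 0, 1.0, norms)


def train_som(data, rows, cols, epochs=50, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, data.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac)  # linearly decaying learning rate
            sigma = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5  # shrinking radius
            bmu = np.argmin(np.linalg.norm(weights - data[idx], axis=1))
            dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (data[idx] - weights)
            step += 1
    return weights


def best_matching_units(data, weights):
    """Return the index of the closest map unit for every data row."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


sentences = [["kissa", "juoksi", "talossa"], ["koira", "juoksi", "autossa"]]
words, features, X = context_morph_features(sentences, segment)
weights = train_som(X, rows=2, cols=2)
bmus = best_matching_units(X, weights)
for word, unit in zip(words, bmus):
    print(word, "-> map unit", unit)
```

Words whose neighbours share the same morphs receive identical feature vectors and therefore land on the same map unit. Replacing `segment` with a function that returns the whole word gives the whole-word baseline the abstract compares against.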
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Klami, M., Lagus, K. (2006). Unsupervised Word Categorization Using Self-Organizing Maps and Automatically Extracted Morphs. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2006. IDEAL 2006. Lecture Notes in Computer Science, vol 4224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11875581_109
DOI: https://doi.org/10.1007/11875581_109
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45485-4
Online ISBN: 978-3-540-45487-8
eBook Packages: Computer Science, Computer Science (R0)