Abstract
Language software applications encounter new words, e.g., acronyms, technical terminology, names, or compounds of such words. To add new words to a lexicon, we need to indicate their inflectional paradigm. We present a new, generally applicable method for creating an entry generator, i.e., a paradigm guesser, for finite-state transducer lexicons. As a guesser tends to produce numerous suggestions, it is important that the correct suggestions be among the first few candidates. We prove some formal properties of the method and evaluate it on Finnish, English, and Swedish full-scale transducer lexicons. We use the open-source Helsinki Finite-State Technology [1] to create finite-state transducer lexicons from existing lexical resources and automatically derive guessers for unknown words. The method has a recall of 82–87% and a precision of 71–76% for the three test languages. The model needs no external corpus and can therefore serve as a baseline.
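To make the idea of a ranked paradigm guesser concrete, the following is a minimal sketch in plain Python. It is not the finite-state method of the paper and does not use HFST; it only illustrates the general notion of suggesting paradigms for an unknown word by analogy with the endings of known lexicon entries, with the longest matching suffix ranked first. The toy lexicon, the paradigm labels, and the function names `build_suffix_index` and `guess_paradigms` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon of (base form, paradigm label) pairs.
# These entries and labels are illustrative, not data from the paper.
LEXICON = [
    ("walk", "V-regular"),
    ("talk", "V-regular"),
    ("chalk", "N-s"),
    ("box", "N-es"),
    ("fox", "N-es"),
    ("city", "N-ies"),
    ("party", "N-ies"),
]


def build_suffix_index(lexicon, max_len=4):
    """Map each word-final substring (up to max_len chars) to paradigm counts."""
    index = defaultdict(Counter)
    for word, paradigm in lexicon:
        for n in range(1, min(max_len, len(word)) + 1):
            index[word[-n:]][paradigm] += 1
    return index


def guess_paradigms(word, index, max_len=4):
    """Return candidate paradigms for an unknown word, best guess first.

    Each paradigm is scored by (length of the longest matching suffix,
    number of lexicon entries sharing that suffix and paradigm).
    """
    scores = {}
    # Try the longest suffix first so it determines a paradigm's score.
    for n in range(min(max_len, len(word)), 0, -1):
        for paradigm, count in index.get(word[-n:], {}).items():
            if paradigm not in scores:
                scores[paradigm] = (n, count)
    return sorted(scores, key=lambda p: scores[p], reverse=True)


if __name__ == "__main__":
    index = build_suffix_index(LEXICON)
    print(guess_paradigms("stalk", index))  # ['V-regular', 'N-s']
```

In this sketch, ranking by suffix length stands in for the requirement that correct suggestions appear among the first few candidates; a word with no matching ending simply yields an empty list.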
References
HFST–Helsinki Finite-State Technology, http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/index.shtml
Mikheev, A.: Unsupervised Learning of Word-Category Guessing Rules. In: Proc. of the 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996), pp. 327–334 (1996)
Oflazer, K., Nirenburg, S., McShane, M.: Bootstrapping Morphological Analyzers by Combining Human Elicitation and Machine Learning. Comp. Ling. 27(1), 59–85 (2001)
Wicentowski, R.: Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. PhD Thesis, Baltimore, USA (2002)
Goldsmith, J.A.: Morphological Analogy: Only a Beginning (2007), http://hum.uchicago.edu/~jagoldsm/Papers/analogy.pdf
Creutz, M., Hirsimäki, T., Kurimo, M., Puurula, A., Pylkkönen, J., Siivola, V., Varjokallio, M., Arisoy, E., Saraçlar, M., Stolcke, A.: Morph-based speech recognition and modeling of out-of-vocabulary words across languages. ACM Trans. on Speech and Lang. Proc. 5(1), art. 3 (2007)
Kurimo, M., Creutz, M., Turunen, V.: Overview of Morpho Challenge in CLEF 2007. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 19–21. Springer, Heidelberg (2008)
Lindén, K.: A probabilistic model for guessing base forms of new words by analogy. In: Gelbukh, A. (ed.) CICLing 2008. LNCS, vol. 4919, pp. 106–116. Springer, Heidelberg (2008)
Kuenning, G.: Dictionaries for International Ispell (2007), http://www.lasr.cs.ucla.edu/geoff/ispell-dictionaries.html
Lingsoft, Inc.: Demos, http://www.lingsoft.fi/?doc_id=107&lang=en
Koskenniemi, K.: Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki, Publication No. 11 (1983)
Karlsson, F.: SWETWOL: A Comprehensive Morphological Analyser for Swedish. Nordic Journal of Linguistics 15(1), 1–45 (1992)
Nykysuomen sanalista, http://kaino.kotus.fi/sanat/nykysuomi/
FreeLing 2.1–An Open Source Suite of Language Analyzers, http://garraf.epsevg.upc.es/freeling/
Westerberg, T.: Den stora svenska ordlistan (2008), http://www.dsso.se/
Mikheev, A.: Automatic Rule Induction for Unknown-Word Guessing. Comp. Ling. 23(3), 405–423 (1997)
Stroppa, N., Yvon, F.: An Analogical Learner for Morphological Analysis. In: Proc. of the 9th Conference on Computational Natural Language Learning (CoNLL), pp. 120–127 (2005)
Wicentowski, R.: Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model. In: Proc. of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology, ACL, pp. 70–77 (2004)
Claveau, V., L’Homme, M.C.: Structuring Terminology using Analogy-Based Machine Learning. In: Proceedings of the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, pp. 17–18 (2005)
Baldwin, T.: Bootstrapping Deep Lexical Resources: Resources for Courses. In: Proc. of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, ACL, pp. 67–76 (2005)
Daelemans, W., Zavrel, J., Sloot, K., Bosch, A.: TiMBL: Tilburg Memory-Based Learner, version 6.0, Reference Guide. Technical Report ILK07-03, Department of Communication and Information Sciences, Tilburg University (2003)
Pirinen, T.: Open Source Morphology for Finnish using Finite-State Methods (in Finnish). Technical Report. Department of Linguistics, University of Helsinki (2008)
Sakarovitch, J.: Éléments de théorie des automates. Vuibert (2003)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lindén, K. (2009). Guessers for Finite-State Transducer Lexicons. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_13
DOI: https://doi.org/10.1007/978-3-642-00382-0_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)