Corpus-Based Lexeme Ranking for Morphological Guessers

  • Conference paper
State of the Art in Computational Morphology (SFCM 2009)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 41)

Abstract

Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. To add a new word to a morphological lexicon, we need to determine its base form and indicate its inflectional paradigm. A base form and a paradigm together define a lexeme. In this article, we evaluate a lexicon-based method, augmented with data from a corpus or the internet, for generating and ranking lexeme suggestions for new words. As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise it may be more efficient to create the entries by hand. By generating lexeme suggestions with an entry generator and then generating some key word forms for each suggested lexeme, we can look for support for the lexemes in a corpus. Our ranking methods have 56–79% average precision and 78–89% recall among the top 6 candidates, i.e., an F-score of 65–84%, indicating that the first correct entry suggestion is on average found as the second or third candidate. The corpus-based ranking methods were found to be significant in practice, as they save time for the lexicographer by increasing recall by 7–8% among the top candidates.
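The ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rank_lexemes`, the toy English candidates, the paradigm labels, the key word forms, and the scoring rule (count how many distinct key forms of a candidate occur in the corpus, with ties broken by the generator's original order) are all assumptions made for illustration only.

```python
from collections import Counter


def rank_lexemes(candidates, corpus_counts):
    """Rank lexeme candidates by corpus support for their key word forms.

    candidates: list of (base_form, paradigm, key_forms) tuples, in the
        order an entry generator proposed them (hypothetical format).
    corpus_counts: mapping from word form to its corpus frequency.
    Returns (base_form, paradigm, support) tuples, best supported first.
    """
    scored = []
    for order, (base_form, paradigm, key_forms) in enumerate(candidates):
        # Support = number of distinct key forms attested in the corpus.
        support = sum(
            1 for form in set(key_forms) if corpus_counts.get(form, 0) > 0
        )
        # Negate support so that sorting ascending puts the best first;
        # the generator order breaks ties.
        scored.append((-support, order, base_form, paradigm))
    scored.sort()
    return [
        (base_form, paradigm, -neg_support)
        for neg_support, _, base_form, paradigm in scored
    ]


# Toy corpus and hypothetical candidates for the unseen word form "faxes".
corpus_counts = Counter(
    "the fax arrived and two more faxes were faxed later".split()
)
candidates = [
    ("faxe", "noun, plural in -s", ["faxe", "faxes"]),
    ("fax", "noun, plural in -es", ["fax", "faxes"]),
    ("faxes", "noun, plural in -es", ["faxes", "faxeses"]),
]
ranked = rank_lexemes(candidates, corpus_counts)
for base_form, paradigm, support in ranked:
    print(base_form, "|", paradigm, "| supported key forms:", support)
```

In this toy setting the candidate with base form "fax" moves to the top because both of its key forms are attested, while the other two candidates are supported only by the observed form itself. A real system would obtain the candidates and key forms from a morphological entry generator and would likely weight the evidence more carefully.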

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lindén, K., Tuovila, J. (2009). Corpus-Based Lexeme Ranking for Morphological Guessers. In: Mahlow, C., Piotrowski, M. (eds) State of the Art in Computational Morphology. SFCM 2009. Communications in Computer and Information Science, vol 41. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04131-0_9

  • DOI: https://doi.org/10.1007/978-3-642-04131-0_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04130-3

  • Online ISBN: 978-3-642-04131-0

  • eBook Packages: Computer Science, Computer Science (R0)
