
Morphological Lexicon Extraction from Raw Text Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Abstract

The tool extract enables the automatic extraction of lemma-paradigm pairs from raw text data. The tool uses search patterns that consist of regular expressions and propositional logic. These search patterns define sufficient conditions for including lemma-paradigm pairs in the lexicon, on the basis of word forms occurring in the data. This paper explains the search pattern syntax of extract as well as the search algorithm, and discusses the design of search patterns from the recall and precision point of view.

The extract tool was developed for morphologies defined in the Functional Morphology tool [1], but it is usable for all systems that implement a word-and-paradigm description of a morphology.

The usefulness of the tool is demonstrated by a case study on the Canadian Hansards Corpus of French. The result is evaluated in terms of precision of the extracted lemmas and statistics on coverage and rule productiveness. Competitive extraction figures show that human-written rules in a tailored tool are a time-efficient approach to the task at hand.
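The idea of a search pattern as a sufficient condition can be illustrated with a small sketch. It does not reproduce the syntax or the search algorithm of extract, which this page does not show. The names `Rule`, `extract_pairs` and `verb_er`, the suffix lists, and the toy word list are all assumptions made for illustration. A rule pairs a regular expression over the stem with a propositional condition over word forms: every required form must occur in the data, and no forbidden form may occur.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical search pattern; not the actual extract rule format."""
    paradigm: str                    # paradigm name to emit
    stem: str                        # regular expression the stem must match
    lemma_suffix: str                # suffix that turns a stem into the lemma
    required: Tuple[str, ...]        # suffixes whose forms must ALL occur
    forbidden: Tuple[str, ...] = ()  # suffixes whose forms must NOT occur


def extract_pairs(words, rule):
    """Return sorted (lemma, paradigm) pairs licensed by the rule."""
    vocab = set(words)
    stem_re = re.compile(rule.stem)
    first = rule.required[0]
    pairs = set()
    for word in vocab:
        # Propose a candidate stem from the first required suffix.
        if not word.endswith(first):
            continue
        stem = word[:len(word) - len(first)]
        if not stem_re.fullmatch(stem):
            continue
        # Propositional part: conjunction of required forms,
        # plus negation of every forbidden form.
        has_all = all(stem + s in vocab for s in rule.required)
        has_none = not any(stem + s in vocab for s in rule.forbidden)
        if has_all and has_none:
            pairs.add((stem + rule.lemma_suffix, rule.paradigm))
    return sorted(pairs)


# Toy data: only "parl-" is attested with all three required forms.
words = ["parler", "parlons", "parlent",
         "chanter", "chantons", "maison", "maisons"]
rule = Rule(paradigm="verb_er", stem=r"\w{2,}", lemma_suffix="er",
            required=("er", "ons", "ent"))

print(extract_pairs(words, rule))  # [('parler', 'verb_er')]
```

In this sketch, requiring more forms makes the condition stricter, which tends to favour precision, while requiring fewer forms admits more candidates and tends to favour recall. That trade-off is the design question the abstract raises for search patterns.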


References

  1. Forsberg, M., Ranta, A.: Functional Morphology. In: Proceedings of the Ninth ACM SIGPLAN International Conference of Functional Programming, Snowbird, Utah, pp. 213–223 (2004)


  2. Creutz, M., Lagus, K.: Inducing the morphological lexicon of a natural language from unannotated text. In: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2005), Finland, Espoo, June 15-17, pp. 106–113 (2005)


  3. Utpal Sharma, J.K., Das, R.: Unsupervised learning of morphology for building lexicon for a highly inflectional language. In: Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia, July 2002, Association for Computational Linguistics, pp. 1–10 (2002)


  4. Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages, and Computation, 2nd edn. Addison-Wesley, Reading (2001)


  5. Germann, U.: Corpus of Hansards of the 36th Parliament of Canada. Provided by the Natural Language Group of the University of Southern California Information Sciences Institute (2003) (15 million words) (accessed November 1, 2005). Downloadable at: http://www.isi.edu/natural-language/download/hansard/

  6. Clément, L., Sagot, B., Lang, B.: Morphology based automatic acquisition of large-coverage lexica. In: Proc. of LREC 2004, Lisboa, Portugal, pp. 1841–1844 (2004)


  7. Oliver, A., Tadić, M.: Enlarging the Croatian morphological lexicon by automatic lexical acquisition from raw corpora. In: Proc. of LREC 2004, Lisboa, Portugal, pp. 1259–1262 (2004)


  8. Oliver, A.: Adquisició d’informació lèxica i morfosintàctica a partir de corpus sense anotar: aplicació al rus i al croat. PhD thesis, Universitat de Barcelona (2004)


  9. Goldsmith, J.: Unsupervised learning of the morphology of natural language. Computational Linguistics 27(2), 153–198 (2001)


  10. Kermanidis, K.L., Fakotakis, N., Kokkinakis, G.: Automatic acquisition of verb subcategorization information by exploiting minimal linguistic resources. International Journal of Corpus Linguistics 9(1), 1–28 (2004)


  11. Faure, D., Nédellec, C.: Asium: Learning subcategorization frames and restrictions of selection. In: Kodratoff, Y. (ed.) 10th Conference on Machine Learning (ECML 1998) – Workshop on Text Mining, Chemnitz, Germany, April 1998. Springer, Berlin (1998)


  12. Gamallo, P., Agustini, A., Lopes, G.P.: Learning subcategorisation information to model a grammar with "co-restrictions". Traitement Automatique des Langues 44(1), 93–177 (2003)



Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Forsberg, M., Hammarström, H., Ranta, A. (2006). Morphological Lexicon Extraction from Raw Text Data. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_49


  • DOI: https://doi.org/10.1007/11816508_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37334-6

  • Online ISBN: 978-3-540-37336-0

  • eBook Packages: Computer Science, Computer Science (R0)
