A Morphological Analyzer Using Hash Tables in Main Memory (MAHT) and a Lexical Knowledge Base

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2012)

Abstract

This paper presents MAHT, a morphological analyzer for the Spanish language. The system is based mainly on storing words together with their morphological information, which yields a lexical knowledge base of almost five million words. The lexical knowledge base covers practically the whole morphological casuistry of Spanish. Prefixes and enclitic pronouns, however, are handled by simple rules, since the words that can include these elements are numerous and some of them are neologisms. MAHT reaches an average processing speed of over 275,000 words per second, which is possible because it uses hash tables in main memory. MAHT has been designed to isolate the data from the algorithms that analyze words, including their irregular forms. This design is very important for an irregular and highly inflectional language like Spanish, because it simplifies the insertion of new words and the maintenance of the program code.
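
The approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon entries, tag names, prefix list and enclitic list below are hypothetical, and a Python dict stands in for the in-memory hash table. Full word forms are looked up directly, and only prefixes and enclitic pronouns fall back to simple rules.

```python
# Toy lexicon: surface form -> list of (lemma, tag) analyses.
# All entries and tag names are illustrative assumptions, not MAHT data.
LEXICON = {
    "casas": [("casa", "NOUN.fem.pl"), ("casar", "VERB.pres.ind.2sg")],
    "da": [("dar", "VERB.pres.ind.3sg"), ("dar", "VERB.imp.2sg")],
    "mercado": [("mercado", "NOUN.masc.sg")],
}

PREFIXES = ("super", "anti", "ex")  # hypothetical sample
ENCLITICS = ("me", "te", "se", "lo", "la", "le", "nos", "los", "las", "les")
# Verb forms assumed to accept enclitics: imperative, infinitive, gerund.
CLITIC_HOSTS = ("VERB.imp", "VERB.inf", "VERB.ger")

# Attaching enclitics can add a written accent (da -> dámelo); undo it.
_ACUTE = str.maketrans("áéíóú", "aeiou")


def peel_enclitics(word, max_clitics=2):
    """Return (stem, clitics) candidates, peeling pronouns from the right."""
    results = []

    def walk(rest, found):
        if found:
            results.append((rest, found))
        if len(found) < max_clitics:
            for clitic in ENCLITICS:
                if rest.endswith(clitic) and len(rest) > len(clitic):
                    walk(rest[: -len(clitic)], [clitic] + found)

    walk(word, [])
    return results


def analyze(word):
    """Direct hash lookup first; prefix and enclitic rules as fallback."""
    word = word.lower()
    if word in LEXICON:
        return [{"lemma": lemma, "tag": tag} for lemma, tag in LEXICON[word]]

    analyses = []
    # Rule 1: known prefix + stored word.
    for prefix in PREFIXES:
        base = word[len(prefix):]
        if word.startswith(prefix) and base in LEXICON:
            for lemma, tag in LEXICON[base]:
                analyses.append(
                    {"lemma": prefix + lemma, "tag": tag, "prefix": prefix}
                )
    # Rule 2: stored verb form + enclitic pronouns.
    for stem, clitics in peel_enclitics(word):
        for lemma, tag in LEXICON.get(stem.translate(_ACUTE), []):
            if tag.startswith(CLITIC_HOSTS):
                analyses.append({"lemma": lemma, "tag": tag, "enclitics": clitics})
    return analyses
```

In this sketch, `analyze("casas")` is answered by a single lookup, while `analyze("dámelo")` and `analyze("supermercado")` are resolved by the fallback rules. Because the data lives entirely in `LEXICON`, adding a new word or an irregular form requires no change to the functions, which mirrors the separation of data and algorithms that the abstract emphasizes.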

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Carreras-Riudavets, F.J., Rodríguez-del-Pino, J.C., Hernández-Figueroa, Z., Rodríguez-Rodríguez, G. (2012). A Morphological Analyzer Using Hash Tables in Main Memory (MAHT) and a Lexical Knowledge Base. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_7

  • DOI: https://doi.org/10.1007/978-3-642-28604-9_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28603-2

  • Online ISBN: 978-3-642-28604-9

  • eBook Packages: Computer Science, Computer Science (R0)
