A Morphological Analyzer Using Hash Tables in Main Memory (MAHT) and a Lexical Knowledge Base

Carreras-Riudavets, Francisco J.; Rodríguez-del-Pino, Juan C.; Hernández-Figueroa, Zenón; Rodríguez-Rodríguez, Gustavo

doi:10.1007/978-3-642-28604-9_7

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7181)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper presents MAHT, a morphological analyzer for the Spanish language. The system is based mainly on storing words together with their morphological information, which yields a lexical knowledge base of almost five million words. The lexical knowledge base covers practically the whole range of morphological cases in Spanish. Prefixes and enclitic pronouns, however, are handled by simple rules, because the words that can take these elements are numerous and some of them are neologisms. MAHT reaches an average processing speed of over 275,000 words per second, which is possible because it uses hash tables in main memory. MAHT has been designed to separate the data from the algorithms that analyze words, including their irregular forms. For an irregular and highly inflectional language such as Spanish, this design is very important because it simplifies both the insertion of new words and the maintenance of the program code.
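The lookup strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny `LEXICON`, the `ENCLITICS` and `PREFIXES` lists, the accent-stripping shortcut, and every function name are assumptions made for illustration only. A Python `dict` stands in for the in-memory hash table, and two simple fallback rules handle enclitic pronouns and prefixes when the direct lookup fails.

```python
# Hypothetical toy lexicon: surface form -> list of (lemma, tags).
# A Python dict is a hash table held in main memory.
LEXICON = {
    "da": [("dar", "verb, imperative")],
    "cantar": [("cantar", "verb, infinitive")],
    "cantando": [("cantar", "verb, gerund")],
    "hacer": [("hacer", "verb, infinitive")],
}

# Illustrative, incomplete lists (assumptions, not taken from the paper).
ENCLITICS = ("me", "te", "se", "lo", "la", "le", "los", "las", "les", "nos", "os")
PREFIXES = ("re", "des", "pre", "anti", "super")

# Attaching enclitics can add a written accent (da + me + lo -> dámelo),
# so the stem is also tried with acute accents removed. This is a crude
# simplification of the real orthographic rules.
ACCENTS = str.maketrans("áéíóú", "aeiou")


def enclitic_splits(word, max_pronouns=3):
    """Return (stem, [pronouns]) candidates by peeling pronouns off the end."""
    results = []
    frontier = [(word, [])]
    for _ in range(max_pronouns):
        next_frontier = []
        for stem, pronouns in frontier:
            for p in ENCLITICS:
                if stem.endswith(p) and len(stem) > len(p):
                    next_frontier.append((stem[: -len(p)], [p] + pronouns))
        results.extend(next_frontier)
        frontier = next_frontier
    return results


def analyze(word):
    """Return a list of analyses; an empty list means the word is unknown."""
    word = word.lower()

    # 1. Direct hash-table lookup of the stored form.
    if word in LEXICON:
        return [
            {"lemma": lemma, "tags": tags, "prefix": None, "enclitics": []}
            for lemma, tags in LEXICON[word]
        ]

    analyses = []

    # 2. Rule: strip enclitic pronouns, then look up the remaining stem.
    for stem, pronouns in enclitic_splits(word):
        plain = stem.translate(ACCENTS)
        forms = [stem] if stem == plain else [stem, plain]
        for form in forms:
            for lemma, tags in LEXICON.get(form, []):
                analyses.append(
                    {"lemma": lemma, "tags": tags, "prefix": None, "enclitics": pronouns}
                )

    # 3. Rule: strip a known prefix, then look up the remainder.
    for prefix in PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest in LEXICON:
            for lemma, tags in LEXICON[rest]:
                analyses.append(
                    {"lemma": lemma, "tags": tags, "prefix": prefix, "enclitics": []}
                )

    return analyses


if __name__ == "__main__":
    for w in ("cantar", "dámelo", "cantándolo", "rehacer"):
        print(w, analyze(w))
```

Because the lexicon is plain data and the rules are generic, adding a new word in this sketch means adding a dictionary entry rather than changing code, which mirrors the separation of data and algorithms that the abstract emphasizes.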

Author information

Authors and Affiliations

Departamento de Informática y Sistemas, Universidad de Las Palmas de Gran Canaria, 35017, Las Palmas, Spain
Francisco J. Carreras-Riudavets, Juan C. Rodríguez-del-Pino, Zenón Hernández-Figueroa & Gustavo Rodríguez-Rodríguez

Editor information

Editors and Affiliations

Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Carreras-Riudavets, F.J., Rodríguez-del-Pino, J.C., Hernández-Figueroa, Z., Rodríguez-Rodríguez, G. (2012). A Morphological Analyzer Using Hash Tables in Main Memory (MAHT) and a Lexical Knowledge Base. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_7

DOI: https://doi.org/10.1007/978-3-642-28604-9_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science, Computer Science (R0)