Abstract
Text processing in Serbian is based on a system of electronic dictionaries in the INTEX format. Although lexical recognition succeeds for 75% to 90% of word forms, depending on the type of text, some categories of words remain unrecognized. In this paper we present two enhancements to the e-dictionaries that enable recognition of two important categories of words: named entities and words generally not recorded in traditional dictionaries. We first describe the structure and content of the dictionaries of proper names, both personal and geographic, developed to recognize the corresponding classes of named entities. We then present a set of lexical transducers that express the morphological rules governing word formation, developed for the recognition of unknown words. The resources presented significantly improve the lexical recognition process.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pavlović-Lažetić, G., Vitas, D., Krstev, C. (2004). Towards Full Lexical Recognition. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2004. Lecture Notes in Computer Science, vol 3206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30120-2_23
DOI: https://doi.org/10.1007/978-3-540-30120-2_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23049-6
Online ISBN: 978-3-540-30120-2
eBook Packages: Springer Book Archive