Abstract
Named-Entity-Recognition (NER) is widely used as a foundation for Natural Language Processing (NLP) applications. There have been few previous attempts at building generic NER systems for the Tamil language. These attempts were based on machine-learning approaches such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM), Support Vector Machines (SVM) and Conditional Random Fields (CRF). Among them, CRF has been shown to be the most accurate for NER in Tamil. This paper presents a novel approach to building a Tamil NER system using the Margin-Infused Relaxed Algorithm (MIRA). We also compare the performance of the MIRA and CRF algorithms for Tamil NER. When gazetteer, POS-tag and orthographic features are used, MIRA attains an F1-measure of 81.38% on the Tamil BBC news data, whereas CRF attains an F1-measure of only 79.13% with the same set of features. Our NER system outperforms all previous NER systems for the Tamil language.
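The abstract does not spell out the MIRA update itself. As a rough illustration only, a single-best MIRA step for a per-token multiclass classifier, in the general form of Crammer and Singer's ultraconservative online updates, might look like the sketch below. The feature names, the label set and the aggressiveness cap `C` are hypothetical, and the sketch classifies tokens independently; it is not the sequence-labelling setup or feature templates used in the paper.

```python
from collections import defaultdict


def score(weights, features, label):
    """Linear score of `label` for one token's sparse feature dict."""
    return sum(weights[(label, name)] * value for name, value in features.items())


def mira_update(weights, features, gold, labels, C=1.0):
    """Apply one 1-best MIRA step in place and return the step size tau.

    Assumption: 0/1 cost, so the required margin between the gold label
    and the best-scoring wrong label is 1.
    """
    # Best-scoring label other than the gold one.
    rival = max((l for l in labels if l != gold),
                key=lambda l: score(weights, features, l))
    margin = score(weights, features, gold) - score(weights, features, rival)
    loss = 1.0 - margin
    if loss <= 0:
        return 0.0  # margin already satisfied: leave the weights unchanged

    # The gold and rival weight blocks are disjoint, so the squared norm of
    # the feature difference is twice the squared norm of the token features.
    sq_norm = 2.0 * sum(value * value for value in features.values())
    if sq_norm == 0:
        return 0.0

    # Smallest step that fixes the margin violation, capped at C.
    tau = min(C, loss / sq_norm)
    for name, value in features.items():
        weights[(gold, name)] += tau * value
        weights[(rival, name)] -= tau * value
    return tau


# Hypothetical token with a POS feature and a gazetteer feature.
weights = defaultdict(float)
token = {"pos=NNP": 1.0, "in_gazetteer": 1.0}
tau = mira_update(weights, token, gold="PER", labels=["PER", "O"])
```

The step size is the smallest change that restores the required margin for the current token, which is what makes the update "relaxed"; capping it at `C` limits the influence of any single noisy example.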
Notes
1. NN - Noun, NNC - Compound Noun, RB - Adverb, VM - Main Verb, SYM - Symbol, PRP - Personal Pronoun, JJ - Adjective, NNP - Proper Noun, PSP - Postposition, QC - Cardinal (Quantity Count), VAUX - Auxiliary Verb, DEM - Demonstrative, QF - Quantifier, NEG - Negative, QO - Ordinal (Quantity Order), WQ - Question Word, INTF - Intensifier, NNPC - Compound Proper Noun.
Acknowledgement
We would like to thank the AU-KBC Research Centre, Chennai, the Forum for Information Retrieval Evaluation (FIRE) and the Department of Registrations of Persons, Sri Lanka, for providing the language resources and tools necessary to carry out this research.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Theivendiram, P. et al. (2018). Named-Entity-Recognition (NER) for Tamil Language Using Margin-Infused Relaxed Algorithm (MIRA). In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2016. Lecture Notes in Computer Science, vol 9623. Springer, Cham. https://doi.org/10.1007/978-3-319-75477-2_33
DOI: https://doi.org/10.1007/978-3-319-75477-2_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75476-5
Online ISBN: 978-3-319-75477-2
eBook Packages: Computer Science, Computer Science (R0)