Abstract
In this paper we introduce a multilingual Named Entity Recognition (NER) system that uses statistical modeling techniques. The system identifies and classifies named entities (NEs) in Hungarian and English by applying AdaBoostM1 and the C4.5 decision tree learning algorithm. We focused on building as large a feature set as possible and used a split-and-recombine technique to fully exploit its potential. This methodology allowed us to train several independent decision tree classifiers on different subsets of features and to combine their decisions in a majority voting scheme. The corpus made for the CoNLL 2003 conference and a segment of the Szeged Corpus were used for training and validation. Both consist entirely of newswire articles. Our system is portable across languages without requiring any major modification; it slightly outperforms the best system of CoNLL 2003 and achieves a 94.77% F measure for Hungarian. The real value of our approach lies in its different basis compared to other top-performing models for English, which makes our system extremely successful when used in combination with CoNLL models.
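The split-and-recombine idea described above can be illustrated with a minimal sketch: split the feature set into subsets, train one classifier per subset, and take a majority vote over their predictions. This is not the authors' implementation. The feature names, the round-robin split, and the toy `LookupClassifier` (a stand-in for a boosted C4.5 tree) are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter

# Hypothetical token-level features; the paper's real feature set is much larger.
FEATURES = ["capitalized", "in_gazetteer", "prev_is_title"]


def split_features(feature_names, n_subsets):
    """Split the feature names round-robin into n_subsets groups (assumed scheme)."""
    return [feature_names[i::n_subsets] for i in range(n_subsets)]


class LookupClassifier:
    """Toy stand-in for a boosted C4.5 tree.

    It memorises the most frequent label for each combination of values
    of its own feature subset, and falls back to the overall majority label.
    """

    def __init__(self, features):
        self.features = features
        self.table = {}
        self.default = None

    def fit(self, rows, labels):
        counts = {}
        for row, label in zip(rows, labels):
            key = tuple(row[f] for f in self.features)
            counts.setdefault(key, Counter())[label] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        key = tuple(row[f] for f in self.features)
        return self.table.get(key, self.default)


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def train_ensemble(rows, labels, subsets):
    """Train one independent classifier per feature subset."""
    return [LookupClassifier(subset).fit(rows, labels) for subset in subsets]


def predict_ensemble(ensemble, row):
    """Recombine the individual decisions by majority voting."""
    return majority_vote([clf.predict(row) for clf in ensemble])


# Tiny invented training set: PER = person name, O = not a named entity.
train_rows = [
    {"capitalized": True, "in_gazetteer": True, "prev_is_title": True},
    {"capitalized": True, "in_gazetteer": True, "prev_is_title": False},
    {"capitalized": True, "in_gazetteer": False, "prev_is_title": False},
    {"capitalized": False, "in_gazetteer": False, "prev_is_title": False},
    {"capitalized": False, "in_gazetteer": False, "prev_is_title": False},
    {"capitalized": True, "in_gazetteer": False, "prev_is_title": True},
]
train_labels = ["PER", "PER", "O", "O", "O", "PER"]

ensemble = train_ensemble(train_rows, train_labels, split_features(FEATURES, 3))
```

With this toy data, a capitalized token that is not in the gazetteer but follows a title gets two votes for `PER` and one for `O`, so `predict_ensemble` labels it `PER`. Using an odd number of classifiers is a simple way to avoid ties when there are two classes; with more classes, a tie-breaking rule would have to be chosen.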
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Szarvas, G., Farkas, R., Kocsor, A. (2006). A Multilingual Named Entity Recognition System Using Boosting and C4.5 Decision Tree Learning Algorithms. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds) Discovery Science. DS 2006. Lecture Notes in Computer Science, vol 4265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893318_27
DOI: https://doi.org/10.1007/11893318_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46491-4
Online ISBN: 978-3-540-46493-8
eBook Packages: Computer Science, Computer Science (R0)