Optimizing CRF-Based Model for Proper Name Recognition in Polish Texts

Marcińczuk, Michał; Janicki, Maciej

doi:10.1007/978-3-642-28604-9_22

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7181)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

In this paper we present several optimizations introduced to a Conditional Random Fields-based model for proper name recognition in Polish running text. The proposed optimizations address word-level segmentation problems, gazetteer incompleteness, the problem of unambiguous generalization features, feature construction and selection, and finally the recognition of common proper names on the basis of external knowledge sources. The proper name recognition task is limited to person first names and surnames and names of countries, cities and roads. The evaluation is performed in two ways: a single-domain evaluation using 10-fold cross-validation on a Corpus of Stock Exchange Reports, and a cross-domain evaluation on a Corpus of Economic News. An additional corpus of Wikipedia articles, InfiKorp, is used in feature selection. Finally, we evaluate three configurations of the proposed modifications. The top configuration improved the F-measure from 94.53% to 95.65% in the single-domain evaluation and from 70.86% to 79.63% in the cross-domain evaluation.
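
To put the two reported improvements on a comparable footing, the arithmetic can be sketched as below. This is an illustrative calculation only, not something taken from the paper: the helper name `f_measure_gain` is hypothetical, and treating `100 - F` as an "error" figure is an assumption made here for the comparison.

```python
def f_measure_gain(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative reduction of 100 - F in percent).

    Assumption: 100 - F is used as a stand-in for the remaining error;
    the abstract itself reports only the F-measure values.
    """
    gain = improved - baseline
    baseline_error = 100.0 - baseline
    return gain, 100.0 * gain / baseline_error


# F-measure values (baseline, top configuration) as stated in the abstract.
reported = {
    "single-domain": (94.53, 95.65),
    "cross-domain": (70.86, 79.63),
}

results = {name: f_measure_gain(b, i) for name, (b, i) in reported.items()}

for name, (gain, reduction) in results.items():
    print(f"{name}: +{gain:.2f} points, {reduction:.1f}% reduction of (100 - F)")
```

Under that assumption, the single-domain gain of 1.12 points corresponds to roughly a 20.5% reduction of the remaining gap to 100%, and the cross-domain gain of 8.77 points to roughly 30.1%.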

Author information

Authors and Affiliations

Wrocław University of Technology, Wrocław, Poland
Michał Marcińczuk & Maciej Janicki

Editor information

Editors and Affiliations

Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Alexander Gelbukh

Cite this paper

Marcińczuk, M., Janicki, M. (2012). Optimizing CRF-Based Model for Proper Name Recognition in Polish Texts. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_22

DOI: https://doi.org/10.1007/978-3-642-28604-9_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science, Computer Science (R0)