Rich Set of Features for Proper Name Recognition in Polish Texts

Marcińczuk, Michał; Stanek, Michał; Piasecki, Maciej; Musiał, Adam

doi:10.1007/978-3-642-25261-7_26

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7053)

Included in the following conference series:

International Joint Conferences on Security and Intelligent Information Systems

Abstract

In this paper we analyse the importance of data generalisation and the use of local context in proper name recognition. We present an extended set of features that provides a generalised description of the data and encodes linguistic information. To make use of this rich feature set we applied Conditional Random Fields (CRF), a modern approach to sequence labelling. We report the results of a single-domain evaluation following a cross-validation scheme and of a cross-domain evaluation based on training and testing on different corpora. We show that the extended feature set improves the final results for CRF and that this approach outperforms Hidden Markov Models (HMM). In the single-domain evaluation CRF obtained an F-measure of 92.53% for 5 categories of proper names; in the cross-domain evaluation it obtained F-measures of 67.72% and 72.62% on the two other corpora.
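
The abstract names two ideas, generalised descriptions of tokens and local context, without listing the actual features. The sketch below is only an illustration of what such features could look like and is not the feature set used in the paper. It assumes a simple orthographic "shape" as the generalisation and a symmetric window of neighbouring tokens as the local context. The names word_shape, token_features and sentence_features are hypothetical.

```python
def word_shape(token: str) -> str:
    """Generalise a token to a collapsed orthographic pattern.

    Uppercase letters map to 'X', lowercase to 'x', digits to 'd';
    any other character is kept. Runs of the same class are collapsed,
    so 'Wrocław' and 'Kowalski' share the shape 'Xx'.
    (Assumed generalisation, not taken from the paper.)
    """
    shape = []
    for ch in token:
        if ch.isupper():
            cls = "X"
        elif ch.islower():
            cls = "x"
        elif ch.isdigit():
            cls = "d"
        else:
            cls = ch
        if not shape or shape[-1] != cls:
            shape.append(cls)
    return "".join(shape)


def token_features(tokens, i, window=2):
    """Describe token i together with its neighbours in [-window, +window].

    Each feature name carries the relative offset, so a sequence
    labeller can weight 'shape[-1]' and 'shape[0]' independently.
    """
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            tok = tokens[j]
            feats[f"orth[{offset}]"] = tok.lower()
            feats[f"shape[{offset}]"] = word_shape(tok)
            feats[f"cap[{offset}]"] = tok[:1].isupper()
        else:
            # Mark positions that fall outside the sentence.
            feats[f"pad[{offset}]"] = True
    return feats


def sentence_features(tokens, window=2):
    """Feature dictionaries for every token of one sentence."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]
```

A list of such dictionaries per sentence is the kind of input that common CRF toolkits accept. Which toolkit and which feature templates the authors used is not stated in the abstract.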

Author information

Authors and Affiliations

Wrocław University of Technology, Wrocław, Poland
Michał Marcińczuk, Michał Stanek, Maciej Piasecki & Adam Musiał

Editor information

Pascal Bouvry, Mieczysław A. Kłopotek, Franck Leprévost, Małgorzata Marciniak, Agnieszka Mykowiecka, Henryk Rybiński

Cite this paper

Marcińczuk, M., Stanek, M., Piasecki, M., Musiał, A. (2012). Rich Set of Features for Proper Name Recognition in Polish Texts. In: Bouvry, P., Kłopotek, M.A., Leprévost, F., Marciniak, M., Mykowiecka, A., Rybiński, H. (eds) Security and Intelligent Information Systems. SIIS 2011. Lecture Notes in Computer Science, vol 7053. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25261-7_26

DOI: https://doi.org/10.1007/978-3-642-25261-7_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25260-0
Online ISBN: 978-3-642-25261-7
eBook Packages: Computer Science, Computer Science (R0)
