Rich Set of Features for Proper Name Recognition in Polish Texts

  • Conference paper
Security and Intelligent Information Systems (SIIS 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7053)

Abstract

In this paper we analyse the importance of data generalisation and the use of local context in proper name recognition. We present an extended set of features that provide a generalised description of the data and encode linguistic information. To exploit this rich feature set we applied Conditional Random Fields (CRF), a modern approach to sequence labelling. We evaluate on a single domain using cross-validation, and across domains by training and testing on different corpora. We show that the extended set of features improves the results for CRF and that this approach outperforms Hidden Markov Models (HMM). On the single domain CRF obtained an F-measure of 92.53% for 5 categories of proper names, and F-measures of 67.72% and 72.62% on the two other corpora in the cross-domain evaluation.
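The abstract does not list the features themselves, so the following is only a minimal sketch of what "generalised description" and "local context" can look like in a CRF-style feature extractor. The function names (`word_shape`, `token_features`), the shape alphabet and the window size are all illustrative assumptions, not the authors' feature set.

```python
import re


def word_shape(token: str) -> str:
    """Generalise a token to a coarse pattern (hypothetical feature).

    Uppercase letters map to 'X', lowercase to 'x', digits to 'd';
    other characters are kept. Runs of one symbol are collapsed.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    # Collapse repeated symbols so that tokens of different length
    # but the same capitalisation pattern share one shape.
    return re.sub(r"(.)\1+", r"\1", "".join(shape))


def token_features(tokens, i, window=1):
    """Features for tokens[i] plus its neighbours within `window` positions.

    The window is an assumed way of encoding local context.
    """
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"shape[{offset}]"] = word_shape(tokens[j])
            feats[f"lower[{offset}]"] = tokens[j].lower()
        else:
            # Mark positions that fall outside the sentence.
            feats[f"pad[{offset}]"] = True
    return feats


if __name__ == "__main__":
    sentence = ["Jan", "Kowalski", "mieszka", "w", "Krakowie"]
    print(token_features(sentence, 1))
```

Feature dictionaries of this kind, one per token, are the usual input to a linear-chain CRF toolkit; the shape feature lets unseen names share statistics with names seen in training.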

Editor information

Pascal Bouvry, Mieczysław A. Kłopotek, Franck Leprévost, Małgorzata Marciniak, Agnieszka Mykowiecka, Henryk Rybiński

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Marcińczuk, M., Stanek, M., Piasecki, M., Musiał, A. (2012). Rich Set of Features for Proper Name Recognition in Polish Texts. In: Bouvry, P., Kłopotek, M.A., Leprévost, F., Marciniak, M., Mykowiecka, A., Rybiński, H. (eds) Security and Intelligent Information Systems. SIIS 2011. Lecture Notes in Computer Science, vol 7053. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25261-7_26

  • DOI: https://doi.org/10.1007/978-3-642-25261-7_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25260-0

  • Online ISBN: 978-3-642-25261-7

  • eBook Packages: Computer Science, Computer Science (R0)
