Heterogeneous Named Entity Similarity Function

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7499)

Abstract

Many text processing tasks require recognizing and classifying named entities. Currently available morphological analysers for Polish cannot handle unknown words, that is, words not included in the analyser's lexicon. Polish is a richly inflected language, so comparing two words, even ones sharing the same lemma, is a non-trivial task. The aim of the similarity function is to match an unknown word form with its corresponding word form in a named-entity dictionary. This article presents a complex similarity function based on a decision function implemented as a Logistic Regression classifier. The final similarity function combines several simple metrics with the help of the classifier. The proposed function is very effective in the word-form matching task.
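The general idea of combining several simple metrics through a logistic regression classifier can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the individual metrics, so the three features used here (normalized Levenshtein similarity, common-prefix ratio, length ratio), the toy training pairs, the plain gradient-descent training loop, and all function names are hypothetical.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def features(form: str, entry: str) -> list:
    """Simple metrics for a (word form, dictionary entry) pair.
    The choice of metrics is an assumption made for this sketch."""
    a, b = form.lower(), entry.lower()
    longest = max(len(a), len(b)) or 1
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        prefix += 1
    return [
        1.0 - levenshtein(a, b) / longest,      # normalized edit similarity
        prefix / longest,                       # common-prefix ratio
        1.0 - abs(len(a) - len(b)) / longest,   # length ratio
    ]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, epochs=2000, lr=0.5):
    """Fit logistic regression weights with batch gradient descent."""
    xs = [features(f, e) for f, e, _ in pairs]
    ys = [y for _, _, y in pairs]
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            grad_b += err
            for k, xk in enumerate(x):
                grad_w[k] += err * xk
        b -= lr * grad_b / len(xs)
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
    return w, b

def similarity(form: str, entry: str, model) -> float:
    """Classifier probability that `form` is an inflected form of `entry`."""
    w, b = model
    x = features(form, entry)
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def best_match(form: str, dictionary, model):
    """Return the dictionary entry with the highest similarity, and its score."""
    return max(((e, similarity(form, e, model)) for e in dictionary),
               key=lambda pair: pair[1])

# Made-up toy data: (unknown word form, dictionary entry, 1 if they match).
TOY_PAIRS = [
    ("Warszawie", "Warszawa", 1), ("Krakowie", "Kraków", 1),
    ("Gdańskiem", "Gdańsk", 1), ("Poznania", "Poznań", 1),
    ("Warszawie", "Kraków", 0), ("Gdańskiem", "Poznań", 0),
    ("Krakowie", "Warszawa", 0), ("Poznania", "Gdańsk", 0),
]

model = train(TOY_PAIRS)
print(best_match("Wrocławiu", ["Kraków", "Wrocław", "Warszawa"], model))
```

In this sketch the classifier learns how much weight each simple metric deserves, so the final score is a learned combination rather than a hand-tuned one. A real system would need a far larger set of labelled pairs and metrics chosen for Polish inflection.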

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kocoń, J., Piasecki, M. (2012). Heterogeneous Named Entity Similarity Function. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science, vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_27

  • DOI: https://doi.org/10.1007/978-3-642-32790-2_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32789-6

  • Online ISBN: 978-3-642-32790-2

  • eBook Packages: Computer Science, Computer Science (R0)
