Abstract
Current research efforts in Named Entity Recognition (NER) deal mostly with the English language. Even though interest in multi-language Information Extraction is growing, only a few works report results for the Russian language. This paper introduces quality baselines for the Russian NER task. We propose a corpus that was manually annotated with organization and person names. The main purpose of this corpus is to provide a gold standard for evaluation. We implemented and evaluated two approaches to NER: knowledge-based and statistical. The first comprises several components: dictionary matching, pattern matching, and rule-based search for lexical representations of entity names within a document. We assembled a set of linguistic resources and evaluated their impact on performance. For the data-driven approach we used our implementation of a linear-chain CRF with a rich set of features. The performance of both systems is promising (62.17% and 75.05% F1-measure), although they do not employ morphological or syntactic analysis.
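To make the evaluation setting concrete, the following is a minimal sketch of a dictionary-matching step scored with an entity-level F1-measure against a gold annotation. It is an illustration only, not the system described in the paper: the greedy longest-match strategy, the exact-span matching criterion, the function names `dictionary_match` and `f1_measure`, and the toy gazetteer and sentence are all assumptions introduced here.

```python
from typing import Dict, List, Set, Tuple

# A span is (start token index, end token index exclusive, entity label).
Span = Tuple[int, int, str]


def dictionary_match(tokens: List[str], gazetteer: Dict[str, str],
                     max_len: int = 4) -> Set[Span]:
    """Greedy longest-match lookup of lowercased token n-grams in a gazetteer."""
    spans: Set[Span] = set()
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in gazetteer:
                spans.add((i, i + n, gazetteer[key]))
                i += n  # skip past the matched entity
                matched = True
                break
        if not matched:
            i += 1
    return spans


def f1_measure(gold: Set[Span], predicted: Set[Span]) -> float:
    """Entity-level F1: a prediction counts only if span and label both match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): one person name and two organization names.
tokens = ["Иван", "Петров", "посетил", "Газпром", "и", "Яндекс"]
gazetteer = {"иван петров": "PER", "газпром": "ORG"}  # "Яндекс" is missing
gold = {(0, 2, "PER"), (3, 4, "ORG"), (5, 6, "ORG")}

predicted = dictionary_match(tokens, gazetteer)
score = f1_measure(gold, predicted)
print(f"F1 = {score:.2%}")
```

In this toy case the gazetteer finds two of the three gold entities with no false positives, so precision is 1.0, recall is 2/3, and the F1-measure is 0.8. The example also shows why dictionary coverage bounds the recall of a purely knowledge-based component.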
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gareev, R., Tkachenko, M., Solovyev, V., Simanovsky, A., Ivanov, V. (2013). Introducing Baselines for Russian Named Entity Recognition. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_27
DOI: https://doi.org/10.1007/978-3-642-37247-6_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37246-9
Online ISBN: 978-3-642-37247-6
eBook Packages: Computer Science, Computer Science (R0)