Automatically building large-scale named entity recognition corpora from Chinese Wikipedia

Zhou, Jie; Li, Bi-cheng; Chen, Gang

doi:10.1631/FITEE.1500067

Published: 07 November 2015

Volume 16, pages 940–956, (2015)

Frontiers of Information Technology & Electronic Engineering

Abstract

Named entity recognition (NER) is a core component of many natural language processing applications. Most NER systems rely on supervised machine learning methods, which require time-consuming and expensive annotation for each language and domain. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel, language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; the two are then combined to improve precision and coverage. Next, we identify the types of implicit mentions by using the boundary information of outgoing links. By selecting sentences related to the domains of the test data, we can train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results demonstrate the effectiveness of the automatically annotated corpora, and the trained NER models achieve their best performance when our silver-standard corpora are combined with gold-standard corpora.
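The general idea of turning outgoing links into silver-standard annotations can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `annotate`, the `title_types` dictionary (standing in for the output of an entity classification step), the `LOC` label, the character-level BIO scheme, and the plain string matching used for unlinked repeats are all assumptions made for illustration. Only the `[[target|anchor]]` link syntax comes from Wikipedia markup itself.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] links in wiki markup.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def bio(etype, length):
    """BIO tags for one mention of `length` characters."""
    return ["B-" + etype] + ["I-" + etype] * (length - 1)


def annotate(sentence, title_types):
    """Return (plain_text, tags) with one BIO tag per character.

    `title_types` is a hypothetical mapping from article title to
    entity type, assumed to come from a separate classification step.
    """
    parts, tags, seen = [], [], {}
    pos = 0
    for m in LINK_RE.finditer(sentence):
        before = sentence[pos:m.start()]
        parts.append(before)
        tags.extend(["O"] * len(before))

        title = m.group(1)
        anchor = m.group(2) or title
        etype = title_types.get(title)
        if etype is None:
            tags.extend(["O"] * len(anchor))
        else:
            tags.extend(bio(etype, len(anchor)))
            seen[anchor] = etype  # remember the linked surface form
        parts.append(anchor)
        pos = m.end()

    tail = sentence[pos:]
    parts.append(tail)
    tags.extend(["O"] * len(tail))
    text = "".join(parts)

    # Simplified stand-in for implicit mentions: tag unlinked repeats
    # of a surface form that was linked elsewhere in the sentence.
    for surface, etype in seen.items():
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            if all(t == "O" for t in tags[start:end]):
                tags[start:end] = bio(etype, len(surface))
            start = text.find(surface, end)
    return text, tags


example_text, example_tags = annotate(
    "[[北京市|北京]]是中国的首都,北京很大。", {"北京市": "LOC"}
)
```

In this example the first mention is tagged from its link, while the second, unlinked occurrence of the same surface form inherits the type by string matching. A real pipeline would need far more careful handling of mention boundaries and ambiguity than this sketch attempts.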

Author information

Authors and Affiliations

Department of Signal Analysis and Information Processing, Zhengzhou Information Science and Technology Institute, Zhengzhou, 450002, China
Jie Zhou, Bi-cheng Li & Gang Chen

Corresponding author

Correspondence to Jie Zhou.

Additional information

Project supported by the National Natural Science Foundation of China (No. 14BXW028)

ORCID: Jie ZHOU, http://orcid.org/0000-0001-5615-9334

About this article

Cite this article

Zhou, J., Li, Bc. & Chen, G. Automatically building large-scale named entity recognition corpora from Chinese Wikipedia. Frontiers Inf Technol Electronic Eng 16, 940–956 (2015). https://doi.org/10.1631/FITEE.1500067

Received: 07 March 2015
Accepted: 09 August 2015
Published: 07 November 2015
Issue Date: November 2015
DOI: https://doi.org/10.1631/FITEE.1500067

Key words

CLC number

TP391
