Abstract
Named entity recognition (NER) is a core component in many natural language processing applications. Most NER systems rely on supervised machine learning methods, which depend on time-consuming and expensive annotations in different languages and domains. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel and language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; the two are combined to improve precision and coverage. We then identify the types of implicit mentions by using the boundary information of outgoing links. By selecting the sentences related to the domains of the test data, we can train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results show the effectiveness of the automatically annotated corpora, and the trained NER models achieve the best performance when our silver-standard corpora are combined with gold-standard corpora.
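The general idea of turning Wikipedia links into NER annotations can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `annotate`, the `article_types` lookup (standing in for the output of the rule-based and supervised entity classification), the `LOC` label, and the simple string matching used for unlinked repeat mentions are all assumptions made for illustration. Only the `[[target|anchor]]` link syntax is standard wiki markup.

```python
import re

# Matches wiki links of the form [[target]] or [[target|anchor]].
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(wikitext, article_types):
    """Turn one sentence of wiki markup into (character, BIO tag) pairs.

    article_types is a hypothetical dict mapping article titles to entity
    types; in a real pipeline it would come from an entity classifier.
    """
    chars, tags = [], []
    mentions = {}  # anchor text -> entity type, learned from explicit links
    pos = 0
    for m in LINK.finditer(wikitext):
        # Plain text before the link is outside any entity.
        for ch in wikitext[pos:m.start()]:
            chars.append(ch)
            tags.append("O")
        target = m.group(1)
        anchor = m.group(2) or target
        etype = article_types.get(target)
        # The link boundaries give the mention boundaries.
        for i, ch in enumerate(anchor):
            chars.append(ch)
            if etype is None:
                tags.append("O")
            else:
                tags.append(("B-" if i == 0 else "I-") + etype)
        if etype is not None:
            mentions[anchor] = etype
        pos = m.end()
    for ch in wikitext[pos:]:
        chars.append(ch)
        tags.append("O")

    # Simplified handling of implicit mentions (an assumption of this sketch):
    # an unlinked repeat of a linked anchor inherits the anchor's type.
    text = "".join(chars)
    for anchor, etype in mentions.items():
        start = text.find(anchor)
        while start != -1:
            end = start + len(anchor)
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B-" + etype
                for i in range(start + 1, end):
                    tags[i] = "I-" + etype
            start = text.find(anchor, end)
    return list(zip(chars, tags))


if __name__ == "__main__":
    sentence = "[[北京市|北京]]是首都,北京很大。"
    for ch, tag in annotate(sentence, {"北京市": "LOC"}):
        print(ch, tag)
```

Tags are assigned per character because Chinese text has no whitespace word boundaries; a word-segmented variant would follow the same pattern.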
Additional information
Project supported by the National Natural Science Foundation of China (No. 14BXW028)
ORCID: Jie ZHOU, http://orcid.org/0000-0001-5615-9334
Cite this article
Zhou, J., Li, Bc. & Chen, G. Automatically building large-scale named entity recognition corpora from Chinese Wikipedia. Frontiers Inf Technol Electronic Eng 16, 940–956 (2015). https://doi.org/10.1631/FITEE.1500067
DOI: https://doi.org/10.1631/FITEE.1500067