Skip to main content
Log in

Generating Chinese named entity data from parallel corpora

  • Research Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

Annotating named entity recognition (NER) training corpora is a costly but necessary process for supervised NER approaches. This paper presents a general framework to generate large-scale NER training data from parallel corpora. In our method, we first employ a high performance NER system on one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word level alignments. Finally, we propose several strategies to select high-quality auto-labeled NER training data. We apply our approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that our approach can collect high-quality labeled data and can help improve Chinese NER.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Zhou G, Su J. Named entity recognition using an hmm-based chunk tagger. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002, 473–480

    Google Scholar 

  2. Chieu H L, Ng H T. Named entity recognition: a maximum entropy approach using global information. In: Proceedings of the 19th International Conference on Computational Linguistics. 2002, 1: 1–7

    Article  Google Scholar 

  3. Takeuchi K, Collier N. Use of support vector machines in extended named entity recognition. In: Proceedings of the 6th Conference on Natural Language Learning. 2002, 20: 1–7

    Google Scholar 

  4. Settles B. Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. 2004, 104–107

    Google Scholar 

  5. Florian R, Ittycheriah A, Jing H, Zhang T. Named entity recognition through classifier combination. In: Proceedings of the 7th Conference on Natural Language Learning. 2003, 4: 168–171

    Google Scholar 

  6. Klein D, Smarr J, Nguyen H, Manning C D. Named entity recognition with character-level models. In: Proceedings of the 7th Conference on Natural Language Learning. 2003, 4: 180–183

    Google Scholar 

  7. Finkel J, Dingare S, Manning C, Nissim M, Alex B, Grover C. Exploring the boundaries: gene and protein identification in biomedical text. BMC Bioinformatics, 2005, 6(Suppl 1): S5

    Article  Google Scholar 

  8. Ciaramita M, Altun Y. Named-entity recognition in novel domains with external lexical knowledge. In: Proceedings of Human Language Technologies: the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. 2005, 209–212

    Google Scholar 

  9. Resnik P, Smith N A. The web as a parallel corpus. Computational Linguistics, 2003, 29(3): 349–380

    Article  Google Scholar 

  10. Zhang Y, Wu K, Gao J, Vines P. Automatic acquisition of chinese-english parallel corpus from the web. In: Proceedings of the 28th European Conference on Advances in Information Retrieval. 2006, 420-431

  11. Fu R, Qin B, Liu T. Generating Chinese named entity data from a parallel corpus. In: Proceedings of the 5th International Joint Conference on Natural Language Processing. 2011, 264–272

    Google Scholar 

  12. Yarowsky D, Ngai G, Wicentowski R. Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the 1st International Conference on Human Language Technology Research. 2001, 1–8

    Google Scholar 

  13. Huang F, Vogel S. Improved named entity translation and bilingual named entity extraction. In: Proceedings of the 4th IEEE International Conference on Multimodal Interfaces. 2002, 253–258

    Chapter  Google Scholar 

  14. Burkett D, Petrov S, Blitzer J, Klein D. Learning better monolingual models with unannotated bilingual text. In: Proceedings of the 14th Conference on Computational Natural Language Learning. 2010, 46–54

    Google Scholar 

  15. Hassan A, Fahmy H, Hassan H. Improving named entity translation by exploiting comparable and parallel corpora. In: Proceedings of the 2007 Workshop on Acquisition and Management of Multilingual Lexicons. 2007, 1–6

    Google Scholar 

  16. An J, Lee S, Lee G G. Automatic acquisition of named entity tagged corpus from world wide web. In: Proceedings of the 41st AnnualMeeting on Association for Computational Linguistics. 2003, 2: 165–168

    Google Scholar 

  17. Whitelaw C, Kehlenbeck A, Petrovic N, Ungar L. Web-scale named entity recognition. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management. 2008, 123–132

    Google Scholar 

  18. Richman A E, Schone P. MiningWiki resources for multilingual named entity recognition. In: Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics. 2008, 1–9

    Google Scholar 

  19. Nothman J, Curran J R, Murphy T. Transforming wikipedia into named entity training data. In: Proceedings of the 2008 Australian Language Technology Workshop. 2008, 124–132

    Google Scholar 

  20. Vlachos A, Gasperin C. Bootstrapping and evaluating named entity recognition in the biomedical domain. In: Proceedings of the 2006 HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. 2006, 138–145

    Google Scholar 

  21. Ma X. Champollion: A robust parallel text sentence aligner. In: Proceedings of the 5th International Conference on Language Resources and Evaluation. 2006, 489–492

    Google Scholar 

  22. Och F J, Ney H. Improved statistical alignment models. In: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. 2000, 440–447

    Google Scholar 

  23. Koehn P, Och F J, Marcu D. Statistical phrase-based translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. 2003, 1: 48–54

    Article  Google Scholar 

  24. Lafferty J. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning. 2001, 282–289

    Google Scholar 

  25. Grishman R, Sundheim B. Message understanding conference-6: a brief history. In: Proceedings of the 1996 International Conference on Computational Linguistics. 1996, 466–471

    Google Scholar 

  26. Nadeau D, Sekine S. A survey of named entity recognition and classification. Lingvisticae Investigationes, 2007, 30(1): 3–26

    Article  Google Scholar 

  27. De Sitter A, Calders T, Daelemans W. A formal framework for evaluation of information extraction. Technical Report TR 2004-0, Department of Mathematics and Computer Science, University of Antwerp, 2004

    Google Scholar 

  28. Che W, Li Z, Liu T. LTP: a Chinese language technology platform. In: Proceedings of the 23rd International Conference on Computational Linguistics: Demonstrations. 2010, 13–16

    Google Scholar 

  29. Wu Y, Zhao J, Xu B, Yu H. Chinese named entity recognition based on multiple features. In: Proceedings of the 2005 Conference on Human Language Technology and Empirical Methods in Natural Language Processing. 2005, 427–434

    Chapter  Google Scholar 

  30. Zhang Y, Vogel S, Waibel A. Interpreting bleu/nist scores: how much improvement do we need to have a better system. In: Proceedings of the 2004 International Conference on Language Resources and Evaluation. 2004, 2051–2054

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ting Liu.

Additional information

Ruiji Fu is a PhD candidate at Harbin Institute of Technology (HIT), China. He received his MS and BS both in Computer Science from HIT, in 2009 and 2007 respectively. His research interests include natural language processing, text mining, and open information extraction.

Bing Qin received her PhD in computer science from Harbin Institute of Technology (HIT), China in 2005. She is a full professor in the School of Computer Science and Technology, HIT. Her research interests include natural language processing, text mining, and opinion mining.

Ting Liu received his PhD in computer science from Harbin Institute of Technology (HIT), China in 1998. He is a full professor in the School of Computer Science and Technology, HIT. His current research interests include natural language processing, information retrieval, and social computing.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Fu, R., Qin, B. & Liu, T. Generating Chinese named entity data from parallel corpora. Front. Comput. Sci. 8, 629–641 (2014). https://doi.org/10.1007/s11704-014-3127-5

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11704-014-3127-5

Keywords

Navigation