Mining English-Chinese Named Entity Pairs from Comparable Corpora

Published: 01 December 2011

Abstract

Bilingual named entity (NE) pairs are valuable resources for many NLP applications. Because comparable corpora are more accessible, abundant, and up to date, recent research has concentrated on mining bilingual lexicons from them. This research presents a novel approach to mining English-Chinese NE translations from comparable corpora by combining, for every candidate NE pair, multi-dimensional features drawn from various information sources: the transliteration model, English-Chinese matching, Chinese-English matching, the translation model, length, and the context vector. These features are integrated into one model using linear combination and the minimum sample risk (MSR) algorithm. Because NE translation is highly type-dependent, we integrate different features for different NE types. We experiment with individual features and with combinations of them to mine person NE (PN) pairs, location NE (LN) pairs, and organization NE (ON) pairs. For PN pairs, transliteration and length give the best performance, an F-score of 84.9%. For LN pairs, the transliteration model, length, translation model, English-Chinese matching, and Chinese-English matching give the best performance, an F-score of 83.4%. For ON pairs, English-Chinese matching and Chinese-English matching give the best performance, an F-score of 84.1%.
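
The type-dependent linear combination described in the abstract can be illustrated with a minimal sketch. The feature subsets per NE type follow the abstract; everything else is an assumption made for illustration: the function names, the feature keys, the example weights, and the feature scores are hypothetical, and the weights are fixed by hand rather than learned with MSR, which this sketch does not implement.

```python
from typing import Dict, List, Set, Tuple

# Feature subsets per NE type, as reported in the abstract.
# The key names themselves are hypothetical labels.
FEATURES_BY_TYPE: Dict[str, List[str]] = {
    "PN": ["transliteration", "length"],
    "LN": ["transliteration", "length", "translation", "ec_match", "ce_match"],
    "ON": ["ec_match", "ce_match"],
}


def combined_score(ne_type: str,
                   features: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted sum of the feature scores selected for this NE type."""
    return sum(weights[name] * features[name]
               for name in FEATURES_BY_TYPE[ne_type])


def best_translation(ne_type: str,
                     candidates: List[Tuple[str, Dict[str, float]]],
                     weights: Dict[str, float]) -> str:
    """Return the candidate translation with the highest combined score."""
    best = max(candidates,
               key=lambda cand: combined_score(ne_type, cand[1], weights))
    return best[0]


def f_score(predicted: Set[Tuple[str, str]],
            gold: Set[Tuple[str, str]]) -> float:
    """Balanced F-score (harmonic mean of precision and recall) over NE pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical weights and feature scores for one English person name.
pn_weights = {"transliteration": 0.7, "length": 0.3}
pn_candidates = [
    ("奥巴马", {"transliteration": 0.9, "length": 0.8}),
    ("巴黎", {"transliteration": 0.2, "length": 0.6}),
]
chosen = best_translation("PN", pn_candidates, pn_weights)
```

In this sketch only the features listed for the NE type contribute to the score, so a person name is ranked by transliteration and length alone even if other feature scores are available. In the approach the abstract describes, the weights would be estimated with MSR rather than set by hand.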

Published In

ACM Transactions on Asian Language Information Processing, Volume 10, Issue 4
December 2011
112 pages
ISSN:1530-0226
EISSN:1558-3430
DOI:10.1145/2025384

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2011
Accepted: 01 April 2011
Revised: 01 February 2011
Received: 01 November 2010
Published in TALIP Volume 10, Issue 4

Author Tags

  1. Chinese-English matching
  2. English-Chinese matching
  3. MSR
  4. Transliteration model
  5. comparable corpora
  6. mining
  7. named entity
  8. pairs
  9. translation model

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2022) Mining an English-Chinese parallel Dataset of Financial News. Journal of Open Humanities Data 8. DOI: 10.5334/johd.62. Online publication date: 18-Mar-2022
  • (2019) A Summary of Studies on Bilingual Comparable Corpus. 2019 International Conference on Smart Grid and Electrical Automation (ICSGEA), 595-599. DOI: 10.1109/ICSGEA.2019.00138. Online publication date: Aug-2019
  • (2018) A relation extraction method of Chinese named entities based on location and semantic features. Applied Intelligence 38:1, 1-15. DOI: 10.1007/s10489-012-0353-0. Online publication date: 28-Dec-2018
  • (2017) Corpus-Based Translation Induction in Indian Languages Using Auxiliary Language Corpora from Wikipedia. ACM Transactions on Asian and Low-Resource Language Information Processing 16:3, 1-25. DOI: 10.1145/3038295. Online publication date: 17-Mar-2017
  • (2015) Translation Induction on Indian Language Corpora Using Translingual Themes from Other Languages. Computational Linguistics and Intelligent Text Processing, 505-519. DOI: 10.1007/978-3-319-18111-0_38. Online publication date: 2015
  • (2014) Translation approaches in Cross Language Information Retrieval. International Conference on Computing and Communication Technologies, 1-4. DOI: 10.1109/ICCCT2.2014.7066750. Online publication date: Dec-2014
