Chinese New Word Identification: A Latent Discriminative Model with Global Features

Sun, Xiao; Huang, De-Gen; Song, Hai-Yu; Ren, Fu-Ji

doi:10.1007/s11390-011-9411-z

Regular Paper
Published: 11 January 2011

Volume 26, pages 14–24 (2011)

Journal of Computer Science and Technology

Xiao Sun¹,
De-Gen Huang²,
Hai-Yu Song¹ &
…
Fu-Ji Ren³

Abstract

New words are a persistent problem in Chinese natural language processing. With the rapid growth of the Internet and the accompanying information explosion, no system lexicon can ever be complete, because words absent from existing dictionaries are created continually. New word identification and POS tagging are usually performed as separate steps, so lexical features cannot be fully exploited. We propose a latent discriminative model that combines the strengths of the Latent Dynamic Conditional Random Field (LDCRF) and the semi-CRF to detect new words and their POS tags simultaneously, regardless of new word type, from Chinese text that has not been pre-segmented. Unlike a plain semi-CRF, the proposed model uses an LDCRF to generate candidate entities, which speeds up training and reduces computational cost. The complexity of the proposed hidden semi-CRF can be adjusted further by tuning the number of hidden variables and the number of candidate entities taken from the N-best outputs of the LDCRF model. We also propose a new-word-generating framework for model training and testing, under which the definitions and distributions of new words conform to those in real text. A global feature, called "Global Fragment Features", is adopted for new word identification. We tested the model on the SIGHAN-6 corpus. Experimental results show that the proposed method can detect even low-frequency new words together with their POS tags, and that the model performs competitively with state-of-the-art models.
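To make the candidate-generation idea in the abstract concrete, here is a minimal sketch of how out-of-lexicon spans might be collected from N-best segmentations before a second-stage model scores them. It is not the authors' implementation: the function `candidate_new_words`, the tuple layout, the toy lexicon, and the hand-written N-best list are all assumptions made for illustration, and no LDCRF or semi-CRF is actually trained or decoded here.

```python
from typing import Dict, List, Set, Tuple

# One segmentation = list of (word, POS tag) pairs covering the sentence.
Segmentation = List[Tuple[str, str]]
# Candidate key = (start offset, end offset, word, POS tag).
Candidate = Tuple[int, int, str, str]


def candidate_new_words(nbest: List[Segmentation],
                        lexicon: Set[str]) -> Dict[Candidate, int]:
    """Collect out-of-lexicon spans from N-best segmentations.

    Returns a mapping from each candidate span to the number of
    N-best outputs that proposed it. Hypothetical helper, not taken
    from the paper.
    """
    counts: Dict[Candidate, int] = {}
    for seg in nbest:
        offset = 0
        for word, pos in seg:
            end = offset + len(word)
            if word not in lexicon:
                key = (offset, end, word, pos)
                counts[key] = counts.get(key, 0) + 1
            offset = end
    return counts


# Toy input: two hand-written segmentations of the same sentence,
# standing in for the N-best outputs of a first-stage tagger.
nbest = [
    [("我", "r"), ("喜欢", "v"), ("微博", "n")],
    [("我", "r"), ("喜欢", "v"), ("微", "a"), ("博", "n")],
]
lexicon = {"我", "喜欢", "微", "博"}

candidates = candidate_new_words(nbest, lexicon)
```

In this toy case only the span covering 微博 is out of the lexicon, so it is the single candidate passed on. Taking fewer or more N-best outputs shrinks or grows the candidate set, which is the kind of complexity trade-off the abstract describes.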

Author information

Authors and Affiliations

School of Computer Science and Engineering, Dalian Nationalities University, Dalian, 116600, China
Xiao Sun & Hai-Yu Song
School of Computer Science and Engineering, Dalian University of Technology, Dalian, 116024, China
De-Gen Huang (Senior Member, CCF)
Department of Information Science and Intelligent Systems, Tokushima University, Tokushima, 7708506, Japan
Fu-Ji Ren (Member, IEEE)

Corresponding author

Correspondence to Xiao Sun.

Additional information

This work is partially supported by the Doctor Startup Fund of Liaoning Province under Grant No.20101021.

About this article

Cite this article

Sun, X., Huang, DG., Song, HY. et al. Chinese New Word Identification: A Latent Discriminative Model with Global Features. J. Comput. Sci. Technol. 26, 14–24 (2011). https://doi.org/10.1007/s11390-011-9411-z

Received: 19 June 2009
Revised: 14 December 2010
Published: 11 January 2011
Issue Date: January 2011
DOI: https://doi.org/10.1007/s11390-011-9411-z
