Abstract
Natural language processing applications, such as checking and correction of texts, machine translation, and information retrieval, usually start from words. Identifying words in Indo-European languages is a trivial task. However, this problem, known as text segmentation, has been and remains a bottleneck for various Asian languages, such as Chinese. There are two main groups of approaches to Chinese segmentation: dictionary-based approaches and statistical approaches. However, both have difficulty dealing with some Chinese text. To address these difficulties, we propose a hybrid approach to Chinese text segmentation based on the Sensitive Word Concept. Sensitive words are compound words whose syntactic category differs from those of their components. Depending on the segmentation, a sensitive word may play different roles, leading to significantly different syntactic structures. In this paper, we first explain the concept of sensitive words and their efficacy in text segmentation, then describe the hybrid approach, which combines a rule-based method and a probability-based method using the concept of sensitive words. Our experimental results show that the presented approach addresses text segmentation problems effectively.
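To make the difficulty of purely dictionary-based segmentation concrete, the following minimal sketch implements forward maximum matching, a common dictionary-based baseline. It is not the hybrid method proposed in the paper; the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are hypothetical and chosen only for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, chosen only to expose the ambiguity.
LEXICON = {"研究", "研究生", "生命", "起源"}

segments = forward_max_match("研究生命起源", LEXICON)
print(segments)  # ['研究生', '命', '起源']
```

The greedy match prefers 研究生 ("graduate student") and leaves 命 stranded, whereas the intended reading is 研究 / 生命 / 起源 ("study the origin of life"). Ambiguities of this kind are what motivate combining dictionary or rule knowledge with statistical evidence.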
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ren, F. (2001). A Hybrid Approach of Text Segmentation Based on Sensitive Word Concept for NLP. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2001. Lecture Notes in Computer Science, vol 2004. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44686-9_37
DOI: https://doi.org/10.1007/3-540-44686-9_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41687-6
Online ISBN: 978-3-540-44686-6
eBook Packages: Springer Book Archive