A Hybrid Approach of Text Segmentation Based on Sensitive Word Concept for NLP

Ren, Fuji

doi:10.1007/3-540-44686-9_37

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2004)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Natural language processing tasks, such as checking and correction of texts, machine translation, and information retrieval, usually start from words. Identifying words in Indo-European languages is a trivial task. However, this problem, known as text segmentation, has been and remains a bottleneck for various Asian languages, such as Chinese. There have been two main groups of approaches to Chinese segmentation: dictionary-based approaches and statistical approaches. However, both have difficulty dealing with some Chinese text. To address these difficulties, we propose a hybrid approach to Chinese text segmentation using the Sensitive Word Concept. Sensitive words are compound words whose syntactic category differs from those of their components. Depending on the segmentation, a sensitive word may play different roles, leading to significantly different syntactic structures. In this paper, we first explain the concept of sensitive words and their efficacy in text segmentation, then describe the hybrid approach, which combines the rule-based method and the probability-based method using the concept of sensitive words. Our experimental results showed that the presented approach is able to address the text segmentation problems effectively.
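
To illustrate why purely dictionary-based segmentation can struggle, the following is a minimal sketch of greedy maximum matching, a common dictionary-based baseline. It is not the hybrid method proposed in the paper. The function names, the `max_len` parameter, and the toy lexicon are all assumptions made for this illustration.

```python
# Illustrative sketch only: greedy dictionary-based maximum matching.
# This is NOT the paper's hybrid method; the lexicon below is a toy assumption.

TOY_LEXICON = {"研究", "研究生", "生命", "起源"}


def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, taking the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=4):
    """Scan right to left, taking the longest lexicon entry ending at each position."""
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()
    return tokens


sentence = "研究生命起源"  # "to study the origin of life"
print(forward_max_match(sentence, TOY_LEXICON))   # ['研究生', '命', '起源']
print(backward_max_match(sentence, TOY_LEXICON))  # ['研究', '生命', '起源']
```

The two scan directions disagree on the same input: the forward pass greedily takes 研究生 ("graduate student") and strands 命, while the backward pass recovers 研究 / 生命 / 起源. Disagreements of this kind are one example of the ambiguity that a dictionary alone cannot resolve.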

Author information

Authors and Affiliations

Faculty of Information Sciences, Hiroshima City University, 3-4-1, Ozuka-Higasi, Asa-Minami-Ku, 731-31, Hiroshima, Japan
Fuji Ren

Editor information

Editors and Affiliations

CIC (Centro de Investigación en Computación), IPN (Instituto Politécnico Nacional), Av. Juan Dios Bátiz s/n esq. M. Othon Mendizabal Col. Nuevo Vallejo, CP. 07738, México, Mexico
Alexander Gelbukh (Unidad Profesional “Adolfo López Mateos”)

Cite this paper

Ren, F. (2001). A Hybrid Approach of Text Segmentation Based on Sensitive Word Concept for NLP. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2001. Lecture Notes in Computer Science, vol 2004. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44686-9_37

DOI: https://doi.org/10.1007/3-540-44686-9_37
Published: 16 March 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41687-6
Online ISBN: 978-3-540-44686-6
eBook Packages: Springer Book Archive
