A Hybrid Approach of Text Segmentation Based on Sensitive Word Concept for NLP

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2004)

Abstract

Natural language processing tasks, such as checking and correction of texts, machine translation, and information retrieval, usually start from words. Identifying words in Indo-European languages is a trivial task. However, this problem, known as text segmentation, has been and still is a bottleneck for various Asian languages, such as Chinese. There have been two main groups of approaches to Chinese segmentation: dictionary-based approaches and statistical approaches. However, both have difficulty dealing with some Chinese text. To address these difficulties, we propose a hybrid approach to Chinese text segmentation that uses the Sensitive Word Concept. Sensitive words are compound words whose syntactic category is different from those of their components. Depending on the segmentation, a sensitive word may play different roles, leading to significantly different syntactic structures. In this paper, we first explain the concept of sensitive words and their efficacy in text segmentation, then describe the hybrid approach, which combines the rule-based method and the probability-based method using the concept of sensitive words. Our experimental results showed that the presented approach is able to address the text segmentation problems effectively.
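The two families of approaches named in the abstract can be illustrated with a minimal sketch. This is not the paper's method and does not implement the sensitive word concept. It only contrasts a greedy dictionary-based segmenter (forward maximum matching) with a simple probability-based one (highest-scoring split under a unigram model). The lexicon, its frequencies, the function names, and the example string are all assumptions chosen for illustration.

```python
import math

# Toy lexicon with made-up relative frequencies (assumptions for illustration only).
LEXICON = {
    "研究": 0.30,    # research
    "研究生": 0.10,  # graduate student
    "生命": 0.25,    # life
    "命": 0.05,      # fate
    "起源": 0.30,    # origin
}
MAX_LEN = max(len(w) for w in LEXICON)


def forward_max_match(text):
    """Dictionary-based: greedily take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if piece in LEXICON or size == 1:
                words.append(piece)
                i += size
                break
    return words


def best_unigram_split(text):
    """Probability-based: dynamic programming for the split with the highest
    sum of unigram log-probabilities."""
    unknown = math.log(1e-6)  # penalty for a single character not in the lexicon
    # best[k] = (best score of text[:k], start index of the last word in that split)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            piece = text[start:end]
            if piece in LEXICON:
                score = math.log(LEXICON[piece])
            elif end - start == 1:
                score = unknown
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


if __name__ == "__main__":
    sentence = "研究生命起源"
    print(forward_max_match(sentence))
    print(best_unigram_split(sentence))
```

On this toy input the greedy matcher commits to the longer word 研究生 and is left with 命, while the unigram model prefers 研究 / 生命 / 起源. Disagreements of this kind are the sort of difficulty that motivates combining rule-based and probability-based evidence.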

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ren, F. (2001). A Hybrid Approach of Text Segmentation Based on Sensitive Word Concept for NLP. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2001. Lecture Notes in Computer Science, vol 2004. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44686-9_37

  • DOI: https://doi.org/10.1007/3-540-44686-9_37

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-41687-6

  • Online ISBN: 978-3-540-44686-6

  • eBook Packages: Springer Book Archive
