Classifying Chinese Texts in Two Steps

Fan, Xinghua; Sun, Maosong; Choi, Key-sun; Zhang, Qin

doi:10.1007/11562214_27

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

This paper proposes a two-step method for Chinese text categorization (TC). In the first step, a Naïve Bayesian classifier is used to determine the fuzzy area between two categories. In the second step, a classifier with more subtle and powerful features handles the documents that fall in the fuzzy area, whose first-step classifications are considered unreliable. A preliminary experiment validated the soundness of this method. The method is then extended from two-class TC to multi-class TC. Within this two-step framework, we further try to improve the classifier by taking the dependences among features into consideration in the second step, resulting in a Causality Naïve Bayesian Classifier.
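
The two-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the fuzzy area is measured or which features the second step uses, so the sketch assumes that a document is "fuzzy" when the absolute log posterior odds of a first-step Naive Bayes classifier fall below a hypothetical `margin`, and that the second step is a second Naive Bayes model over words plus adjacent-word pairs. The toy English tokens stand in for segmented Chinese words, and all function names are illustrative.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a two-class multinomial Naive Bayes model with add-one smoothing."""
    classes = sorted(set(labels))
    vocab = {tok for doc in docs for tok in doc}
    counts = {c: Counter() for c in classes}
    n_docs = Counter(labels)
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    model = {"classes": classes, "vocab": vocab, "log_prior": {}, "log_like": {}}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        model["log_prior"][c] = math.log(n_docs[c] / len(docs))
        model["log_like"][c] = {t: math.log((counts[c][t] + 1) / total) for t in vocab}
    return model

def log_odds(model, doc):
    """Log posterior odds of the first class against the second (0 = undecided)."""
    a, b = model["classes"]
    score = model["log_prior"][a] - model["log_prior"][b]
    for tok in doc:
        if tok in model["vocab"]:  # tokens unseen in training are ignored
            score += model["log_like"][a][tok] - model["log_like"][b][tok]
    return score

def bigrams(doc):
    """Adjacent-word pairs: an assumed stand-in for the richer second-step features."""
    return [doc[i] + "_" + doc[i + 1] for i in range(len(doc) - 1)]

def two_step_classify(doc, step1, step2, margin):
    """Return (label, step), where step records which classifier made the decision."""
    a, b = step1["classes"]
    s1 = log_odds(step1, doc)
    if abs(s1) > margin:  # outside the assumed fuzzy area: trust step 1
        return (a if s1 > 0 else b), 1
    s2 = log_odds(step2, doc + bigrams(doc))  # inside it: defer to step 2
    return (a if s2 >= 0 else b), 2

# Toy training data (hypothetical).
train_docs = [
    ["stock", "market", "rises"],
    ["stock", "price", "falls"],
    ["team", "wins", "match"],
    ["team", "loses", "match"],
]
train_labels = ["finance", "finance", "sports", "sports"]

step1 = train_nb(train_docs, train_labels)
step2 = train_nb([d + bigrams(d) for d in train_docs], train_labels)

if __name__ == "__main__":
    for doc in (["stock", "market", "rises"], ["team", "stock", "market"]):
        print(doc, two_step_classify(doc, step1, step2, margin=1.0))
```

In this sketch the width of the fuzzy area is governed entirely by `margin`: a larger value sends more documents to the second step, trading extra computation for a second opinion on borderline cases.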

Author information

Authors and Affiliations

State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, China
Xinghua Fan & Maosong Sun
State Intellectual Property Office of P.R. China, Beijing, 100088, China
Xinghua Fan & Qin Zhang
Computer Science Division, Korterm, KAIST, 373-1 Guseong-dong Yuseong-gu, Daejeon, 305-701, Korea
Xinghua Fan & Key-sun Choi

Editor information

Editors and Affiliations

Center for Language Technology, Macquarie University, 2019, Sydney, NSW, Australia
Robert Dale
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Kam-Fai Wong
Institute for Infocomm Research, 21, Heng Mui Keng Terrace, 119613, Singapore
Jian Su
Language Information Sciences Research Centre, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
Oi Yee Kwong

About this paper

Cite this paper

Fan, X., Sun, M., Choi, Ks., Zhang, Q. (2005). Classifying Chinese Texts in Two Steps. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_27

DOI: https://doi.org/10.1007/11562214_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science, Computer Science (R0)
