
Classification of Chinese Texts Based on Recognition of Semantic Topics

Cognitive Computation

Abstract

Processing and understanding Chinese texts is difficult for machine learning methods, because the basic unit of Chinese text is the phrase rather than the character, and Chinese text has no natural delimiter to separate phrases. Processing large volumes of Chinese Web texts is harder still, because such texts are often short, irregular, sparse, weakly focused on a topic, and lacking in context. This poses a challenge for mining, clustering, and classifying Chinese Web texts, and the accuracy with which the real meaning of such texts is recognized is typically low. In this paper, we propose a method that recognizes stable, abstract semantic topics expressing the highly hierarchical relationships behind Chinese texts from BaiduBaike. Based on these semantic topics, a discrete distribution model is established that converts the analysis into a convex optimization problem by geometric programming. Our experiments demonstrate that the proposed approach outperforms many conventional machine learning methods, such as KNN, SVM, WIKI, CRFs, and LDA, on small training sets and on short Chinese Web texts.

Acknowledgments

This study was supported by the National Science Foundation of China (No. 61175121); the National Science Foundation of Fujian Province (No. 2013J06014); the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University (No. ZQNYX108); and the Fundamental Research Funds for the Central Universities (No. JB-ZR1217).

Author information

Corresponding authors

Correspondence to Ye-wang Chen or Ji-Xiang Du.

About this article

Cite this article

Chen, Yw., Zhou, Q., Luo, W. et al. Classification of Chinese Texts Based on Recognition of Semantic Topics. Cogn Comput 8, 114–124 (2016). https://doi.org/10.1007/s12559-015-9346-8

  • DOI: https://doi.org/10.1007/s12559-015-9346-8
