A method for Chinese text classification based on apparent semantics and latent aspects

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Existing text classification methods fail to achieve high accuracy on Chinese texts because the basic unit of Chinese text is not the individual character (hanzi) but the Chinese phrase, and Chinese text has no natural delimiter to separate phrases. The problem is worse for the large volume of Chinese Web texts, which often lack sufficient context because most of them are short, irregular and sparse. This paper proposes a new classification method for Chinese texts based on apparent semantics and latent aspects (ASLA). First, the apparent semantics of a Chinese text are extracted with BaiduBaike and used as features in place of hanzis. Second, pLSA is applied to mine the latent aspects of these apparent semantics. Third, the degree of relevance of a document to a category is calculated from the apparent semantics and latent aspects. Finally, the category of a document is determined by this degree of relevance. The proposed method handles short Chinese Web texts well with only a small amount of training data. Our experiments show that the method is promising and that it outperforms pLSA, SVM, KNN and CRF when training data is insufficient and the text is irregular.
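The pipeline outlined in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a minimal example under stated assumptions. The feature counts below stand in for apparent semantics (the BaiduBaike extraction step is not reproduced), the feature names and labels are invented toy data, and the relevance degree is assumed to be a dot product between aspect distributions because the abstract does not give the formula. The function names (`plsa`, `fold_in`, `category_profiles`, `classify`) are hypothetical.

```python
import random


def _normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, n_topics, n_iter=200, seed=0):
    """Fit pLSA by EM; returns (P(z|d) per document, P(w|z) per aspect)."""
    rng = random.Random(seed)
    n_docs, n_feats = len(counts), len(counts[0])
    p_z_d = [_normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [_normalize([rng.random() for _ in range(n_feats)]) for _ in range(n_topics)]
    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_feats for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_feats):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = _normalize([p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    acc_z_d[d][z] += counts[d][w] * post[z]
                    acc_w_z[z][w] += counts[d][w] * post[z]
        # M-step: renormalise the expected counts.
        p_z_d = [_normalize(row) for row in acc_z_d]
        p_w_z = [_normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z


def fold_in(doc_counts, p_w_z, n_iter=50):
    """Estimate P(z|d) for an unseen document while keeping P(w|z) fixed."""
    n_topics = len(p_w_z)
    p_z = [1.0 / n_topics] * n_topics
    for _ in range(n_iter):
        acc = [0.0] * n_topics
        for w, n_dw in enumerate(doc_counts):
            if n_dw == 0:
                continue
            post = _normalize([p_z[z] * p_w_z[z][w] for z in range(n_topics)])
            for z in range(n_topics):
                acc[z] += n_dw * post[z]
        p_z = _normalize(acc)
    return p_z


def category_profiles(p_z_d, labels):
    """Average P(z|d) over the training documents of each category."""
    sums, sizes = {}, {}
    for dist, label in zip(p_z_d, labels):
        acc = sums.setdefault(label, [0.0] * len(dist))
        for z, value in enumerate(dist):
            acc[z] += value
        sizes[label] = sizes.get(label, 0) + 1
    return {c: [v / sizes[c] for v in acc] for c, acc in sums.items()}


def classify(doc_topics, profiles):
    """Assumed relevance degree: dot product with each category profile."""
    def relevance(category):
        return sum(a * b for a, b in zip(doc_topics, profiles[category]))
    return max(profiles, key=relevance)


# Toy data (hypothetical): columns are counts of four extracted features.
features = ["football", "match", "stock", "market"]
train_counts = [[3, 2, 0, 0], [3, 2, 0, 0], [0, 0, 2, 3], [0, 0, 2, 3]]
train_labels = ["sports", "sports", "finance", "finance"]

p_z_d, p_w_z = plsa(train_counts, n_topics=2)
profiles = category_profiles(p_z_d, train_labels)
predicted = classify(fold_in([1, 0, 0, 0], p_w_z), profiles)
print(predicted)
```

A short text with a single extracted feature is still assigned a category here because classification runs on the aspect distribution rather than on raw feature overlap, which is the kind of benefit the abstract claims for sparse Web texts.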

Acknowledgments

This work was supported by the National Science Foundation of China (No. 61175121); the National Science Foundation of Fujian Province (No. 201-3J06014); the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research of Huaqiao University (No. ZQNYX108); the Fundamental Research Funds for the Central Universities (No. JBZR-1217); and the Natural Science Foundation of Fujian Province, China (No. 2012J05117).

Author information

Corresponding author

Correspondence to Ji-Xiang Du.

About this article

Cite this article

Chen, YW., Wang, JL., Cai, YQ. et al. A method for Chinese text classification based on apparent semantics and latent aspects. J Ambient Intell Human Comput 6, 473–480 (2015). https://doi.org/10.1007/s12652-015-0257-z

  • DOI: https://doi.org/10.1007/s12652-015-0257-z
