A high-performing comprehensive learning algorithm for text classification without pre-labeled training set

  • Regular Paper
  • Knowledge and Information Systems

Abstract

In this paper, we investigate a comprehensive learning algorithm, based on incremental learning, for text classification without a pre-labeled training set. To overcome the high cost of obtaining labeled training examples, the approach modifies fuzzy partition clustering to obtain a small quantity of labeled training data, and then applies incremental learning of a Bayesian classifier. The proposed classifier thus combines a Naïve-Bayes-based incremental learning algorithm with a modified fuzzy partition clustering method. To improve efficiency, a feature reduction step is designed based on the quadratic entropy in mutual information. We perform experiments to evaluate the approach, and the results show that it is feasible and effective.
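The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes standard fuzzy c-means in place of their modified fuzzy partition clustering, a plain multinomial Naïve Bayes with Laplace smoothing as the incremental learner, and it omits the feature reduction step. The names `fuzzy_c_means`, `IncrementalNB`, `label_without_training_set`, the `seed_threshold` of 0.9, and the toy term-count vectors are all hypothetical choices made for illustration.

```python
import math


def memberships(x, centers, m=2.0):
    """Standard fuzzy c-means membership of vector x in each cluster."""
    d = [max(math.dist(x, c), 1e-12) for c in centers]  # clamp to avoid 0-division
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** p for dk in d) for di in d]


def fuzzy_c_means(docs, centers, m=2.0, iters=20):
    """Alternate membership and center updates; return final memberships."""
    for _ in range(iters):
        u = [memberships(x, centers, m) for x in docs]
        centers = [
            [sum(u[j][i] ** m * docs[j][f] for j in range(len(docs)))
             / sum(u[j][i] ** m for j in range(len(docs)))
             for f in range(len(docs[0]))]
            for i in range(len(centers))
        ]
    return [memberships(x, centers, m) for x in docs]


class IncrementalNB:
    """Multinomial Naive Bayes whose counts are updated one document at a time."""

    def __init__(self, n_classes, n_features, alpha=1.0):
        self.class_counts = [0] * n_classes
        self.feat_counts = [[0] * n_features for _ in range(n_classes)]
        self.alpha = alpha  # Laplace smoothing constant

    def update(self, x, y):
        self.class_counts[y] += 1
        for f, c in enumerate(x):
            self.feat_counts[y][f] += c

    def posteriors(self, x):
        n, k = sum(self.class_counts), len(self.class_counts)
        logp = []
        for y in range(k):
            total = sum(self.feat_counts[y]) + self.alpha * len(x)
            lp = math.log((self.class_counts[y] + self.alpha) / (n + self.alpha * k))
            lp += sum(c * math.log((self.feat_counts[y][f] + self.alpha) / total)
                      for f, c in enumerate(x))
            logp.append(lp)
        mx = max(logp)  # subtract the max for numerical stability
        z = sum(math.exp(v - mx) for v in logp)
        return [math.exp(v - mx) / z for v in logp]


def label_without_training_set(docs, init_centers, seed_threshold=0.9):
    # Stage 1: clustering yields a small, high-confidence seed set of labels.
    u = fuzzy_c_means(docs, init_centers)
    nb = IncrementalNB(len(init_centers), len(docs[0]))
    labels = [None] * len(docs)
    for j, row in enumerate(u):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= seed_threshold:
            labels[j] = best
            nb.update(docs[j], best)
    # Stage 2: absorb the remaining documents one by one, most confident first.
    pending = [j for j, lab in enumerate(labels) if lab is None]
    while pending:
        j = max(pending, key=lambda i: max(nb.posteriors(docs[i])))
        post = nb.posteriors(docs[j])
        labels[j] = max(range(len(post)), key=post.__getitem__)
        nb.update(docs[j], labels[j])
        pending.remove(j)
    return labels


# Toy term-count vectors over a four-word vocabulary (illustrative data only).
docs = [
    [3, 2, 0, 0], [4, 1, 0, 0], [2, 3, 0, 1],   # mostly terms 0 and 1
    [0, 0, 3, 2], [0, 1, 4, 2], [1, 0, 2, 3],   # mostly terms 2 and 3
    [2, 1, 1, 1], [1, 1, 2, 1],                 # mixed documents
]
labels = label_without_training_set(docs, init_centers=[docs[0], docs[3]])
print(labels)
```

In this sketch only documents whose fuzzy membership reaches the threshold seed the classifier. Every other document is labeled by the Naïve Bayes model and immediately folded back into its counts, which is one simple way to realize incremental learning from a small initial labeled set.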

Author information

Corresponding author

Correspondence to Qianhui Liang.

About this article

Cite this article

Liu, L., Liang, Q. A high-performing comprehensive learning algorithm for text classification without pre-labeled training set. Knowl Inf Syst 29, 727–738 (2011). https://doi.org/10.1007/s10115-011-0387-3

  • DOI: https://doi.org/10.1007/s10115-011-0387-3
