Abstract
In this paper, we investigate a comprehensive learning algorithm, based on incremental learning, for text classification without a pre-labeled training set. To overcome the high cost of obtaining labeled training examples, the approach modifies fuzzy partition clustering to produce a small quantity of labeled training data, and then applies incremental learning of a Bayesian classifier. The proposed classifier thus combines a Naïve-Bayes-based incremental learning algorithm with a modified fuzzy partition clustering method. To improve efficiency, a feature reduction step is designed based on quadratic entropy in mutual information. Experiments demonstrate the performance of the approach, and the results show that it is feasible and effective.
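The incremental-learning idea can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a plain multinomial Naive Bayes with Laplace smoothing, uses a hand-written seed set as a stand-in for the small labeled set that the clustering step would produce, and omits the feature reduction step entirely. The names `IncrementalNB` and `self_train` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


class IncrementalNB:
    """Multinomial Naive Bayes whose counts are updated one document at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed)
        self.class_docs = Counter()              # number of documents seen per class
        self.word_counts = defaultdict(Counter)  # per-class word frequencies
        self.vocab = set()

    def update(self, tokens, label):
        """Fold one labeled document into the model without retraining."""
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_scores(self, tokens):
        """Unnormalized log posterior of each known class for a token list."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


def self_train(model, seed, unlabeled):
    """Train on the seed set, then absorb each unlabeled document
    under the label the current model predicts for it."""
    for tokens, label in seed:
        model.update(tokens, label)
    for tokens in unlabeled:
        model.update(tokens, model.predict(tokens))
    return model


# Hypothetical seed set standing in for the output of the clustering step.
seed = [
    (["goal", "match", "team"], "sport"),
    (["stock", "market", "price"], "finance"),
]
unlabeled = [["team", "goal", "win"], ["price", "stock", "rise"]]

model = self_train(IncrementalNB(), seed, unlabeled)
print(model.predict(["win", "match"]))    # sport
print(model.predict(["market", "rise"]))  # finance
```

In this toy run the words "win" and "rise" never appear in the seed set; the model picks them up only from documents it labeled itself, which is the behavior incremental learning is meant to provide. A practical version would likely absorb only high-confidence predictions to limit error propagation.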
Cite this article
Liu, L., Liang, Q. A high-performing comprehensive learning algorithm for text classification without pre-labeled training set. Knowl Inf Syst 29, 727–738 (2011). https://doi.org/10.1007/s10115-011-0387-3
DOI: https://doi.org/10.1007/s10115-011-0387-3