A high-performing comprehensive learning algorithm for text classification without pre-labeled training set

Liu, Lizhen; Liang, Qianhui

doi:10.1007/s10115-011-0387-3

Regular Paper
Published: 30 March 2011

Volume 29, pages 727–738, (2011)

Knowledge and Information Systems

Abstract

In this paper, we investigate a comprehensive learning algorithm, based on incremental learning, for text classification without a pre-labeled training set. To overcome the high cost of obtaining labeled training examples, the approach modifies fuzzy partition clustering to produce a small quantity of labeled training data, to which incremental learning of a Bayesian classifier is then applied. The proposed classifier is thus composed of a Naïve-Bayes-based incremental learning algorithm and a modified fuzzy partition clustering method. For improved efficiency, a feature reduction method is designed based on the Quadratic Entropy in Mutual Information. We perform experiments to evaluate the approach, and the results show that it is feasible and effective.
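The incremental-learning stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a multinomial Naive Bayes model with Laplace smoothing, uses a hand-written seed set as a stand-in for the labels that the modified fuzzy partition clustering would produce, and omits the feature reduction step entirely. The names `IncrementalNaiveBayes`, `self_train`, and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated one document at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed)
        self.class_docs = Counter()              # number of documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def update(self, tokens, label):
        """Absorb one labeled document by incrementing counts; no retraining."""
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        """Return (most probable class, its posterior probability)."""
        total_docs = sum(self.class_docs.values())
        v = len(self.vocab)
        scores = {}
        for c, n_docs in self.class_docs.items():
            total_words = sum(self.word_counts[c].values())
            score = math.log(n_docs / total_docs)
            for t in tokens:
                num = self.word_counts[c][t] + self.alpha
                score += math.log(num / (total_words + self.alpha * v))
            scores[c] = score
        # Normalize in log space (log-sum-exp) to obtain posteriors.
        m = max(scores.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
        best = max(scores, key=scores.get)
        return best, math.exp(scores[best] - log_z)


def self_train(model, unlabeled, threshold=0.6):
    """Fold confidently classified documents into the model; return the rest."""
    remaining = []
    for tokens in unlabeled:
        label, confidence = model.predict(tokens)
        if confidence >= threshold:
            model.update(tokens, label)
        else:
            remaining.append(tokens)
    return remaining


# Hypothetical seed set standing in for the output of the clustering stage.
seed = [
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "match"], "sport"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
model = IncrementalNaiveBayes()
for tokens, label in seed:
    model.update(tokens, label)

unlabeled = [["team", "goal", "coach"], ["market", "stock", "loan"]]
leftover = self_train(model, unlabeled)
```

Because each update only increments counts, the cost of absorbing a new document is proportional to its length, which is the property that makes a Naive Bayes model convenient for incremental learning. The confidence threshold is an illustrative design choice; the paper may select documents for incremental learning differently.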

Author information

Authors and Affiliations

Information Engineering College, Capital Normal University, Beijing, People’s Republic of China
Lizhen Liu
HP Labs, Singapore, Singapore
Qianhui Liang

Corresponding author

Correspondence to Qianhui Liang.

About this article

Cite this article

Liu, L., Liang, Q. A high-performing comprehensive learning algorithm for text classification without pre-labeled training set. Knowl Inf Syst 29, 727–738 (2011). https://doi.org/10.1007/s10115-011-0387-3

Received: 13 December 2009
Revised: 12 January 2011
Accepted: 22 February 2011
Published: 30 March 2011
Issue Date: December 2011
DOI: https://doi.org/10.1007/s10115-011-0387-3
