A multi-classifier system for text categorization

Author:
Shubhamoy Dey

Indian Institute of Management, Prabandh Shikhar, Rau, Indore, India


RACS '11: Proceedings of the 2011 ACM Symposium on Research in Applied Computation, November 2011, Pages 325–329, https://doi.org/10.1145/2103380.2103448

Published: 02 November 2011

ABSTRACT

Text categorization, the assignment of text documents to one or more predefined categories, is one of the most intensely researched text mining tasks. The task may be subdivided into two main parts: representing the text documents in some form of numerical vector space, and applying a suitable supervised learning technique. This research focuses on the second part of the problem. The paper proposes constructing a classification model for each of the predefined categories or themes present in a corpus, using a term-frequency-based 'keyword' identification and document scoring technique. The documents misclassified by each of these category-specific classifier models are then re-classified with the help of the other models. The effectiveness of the approach is demonstrated by experiments on two publicly available BBC News corpora. Good classification accuracy is observed for both corpora. Specifically, the macro-averaged and micro-averaged F-measures of the proposed method on the evaluation dataset for the BBC Sports corpus are 94.7% and 94.3% respectively.
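
The abstract does not give the scoring formula or the re-classification rule, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's method. It assumes that keywords are the most frequent non-stopword terms of a category, that a document's score is the sum of the relative frequencies of the keywords it contains, and that a document not accepted by exactly one category model is handed to all models and assigned to the highest scorer. The names `build_model`, `score`, `classify`, the `threshold` value and the tiny stopword list are all hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "with", "and", "of"}

def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stopword tokens."""
    return [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]

def build_model(docs, top_k=3):
    """Category model: the top_k most frequent terms with relative-frequency weights."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {term: n / total for term, n in counts.most_common(top_k)}

def score(model, text):
    """Sum the keyword weights of every token in the document."""
    return sum(model.get(tok, 0.0) for tok in tokenize(text))

def classify(models, text, threshold=0.3):
    """Assign a category; fall back to the other models when no single model accepts."""
    scores = {cat: score(m, text) for cat, m in models.items()}
    accepted = [cat for cat, s in scores.items() if s >= threshold]
    if len(accepted) == 1:
        return accepted[0]  # exactly one category-specific model accepts
    # Assumed re-classification step: compare scores across all models.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no keyword matched at all

# Toy training data, invented for illustration only.
train = {
    "football": ["the striker scored a goal", "goal keeper saved the penalty goal"],
    "tennis": ["she won the set with an ace", "the ace serve won the match set"],
}
models = {cat: build_model(docs) for cat, docs in train.items()}
print(classify(models, "a late goal won the match"))
```

The whitespace tokenizer drops any token with attached punctuation, which is acceptable for a toy example but would need a proper tokenizer and stemmer on real news text.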

Index Terms

A multi-classifier system for text categorization
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published in
RACS '11: Proceedings of the 2011 ACM Symposium on Research in Applied Computation
November 2011
355 pages
ISBN:9781450310871
DOI:10.1145/2103380
General Chairs:
Rex E. Gantenbein, University of Wyoming
Tei-Wei Kuo, National Taiwan University, Taiwan
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data mining
information retrieval
text categorization
text mining
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 393 of 1,581 submissions, 25%