ABSTRACT
Text categorization, the assignment of text documents to one or more pre-defined categories, is one of the most intensely researched text mining tasks. The task may be subdivided into two main parts: the representation of the text documents in some form of numerical vector space, and the application of a suitable supervised learning technique. This research focuses on the second part of the problem. The paper proposes constructing a classification model for each of the pre-defined categories or themes present in a corpus, using a term-frequency based 'keyword' identification and document scoring technique. The documents misclassified by each of these category-specific classifier models are then re-classified with the help of the other models. The effectiveness of the approach is demonstrated by experiments on two publicly available BBC News corpora, and good classification accuracy is observed on both. Specifically, the macro-averaged and micro-averaged F-measures of the proposed method on the evaluation dataset for the BBC Sports corpus are 94.7% and 94.3% respectively.
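For readers unfamiliar with the two averages reported above, the following is a minimal sketch of how macro-averaged and micro-averaged F-measures are commonly computed. It assumes single-label classification (one category per document); the function name `f_measures` and the toy labels are hypothetical illustrations, not taken from the paper or its corpora.

```python
from collections import Counter


def f_measures(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions.

    Assumption: each document has exactly one true and one predicted category.
    """
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one document is required")

    # Per-category true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # wrongly assigned to the predicted category
            fn[truth] += 1  # missed for the true category

    def f1(t, p, n):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-category F1 scores (every category counts equally).
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all categories, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data for illustration only (not the paper's results).
y_true = ["football", "football", "football", "tennis"]
y_pred = ["football", "football", "tennis", "tennis"]
macro, micro = f_measures(y_true, y_pred)
print(f"{macro:.3f} {micro:.3f}")  # prints "0.733 0.750"
```

Macro-averaging weights every category equally, so small categories influence it as much as large ones, while micro-averaging weights every document equally. In the single-label setting assumed here, the micro-averaged F-measure coincides with plain accuracy.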
Index Terms
- A multi-classifier system for text categorization