DOI: 10.1145/2103380.2103448
research-article

A multi-classifier system for text categorization

Published: 02 November 2011

ABSTRACT

Text categorization, the assignment of text documents to one or more pre-defined categories, is one of the most intensely researched text mining tasks. The task may be subdivided into two main parts: the representation of the text documents in some form of numerical vector space, and the application of a suitable supervised learning technique. This research focuses on the second part of the problem. The paper proposes constructing a classification model for each of the pre-defined categories or themes present in a corpus, using a term-frequency based 'keyword' identification and document scoring technique. The documents misclassified by each of these category-specific classifier models are then re-classified with the help of the other models. The effectiveness of the approach is demonstrated by experiments on two publicly available BBC News corpora, and good classification accuracy is observed on both. Specifically, the macro-averaged and micro-averaged F-measures of the proposed method on the evaluation dataset for the BBC Sports corpus are 94.7% and 94.3% respectively.
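The abstract reports both macro-averaged and micro-averaged F-measures without defining them. The following is a minimal sketch of the standard definitions for single-label classification, not the authors' evaluation code: the macro average computes the F-measure per category and averages the results, while the micro average pools the counts over all categories first. The function names and the category labels "A", "B" and "C" are hypothetical and the toy data is made up for illustration.

```python
from collections import Counter


def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) from raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f(y_true, y_pred):
    """Return (macro_f, micro_f) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted category gains a false positive
            fn[truth] += 1  # true category gains a false negative
    categories = sorted(set(y_true) | set(y_pred))
    # Macro: every category counts equally, however few documents it has.
    macro = sum(f_measure(tp[c], fp[c], fn[c]) for c in categories) / len(categories)
    # Micro: pool the counts, so large categories dominate the score.
    micro = f_measure(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels, not taken from the BBC corpora.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "B", "B"]
macro, micro = macro_micro_f(y_true, y_pred)
print(f"macro F = {macro:.3f}, micro F = {micro:.3f}")
```

In this toy case the small category "C" is never predicted, which pulls the macro average well below the micro average. When the two figures are close, as in the values reported above, performance is likely similar across categories.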

    • Published in

      RACS '11: Proceedings of the 2011 ACM Symposium on Research in Applied Computation
      November 2011
      355 pages
      ISBN: 9781450310871
      DOI: 10.1145/2103380

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 393 of 1,581 submissions, 25%