Text Categorization Using an Ensemble Classifier Based on a Mean Co-association Matrix

Moreira-Matias, Luís; Mendes-Moreira, João; Gama, João; Brazdil, Pavel

doi:10.1007/978-3-642-31537-4_41

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7376)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

Text Categorization (TC) has attracted the attention of the research community over the last decade. Algorithms such as Support Vector Machines, Naïve Bayes and k-Nearest Neighbors have been used with good performance, as confirmed by several comparative studies. Recently, several ensemble classifiers were also introduced in TC. However, many of those can only provide a category for a given new sample. In this paper, we instead propose a methodology, MECAC, to build an ensemble of classifiers that has two advantages over other ensemble methods: 1) it can be run using parallel computing, saving processing time, and 2) it can extract important statistics from the obtained clusters. It uses the mean co-association matrix to solve binary TC problems. Our experiments revealed that our framework performed, on average, 2.04% better than the best individual classifier on the tested datasets. These results were statistically validated at a significance level of 0.05 using the Friedman Test.
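
The abstract does not spell out how the mean co-association matrix is built, so the following is only a minimal sketch of the general idea, not the authors' MECAC procedure. It assumes that each base classifier's binary predictions split the documents into two groups, that a co-association entry is 1 when two documents receive the same label and 0 otherwise, and that the ensemble matrix is the plain average over classifiers. The function names and the toy predictions are hypothetical.

```python
from typing import List, Sequence


def co_association(labels: Sequence[int]) -> List[List[float]]:
    """Return the n x n matrix holding 1.0 where two samples share a label."""
    n = len(labels)
    return [
        [1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def mean_co_association(predictions: Sequence[Sequence[int]]) -> List[List[float]]:
    """Average the co-association matrices induced by several classifiers."""
    if not predictions:
        raise ValueError("at least one prediction vector is required")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all prediction vectors must have the same length")

    total = [[0.0] * n for _ in range(n)]
    for labels in predictions:
        matrix = co_association(labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += matrix[i][j]

    k = len(predictions)
    return [[value / k for value in row] for row in total]


# Hypothetical binary predictions of three base classifiers on four documents.
preds = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
]
M = mean_co_association(preds)
```

Each entry of `M` is the fraction of base classifiers that put the two documents in the same category. Because every per-classifier matrix is computed independently of the others, this step could be distributed across workers, which is consistent with the parallelism claim in the abstract, although the paper's actual parallelisation scheme is not described here.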

Author information

Authors and Affiliations

Departamento de Engenharia Informática, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465, Porto, Portugal
Luís Moreira-Matias & João Mendes-Moreira
LIAAD-INESC Porto L.A., Rua de Ceuta, 118, 6º, 4050-190, Porto, Portugal
Luís Moreira-Matias, João Mendes-Moreira, João Gama & Pavel Brazdil
Faculdade de Economia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465, Porto, Portugal
João Gama & Pavel Brazdil

Editor information

Editors and Affiliations

Institute of Computer Vision and Applied Computer Sciences, IBaI, Kohlenstraße 2, 04107, Leipzig, Germany
Petra Perner

About this paper

Cite this paper

Moreira-Matias, L., Mendes-Moreira, J., Gama, J., Brazdil, P. (2012). Text Categorization Using an Ensemble Classifier Based on a Mean Co-association Matrix. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2012. Lecture Notes in Computer Science, vol 7376. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31537-4_41

DOI: https://doi.org/10.1007/978-3-642-31537-4_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31536-7
Online ISBN: 978-3-642-31537-4
eBook Packages: Computer Science, Computer Science (R0)
