Abstract
Document classification has been widely studied: some studies compared feature selection techniques or feature space transformations, while others compared the performance of different algorithms. Recently, following the rising interest in the Support Vector Machine (SVM), several studies reported that the SVM outperforms other classification algorithms. Should we then stop considering other classification algorithms and always opt for SVM?
We decided to investigate this question and compared SVM to kNN and naive Bayes on binary classification tasks. An important issue is to compare optimized versions of these algorithms, which is what we have done. Our results show that all three classifiers achieved comparable performance on most problems. One surprising result is that SVM was not a clear winner, despite good overall performance. With suitable preprocessing, kNN continues to achieve very good results and scales well with the number of documents, which is not the case for SVM. Naive Bayes also achieved good performance.
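The comparison described above can be sketched with off-the-shelf tools. The snippet below trains an SVM, a kNN classifier, and a naive Bayes classifier on the same TF-IDF representation of a toy two-class corpus; the corpus, the scikit-learn estimators, and the specific hyperparameters are illustrative assumptions, not the authors' actual experimental pipeline.

```python
# A minimal sketch of the three-way comparison: SVM vs. kNN vs. naive Bayes
# on a binary text classification task. The toy corpus and scikit-learn
# estimators are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Tiny two-class corpus: "sports" (label 0) vs. "computing" (label 1).
train_docs = [
    "the team won the football match",
    "players scored two goals in the game",
    "the coach praised the striker after the match",
    "the cpu executes machine instructions",
    "the compiler optimizes the source code",
    "memory allocation failed in the kernel driver",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_docs = [
    "the goalkeeper saved the penalty in the game",
    "the kernel scheduler allocates cpu time",
]

# TF-IDF weighting is one example of the "suitable preprocessing"
# that pairs well with kNN on text.
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

def predictions(clf):
    """Fit a classifier on the training set and predict the test labels."""
    return clf.fit(X_train, train_labels).predict(X_test).tolist()

for name, clf in [
    ("SVM", LinearSVC()),
    ("kNN", KNeighborsClassifier(n_neighbors=3)),
    ("naive Bayes", MultinomialNB()),
]:
    print(name, predictions(clf))
```

In a real replication, each classifier would also be tuned (kernel and C for the SVM, k and the distance metric for kNN, feature selection for naive Bayes) before comparing scores, since the abstract stresses that only optimized versions were compared.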
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Colas, F., Brazdil, P. (2006). On the Behavior of SVM and Some Older Algorithms in Binary Text Classification Tasks. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2006. Lecture Notes in Computer Science, vol 4188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846406_6
Print ISBN: 978-3-540-39090-9
Online ISBN: 978-3-540-39091-6