Abstract
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
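To make the general idea concrete, the following is a minimal sketch of one common form of compression-based classification. It is not necessarily any of the methods evaluated in this paper. It assigns a document to the class whose training text, once the document is appended to it, grows least in compressed size. The choice of zlib (DEFLATE) as the compressor, the function names, and the toy training data are all assumptions made for illustration.

```python
import zlib
from typing import Dict

# Hypothetical toy training data: one concatenated corpus per class.
TRAINING: Dict[str, str] = {
    "sports": (
        "The team won the match after scoring two goals in the second half. "
        "The striker scored a goal and the goalkeeper saved a penalty."
    ),
    "finance": (
        "The bank raised interest rates and the stock market fell sharply. "
        "Investors sold shares as bond yields climbed."
    ),
}


def compressed_size(text: str) -> int:
    """Return the number of bytes zlib needs for the UTF-8 encoded text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: Dict[str, str]) -> str:
    """Pick the class whose corpus 'explains' the document most cheaply.

    The cost of a class is the increase in compressed size when the
    document is appended to that class's training corpus.
    """
    def extra_bytes(label: str) -> int:
        corpus = training[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


if __name__ == "__main__":
    print(classify("The striker scored a goal and the goalkeeper saved a penalty.", TRAINING))
```

Because the compressor works on raw characters, no tokenization or other preprocessing is needed, and repeated strings that span punctuation or several words can contribute to the match. Recompressing each class corpus for every document also shows why running time is a concern for this family of methods.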
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Marton, Y., Wu, N., Hellerstein, L. (2005). On Compression-Based Text Classification. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_22
DOI: https://doi.org/10.1007/978-3-540-31865-1_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25295-5
Online ISBN: 978-3-540-31865-1
eBook Packages: Computer Science, Computer Science (R0)