Elsevier

Knowledge-Based Systems

Volume 118, 15 February 2017, Pages 152-164

MDLText: An efficient and lightweight text classifier

https://doi.org/10.1016/j.knosys.2016.11.018

Highlights

  • A novel multinomial text classification method based on the minimum description length principle is proposed.

  • The proposed approach is efficient, lightweight, scalable, multiclass, and sufficiently robust to prevent overfitting.

  • Experiments were performed using forty-five text corpora, in both batch learning and online learning contexts.

  • The results indicate that our proposed approach outperformed well-known benchmark text classification techniques.

Abstract

In many areas, the volume of text information is increasing rapidly, thereby demanding efficient text classification approaches. Several methods are available at present, but most exhibit declining performance as the dimensionality of the problem increases, or they incur high computational costs for training, which limits their application in real scenarios. Thus, it is necessary to develop a method that can process high-dimensional data in a rapid manner. In this study, we propose MDLText, an efficient, lightweight, scalable, and fast multinomial text classifier based on the minimum description length principle. MDLText offers fast incremental learning and is sufficiently robust to prevent overfitting, which are desirable features in real-world applications, large-scale problems, and online scenarios. Our experiments were carefully designed to ensure that we obtained statistically sound results, which demonstrated that the proposed approach achieves a good balance between predictive power and computational efficiency.

Introduction

The amount of digital information stored in text format is increasing every day. Books, newspapers, magazines, and many other examples of traditional printed resources are now available as digital media. Even preliminary documents must be scanned to facilitate their control, safety, and access. In addition, many existing communication services, such as email, online instant messengers, social networks, and comments on websites and blogs, generate textual information that needs to be analyzed to ensure its security, as well as to improve content organization in order to provide a satisfactory user experience.

Efficient text classification approaches are needed increasingly to deal with the rapidly growing volume of user-generated content and the vast amount of digital information that is currently available. Several machine learning-based methods have been used, such as the established support vector machines (SVM) [9], [14], naïve Bayes [44], decision trees (DT) [13], [50], and k-nearest neighbors (KNN) [15].

In real-world applications, the dimensionality of the data is usually huge, which degrades the performance of most well-known machine learning algorithms [68]. High-dimensional data can lead to the so-called curse of dimensionality, and training classifiers on such data without incurring high computational costs can be difficult.

A text classification technique should ideally be trained rapidly, such as naïve Bayes, and sufficiently robust to prevent overfitting, such as SVM. However, the performance of naïve Bayes classifiers unsurprisingly degrades when applied to high-dimensional data [4]. Moreover, SVMs often require costly and time-consuming training, especially in real-world applications where the number of classes is large and the feature space is huge [68].

Some well-known machine learning methods (e.g., SVM with traditional kernel functions) cannot be employed to deal with real-world text classification problems because they require that all of the training examples be stored in memory and presented simultaneously, in a process known as batch or offline learning. Thus, various online machine learning approaches have been proposed. In general, these methods are simpler and faster than batch learning-based alternatives because they process and learn from one example at a time. However, if batch learning-based methods can be applied, they usually perform better than online learning techniques [16].

To address this problem, we propose MDLText, a novel multinomial text classification approach based on the minimum description length (MDL) principle [53], [54]. This method has the inherent ability to prevent overfitting because it selects a model that fits the data well while naturally favoring less complex models. Other classifiers include mechanisms to prevent overfitting, but MDLText does so naturally and hence avoids employing additional regularization schemes. Furthermore, it has a low computational cost, even for problems comprising a large volume of documents and a high-dimensional feature space. It can also be applied directly to multiclass problems, avoiding decomposition approaches such as one-against-one and its variants [29]. Moreover, MDLText can be used in online and dynamic scenarios because it allows incremental learning.

To support our claims, we performed a comprehensive performance evaluation to determine the efficiency and effectiveness of the proposed text classifier in batch and online learning scenarios, where we employed 45 well-known text corpora from various domains. Moreover, we compared our method with benchmark machine learning algorithms, which were trained with text documents represented by three different established term weighting schemes: binary, term-frequency (TF), and term frequency-inverse document frequency (TF-IDF) [60], [61].
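As a rough illustration of these three weighting schemes (a toy sketch, not the exact formulations of [60], [61]; the corpus, the variable names, and the particular log(N/df) variant of IDF are illustrative choices):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens (invented example data).
docs = [["cheap", "pills", "cheap"],
        ["meeting", "agenda"],
        ["cheap", "meeting"]]
N = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(t for d in docs for t in set(d))

def weights(doc):
    tf = Counter(doc)
    binary = {t: 1 for t in tf}                                  # binary: presence/absence
    tfidf = {t: f * math.log(N / df[t]) for t, f in tf.items()}  # TF-IDF: tf * log(N/df)
    return binary, dict(tf), tfidf

b, tf, ti = weights(docs[0])
print(tf["cheap"])  # 2: "cheap" occurs twice in the first document
```

Note that a term occurring in every document (df = N) receives a TF-IDF weight of zero, which is what makes the scheme discount uninformative terms.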

The remainder of this paper is organized as follows. In Section 2, we briefly describe the related work available in literature. Basic concepts related to the MDL principle are presented in Section 3. In Section 4, we explain the proposed text classification method. The experimental settings are described in Section 5. Section 6 presents the results obtained from the batch and online learning tasks. Finally, we give our main conclusions and suggestions for future research in Section 7.

Section snippets

Related work

Machine learning methods have been used to solve several established text classification problems in a wide range of scenarios, such as sentiment analysis [47], news articles [25], [33], [57], [69], scientific documents [33], and emails [4], [57], [65]. Recently, some researchers have also applied machine learning methods to classify short message service (SMS) messages [1], [3], blog spam [5], and social media [2].

The most widely used model for text classification is the vector space model (VSM) [61]

The MDL principle

The MDL principle states that if we need to choose between two or more models to fit some data, we should select the one with the smallest description length [53], [54]. This means that less complex models are preferable [7], [24]. This idea is a formalization of the principle of parsimony, which is also known as Occam's razor [18].
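A minimal numeric sketch of this selection rule (the model costs and data probabilities below are invented for illustration): with a two-part code, the description length is the cost of encoding the model plus the cost of encoding the data given the model, and the candidate with the smallest total is selected.

```python
import math

# Two-part MDL: total = L(M) + L(X|M), both in bits, with L(X|M) = -log2 P(X|M).
def description_length(model_bits, data_probability):
    return model_bits + -math.log2(data_probability)

# A complex model that fits the data slightly better can still lose to a
# simpler model once the cost of encoding the model itself is counted.
candidates = {
    "simple":  description_length(model_bits=4,  data_probability=0.10),
    "complex": description_length(model_bits=32, data_probability=0.20),
}
best = min(candidates, key=candidates.get)
print(best)  # simple
```

This is how the principle penalizes complexity: the complex model pays for its better fit with a far longer model description.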

The MDL is based on Kolmogorov complexity, which can briefly be defined as the smallest program size that describes data represented by a binary sequence [40]. The lower

Mathematical basis of MDLText

Given an unlabeled text document d, MDLText uses the main equation of the MDL principle (Eq. (1)) to predict the class of the document. The set of potential classes {c1, c2, …, c|C|} represents the set of potential models M, while d represents the data X. Therefore, d receives the label j corresponding to the class cj with the minimum overall description length for d: c(d) = argmin_{cj ∈ C} L(d|cj).
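In code, this decision rule is simply an arg-min over the candidate classes; the function L below is a hypothetical placeholder for the per-class description length, whose exact form is defined by MDLText in Section 4 and not reproduced here.

```python
def predict(d, classes, L):
    """Assign document d to the class with the smallest description length.

    L(d, c) must return the description length of d under class c; its
    exact form is defined by MDLText and is only stubbed out here.
    """
    return min(classes, key=lambda c: L(d, c))

# Toy usage with made-up lengths (in bits): the class under which the
# document is cheaper to describe wins.
lengths = {("doc1", "spam"): 42.0, ("doc1", "ham"): 37.5}
print(predict("doc1", ["spam", "ham"], lambda d, c: lengths[(d, c)]))  # ham
```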

We have ignored the description length of the potential classes (models) because, as we stated in

Experimental setup

We carefully performed a comprehensive evaluation to assess the performance of the proposed MDLText, where we used a large number of well-known and public benchmark corpora, comprising 45 datasets [57], [72]. Table 1 summarizes the main statistics for each of the datasets used in this study. The number of samples (text documents) is represented by |D|, |d| corresponds to the number of features (vocabulary), and |C| is the number of classes. The last column presents the amount of documents

Experiments and results

In the following, we describe the results obtained in our experiments using batch and online learning tasks.

Conclusions and future work

In this study, we proposed a novel multinomial text classification technique based on the MDL principle. The proposed method has desirable features, such as incremental learning, low computational cost, and sufficient robustness to prevent overfitting, thereby leading to high predictive power and efficiency, which allow its application to real-world, online, and large-scale text classification problems.

To assess its performance, we conducted a comprehensive evaluation using 45 large, real,

Acknowledgment

The authors are grateful for financial support from the Brazilian agencies FAPESP, Capes, and CNPq (grant 141089/2013-0).

References (73)

  • L. Zhang et al.

    Two feature weighting approaches for naive Bayes text classifiers

    Knowl. Based Syst.

    (2016)
  • T.C. Alberto et al.

    TubeSpam: Comment Spam Filtering on YouTube

    Proceedings of the 14th International Conference on Machine Learning and Applications (ICMLA’15)

    (Dec. 2015)
  • T.A. Almeida et al.

    Spam filtering: how the dimensionality reduction affects the accuracy of naive Bayes classifiers

    J. Internet Serv. Appl.

    (Feb. 2011)
  • M. Alsaleh et al.

    Combating Comment Spam with Machine Learning Approaches

    Proceedings of the 14th International Conference on Machine Learning and Applications (ICMLA’15)

    (Dec. 2015)
  • F. Assis et al.

    Exponential Differential Document Count – a Feature Selection Factor for Improving Bayesian Filters Accuracy

    Proceedings of the 2006 MIT Spam Conference (SP’06)

    (2006)
  • A. Barron et al.

    The minimum description length principle in coding and modeling

    IEEE Trans. Inf. Theory

    (Oct. 1998)
  • N. Begum et al.

    Towards a Minimum Description Length Based Stopping Criterion for Semi-supervised Time Series Classification

    Proceedings of the 14th IEEE International Conference on Information Reuse and Integration (IRI’13)

    (Aug. 2013)
  • B.E. Boser et al.

    A Training Algorithm for Optimal Margin Classifiers

    Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT’92)

    (Jul. 1992)
  • A. Bosin et al.

    High-dimensional Micro-array Data Classification Using Minimum Description Length and Domain Expert Knowledge

    Advances in Applied Artificial Intelligence, Vol. 4031 of Lecture Notes in Computer Science

    (2006)
  • L. Bottou

    Large-scale Machine Learning with Stochastic Gradient Descent

  • L. Breiman

    Random forests

    Mach. Learn.

    (Oct. 2001)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • C. Cortes et al.

    Support-vector networks

    Mach. Learn.

    (Sep. 1995)
  • T.M. Cover et al.

    Nearest neighbor pattern classification

    IEEE Trans. Inf. Theory

    (Jan. 1967)
  • K. Crammer et al.

    Confidence-weighted linear classification for text categorization

    J. Mach. Learn. Res.

    (Jun. 2012)
  • J. Demsar

    Statistical comparisons of classifiers over multiple data sets

    J. Mach. Learn. Res.

    (Dec. 2006)
  • P. Domingos

    The role of Occam's razor in knowledge discovery

    Data Min. Knowl. Discov.

    (1999)
  • R.O. Duda et al.

    Pattern Classification

    (2000)
  • H.J. Escalante et al.

    Term-weighting learning via genetic programming for text classification

    Knowl. Based Syst.

    (Jul. 2015)
  • N. Friedman et al.

    Bayesian network classifiers

    Mach. Learn.

    (Nov. 1997)
  • P.D. Grünwald

    A Tutorial Introduction to the Minimum Description Length Principle

    Advances in Minimum Description Length: Theory and Applications

    (2005)
  • P.D. Grünwald et al.

    Advances in Minimum Description Length: Theory and Applications

    (2005)
  • M. Gutlein et al.

    Large-scale Attribute Selection Using Wrappers

    Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM’09)

    (Mar. 2009)
  • E.-H.S. Han et al.

    Text Categorization Using Weight Adjusted K-nearest Neighbor Classification

    Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’01)

    (Apr. 2001)
  • C.-J. Hsieh et al.

    A Dual Coordinate Descent Method for Large-scale Linear SVM

    Proceedings of the 25th International Conference on Machine Learning (ICML’08)

    (Jun. 2008)
  • C. Hsu et al.

    A practical guide to support vector classification

    Tech. rep.

    (2003)