MDLText: An efficient and lightweight text classifier
Introduction
The amount of digital information stored in text format increases every day. Books, newspapers, magazines, and many other traditional printed resources are now available as digital media, and even paper documents are commonly scanned to facilitate their control, security, and access. In addition, many communication services, such as email, online instant messengers, social networks, and comments on websites and blogs, generate textual information that needs to be analyzed both to ensure security and to improve content organization, thereby providing a satisfactory user experience.
Efficient text classification approaches are increasingly needed to cope with the rapidly growing volume of user-generated content and the vast amount of digital information currently available. Several machine learning methods have been employed for this task, such as the established support vector machines (SVM) [9], [14], naïve Bayes [44], decision trees (DT) [13], [50], and k-nearest neighbors (KNN) [15].
In real-world applications, the dimensionality of the data is usually huge, which degrades the performance of most well-known machine learning algorithms [68]. High-dimensional data can lead to the so-called curse of dimensionality, and classifiers may be difficult to train on such data without incurring high computational costs.
A text classification technique should ideally be trained rapidly, like naïve Bayes, and be sufficiently robust to prevent overfitting, like SVM. However, the performance of naïve Bayes classifiers usually degrades when they are applied to high-dimensional data [4]. Moreover, SVMs often require costly and time-consuming training, especially in real-world applications where the number of classes is large and the feature space is huge [68].
Some well-known machine learning methods (e.g., SVM with traditional kernel functions) cannot be employed in real-world text classification problems because they require all of the examples to be stored in memory or presented simultaneously, in a process known as batch (offline) learning. Thus, various online machine learning approaches have been proposed. In general, these methods are simpler and faster than batch alternatives because they process and learn from one example at a time. However, when batch methods can be applied, they usually perform better than online techniques [16].
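To illustrate the online setting, the following is a generic sketch (not the method proposed in this paper) of a multinomial naïve Bayes classifier that learns from one labeled example at a time and never needs to store the full corpus in memory:

```python
from collections import defaultdict
import math

class IncrementalNB:
    """Toy incrementally trainable multinomial naive Bayes."""

    def __init__(self):
        self.class_docs = defaultdict(int)                    # documents per class
        self.token_counts = defaultdict(lambda: defaultdict(int))
        self.class_tokens = defaultdict(int)                  # total tokens per class
        self.vocab = set()
        self.total_docs = 0

    def learn_one(self, tokens, label):
        """Update counts from a single (document, label) pair and discard it."""
        self.total_docs += 1
        self.class_docs[label] += 1
        for t in tokens:
            self.token_counts[label][t] += 1
            self.class_tokens[label] += 1
            self.vocab.add(t)

    def predict(self, tokens):
        best, best_score = None, float("-inf")
        v = len(self.vocab)
        for c in self.class_docs:
            score = math.log(self.class_docs[c] / self.total_docs)
            for t in tokens:
                # Laplace smoothing keeps unseen tokens from zeroing the score
                score += math.log((self.token_counts[c][t] + 1) /
                                  (self.class_tokens[c] + v))
            if score > best_score:
                best, best_score = c, score
        return best

clf = IncrementalNB()
clf.learn_one(["cheap", "pills", "buy"], "spam")
clf.learn_one(["meeting", "agenda", "monday"], "ham")
print(clf.predict(["buy", "pills"]))  # -> "spam"
```

Because each example is consumed and discarded, memory grows only with the vocabulary and class counts, which is the property that makes online learners attractive for the large-scale settings discussed above.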
To address this problem, we propose MDLText, a novel multinomial text classification approach based on the minimum description length (MDL) principle [53], [54]. The method has an inherent ability to prevent overfitting because it selects a model that fits the data well while naturally favoring less complex models. Other classifiers also include mechanisms for preventing overfitting, but MDLText does so naturally and therefore avoids additional regularization schemes. Furthermore, it has a low computational cost, even for problems comprising a large volume of documents and a high-dimensional feature space. It can also be applied directly to multi-class problems, avoiding decomposition strategies such as one-against-one and its variants [29]. Moreover, MDLText supports incremental learning, so it can be used in online and dynamic scenarios.
To support our claims, we performed a comprehensive evaluation of the efficiency and effectiveness of the proposed text classifier in batch and online learning scenarios, using 45 well-known text corpora from various domains. Moreover, we compared our method with benchmark machine learning algorithms trained on text documents represented by three established term weighting schemes: binary, term frequency (TF), and term frequency-inverse document frequency (TF-IDF) [60], [61].
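The three term-weighting schemes can be sketched as follows. This uses a common unsmoothed IDF variant, log(N/df), which may differ from the exact formulas adopted in [60], [61]:

```python
import math

# Tiny illustrative corpus of tokenized documents
docs = [["spam", "offer", "offer"], ["meeting", "offer"], ["meeting", "notes"]]
vocab = sorted({t for d in docs for t in d})
n = len(docs)
df = {t: sum(t in d for d in docs) for t in vocab}  # document frequency

def weights(doc, scheme):
    """Return the weight of every vocabulary term for one document."""
    tf = {t: doc.count(t) for t in vocab}
    if scheme == "binary":
        return {t: int(tf[t] > 0) for t in vocab}     # presence/absence
    if scheme == "tf":
        return tf                                      # raw term frequency
    if scheme == "tfidf":
        # rare terms (low df) receive higher weight
        return {t: tf[t] * math.log(n / df[t]) for t in vocab}
    raise ValueError(scheme)

print(weights(docs[0], "tfidf"))
```

For the first document, "offer" appears twice but occurs in two of the three documents, so its TF-IDF weight is 2·log(3/2), while the rarer "spam" receives 1·log(3/1).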
The remainder of this paper is organized as follows. In Section 2, we briefly describe the related work available in the literature. Basic concepts related to the MDL principle are presented in Section 3. In Section 4, we explain the proposed text classification method. The experimental settings are described in Section 5. Section 6 presents the results obtained from the batch and online learning tasks. Finally, we give our main conclusions and suggestions for future research in Section 7.
Related work
Machine learning methods have been used to solve established text classification problems in a wide range of scenarios, such as sentiment analysis [47], news articles [25], [33], [57], [69], scientific documents [33], and emails [4], [57], [65]. Recently, some studies have also applied machine learning methods to classify SMS messages [1], [3], blog spam [5], and social media content [2].
The most widely used model for text classification is the vector space model (VSM) [61]
The MDL principle
The MDL principle states that if we need to choose between two or more models to fit some data, we should select the one with the smallest description length [53], [54]. This means that less complex models are preferable [7], [24]. This idea is a formalization of the principle of parsimony, also known as Occam's razor [18].
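As a toy illustration of the two-part description length L(M) + L(X|M) (our own example, not taken from the paper), consider choosing between a fair-coin model and a biased-coin model for a bit string; the biased model fits the data better but must spend extra bits to describe its parameter:

```python
import math

def code_length(bits, p, param_bits):
    """Two-part cost L(M) + L(X|M): parameter bits plus -log2 likelihood."""
    data_bits = sum(-math.log2(p if b else 1 - p) for b in bits)
    return param_bits + data_bits

x = [1] * 18 + [0] * 2                        # a strongly biased bit string
fair   = code_length(x, 0.5, param_bits=0)    # fair coin: nothing to transmit
biased = code_length(x, 0.9, param_bits=8)    # ~8 bits to describe p = 0.9
# MDL selects the model with the smaller total description length
print("fair:", fair, "bits; biased:", biased, "bits")
```

Here the fair coin costs exactly 20 bits, while the biased model costs about 17.4 bits including its parameter, so MDL prefers it; on a near-balanced string the parameter cost would not pay off and the simpler model would win.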
The MDL principle is based on the Kolmogorov complexity, which can be briefly defined as the size of the smallest program that describes data represented as a binary sequence [40]. The lower
Mathematical basis of MDLText
Given an unlabeled text document d, MDLText uses the main equation of the MDL principle (Eq. (1)) to predict the class of the document. The set of potential classes represents the set of potential models M, while d represents the data X. Therefore, d receives the label j, which corresponds to the class cj with the minimum overall description length with respect to d:

  j = arg min_{cj ∈ C} [ L(cj) + L(d | cj) ]
We have ignored the description length of the potential classes (models) because, as we stated in
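The decision rule can be sketched as follows. This is only a schematic: the per-class code-length estimate below (a smoothed -log2 probability per token, over hypothetical training counts) is a stand-in, since the paper defines its own description-length measure for text classification:

```python
import math
from collections import Counter

# Hypothetical per-class token counts accumulated during training
class_token_counts = {
    "spam": Counter({"buy": 4, "cheap": 3, "offer": 3}),
    "ham":  Counter({"meeting": 5, "report": 4, "offer": 1}),
}
VOCAB_SIZE = 6  # distinct terms across both classes

def description_length(tokens, counts):
    """Stand-in code length: -log2 of a Laplace-smoothed token probability."""
    total = sum(counts.values())
    return sum(-math.log2((counts[t] + 1) / (total + VOCAB_SIZE))
               for t in tokens)

def classify(tokens):
    # d receives the label of the class with the minimum description length
    return min(class_token_counts,
               key=lambda c: description_length(tokens, class_token_counts[c]))

print(classify(["buy", "cheap", "offer"]))  # -> "spam"
```

Tokens that are frequent in a class are cheap to encode under that class's model, so the class that "compresses" the document best wins the arg min.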
Experimental setup
We carefully performed a comprehensive evaluation to assess the performance of the proposed MDLText, using a large number of well-known public benchmark corpora comprising 45 datasets [57], [72]. Table 1 summarizes the main statistics for each dataset used in this study: the number of samples (text documents), the number of features (vocabulary size), and the number of classes. The last column presents the number of documents
Experiments and results
In the following, we describe the results obtained in our experiments using batch and online learning tasks.
Conclusions and future work
In this study, we proposed a novel multinomial text classification technique based on the MDL principle. The proposed method has desirable features, such as incremental learning, low computational cost, and sufficient robustness to prevent overfitting, thereby leading to high predictive power and efficiency, which allow its application to real-world, online, and large-scale text classification problems.
To assess its performance, we conducted a comprehensive evaluation using 45 large, real,
Acknowledgment
The authors are grateful for financial support from the Brazilian agencies FAPESP, Capes, and CNPq (grant 141089/2013-0).
References (73)
- Semi-supervised learning using frequent itemset and ensemble learning for SMS classification. Expert Syst. Appl. (Feb. 2015)
- Text normalization and semantic indexing to enhance instant messaging and SMS spam filtering. Knowl. Based Syst. (May 2016)
- Large margin classification using the perceptron algorithm. Mach. Learn. (Dec. 1999)
- Discriminatively weighted naive Bayes and its application in text classification. Int. J. Artif. Intell. Tools (2012)
- Efficient classification method for large dataset. Proceedings of the 2006 International Conference on Machine Learning and Cybernetics (ICMLC'06) (Aug. 2006)
- Bag of tricks for efficient text classification (Jul. 2016)
- Benchmarking text collections for classification and clustering tasks. Tech. rep. 395 (2013)
- Machine learning in automated text categorization. ACM Comput. Surv. (Mar. 2002)
- A novel probabilistic feature selection method for text classification. Knowl. Based Syst. (Dec. 2012)
- ForesTexter: an efficient random forest algorithm for imbalanced text categorization. Knowl. Based Syst. (Sep. 2014)
- Two feature weighting approaches for naive Bayes text classifiers. Knowl. Based Syst.
- TubeSpam: comment spam filtering on YouTube. Proceedings of the 14th International Conference on Machine Learning and Applications (ICMLA'15)
- Spam filtering: how the dimensionality reduction affects the accuracy of naive Bayes classifiers. J. Internet Serv. Appl.
- Combating comment spam with machine learning approaches. Proceedings of the 14th International Conference on Machine Learning and Applications (ICMLA'15)
- Exponential differential document count: a feature selection factor for improving Bayesian filters accuracy. Proceedings of the 2006 MIT Spam Conference (SP'06)
- The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory
- Towards a minimum description length based stopping criterion for semi-supervised time series classification. Proceedings of the 14th IEEE International Conference on Information Reuse and Integration (IRI'13)
- A training algorithm for optimal margin classifiers. Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (COLT'92)
- High-dimensional micro-array data classification using minimum description length and domain expert knowledge. Advances in Applied Artificial Intelligence, Vol. 4031 of Lecture Notes in Computer Science
- Large-scale machine learning with stochastic gradient descent
- Random forests. Mach. Learn.
- Classification and Regression Trees
- Support-vector networks. Mach. Learn.
- Nearest neighbor pattern classification. IEEE Trans. Inf. Theory
- Confidence-weighted linear classification for text categorization. J. Mach. Learn. Res.
- Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.
- The role of Occam's razor in knowledge discovery. Data Min. Knowl. Discov.
- Pattern Classification
- Term-weighting learning via genetic programming for text classification. Knowl. Based Syst.
- Bayesian network classifiers. Mach. Learn.
- A tutorial introduction to the minimum description length principle. Advances in Minimum Description Length: Theory and Applications
- Advances in Minimum Description Length: Theory and Applications
- Large-scale attribute selection using wrappers. Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining (CIDM'09)
- Text categorization using weight adjusted k-nearest neighbor classification. Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'01)
- A dual coordinate descent method for large-scale linear SVM. Proceedings of the 25th International Conference on Machine Learning (ICML'08)
- A practical guide to support vector classification. Tech. rep.
Cited by (25)
- CoFA for QoS based secure communication using adaptive chaos dynamical system in fog-integrated cloud. Digital Signal Processing: A Review Journal (2022)
- ML-MDLText: An efficient and lightweight multilabel text classifier with incremental learning. Applied Soft Computing Journal (2020). Citation excerpt: "However, as there is no standard procedure for calculating the description length of a model, these studies interpret and apply the MDL principle in very different ways [6]. In this study, we apply MDL in a new online multilabel classification method using the interpretation proposed by Silva et al. [6] to calculate the description length of a model in text classification problems. An important characteristic that stands out in MDL-based methods is that they tend to select the least complex model, which is also the model that can best capture the patterns of the data [6]."
- Gaussian Mixture Descriptors Learner. Knowledge-Based Systems (2020). Citation excerpt: "According to the authors, their method obtained better results than benchmark methods for spam classification. Silva et al. [9] extended the method proposed by Almeida and Yamakami [7] so that it can be used in other binary and multiclass text classification problems. Later, Silva et al. [8] extended the method proposed by Silva et al. [9] so that it can be applied to other classification problems."
- A review of soft techniques for SMS spam classification: Methods, approaches and applications. Engineering Applications of Artificial Intelligence (2019)
- Towards automatic filtering of fake reviews. Neurocomputing (2018). Citation excerpt: "It is an online and multinomial text classification method based on the minimum description length principle. Silva et al. [51] claim that the advantages of this method are its incremental learning capability, low computational cost, and robustness to prevent overfitting which are desirable characteristics for real-world, online, and large-scale text classification problems. Silva et al. [51] evaluated this method using 45 text datasets from various domains in online and offline learning scenarios."
- A recent overview of the state-of-the-art elements of text classification. Expert Systems with Applications (2018)