A discrete mixture-based kernel for SVMs: Application to spam and image categorization
Introduction
The technological developments of the last few decades have increased the volume of information (text, images and videos) available on the Internet and on intranets. Different approaches have been proposed to manage, filter and retrieve this information. They fall into two main categories: model-based approaches and discriminative classifiers. Model-based approaches rely on generative probabilistic models, whereas discriminative classifiers allow the construction of flexible decision boundaries. The SVM is a well-known example of a discriminative classifier (Boser et al., 1992, Vapnik, 1999). SVMs are theoretically well founded, offer protection against over-fitting that does not depend on the number of features, and are effective on sparse data; for these reasons they have been widely used in many applications. Finite mixture models, on the other hand, provide a principled and effective way of clustering (McLachlan & Peel, 2000). The majority of the work done with both techniques has focused on continuous data. This paper, however, concerns the modeling and classification of count data, which are an important component of many applications and information management tasks (Bouguila, 2008, Bouguila and Ziou, 2007). To reach this goal, we use mixture models and SVMs together, in a way that combines their respective advantages. Indeed, combining model-based and discriminative approaches has been shown to be effective in different applications (Jaakkola & Haussler, 1999). We therefore propose a mixture model-based kernel for SVMs to classify count data. The proposed kernel assumes that the data follow a multinomial Dirichlet mixture (MDM) distribution (Bouguila & Ziou, 2007), letting the data tell us as much as possible about their structure. The MDM model is learned using a modified deterministic annealing expectation maximization (DAEM) algorithm and a minimum description length (MDL) type criterion.
The advantages of using a data-driven kernel, instead of “blindly” choosing classic kernels, are shown through some experiments involving spam filtering and image database categorization.
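As a rough illustration of the distribution underlying the kernel, the sketch below evaluates the log-probability of a single count vector under one multinomial Dirichlet component (also known as the Dirichlet compound multinomial). It uses the standard closed form of that distribution; the function name `mdm_log_pmf` and the example values are assumptions made here for illustration and are not taken from the paper, which works with a mixture of such components.

```python
import math


def mdm_log_pmf(counts, alpha):
    """Log-probability of a count vector under one multinomial Dirichlet
    component with positive parameters alpha (illustrative helper)."""
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)   # total number of counts in the vector
    a = sum(alpha)    # sum of the Dirichlet parameters
    # Multinomial coefficient: n! / prod(x_v!)
    log_p = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalising constants
    log_p += math.lgamma(a) - math.lgamma(n + a)
    log_p += sum(math.lgamma(x + al) - math.lgamma(al)
                 for x, al in zip(counts, alpha))
    return log_p


# Hypothetical example: two features, uniform Dirichlet prior.
example = math.exp(mdm_log_pmf([1, 1], [1.0, 1.0]))
```

Working in log space with `math.lgamma` avoids overflow in the Gamma functions when the counts are large, which is typical for word or visual-word frequencies.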
The remainder of this paper is organized as follows. In Section 2, we propose our data-driven kernel. Section 3 outlines the proposed mixture model estimation and selection. Experimental results are presented in Section 4. Finally, we conclude the paper in Section 5.
SVMs
An SVM is basically a learning machine for two-group classification problems (Boser et al., 1992). Assume that we have a data set of N V-dimensional vectors with labels yi ∈ {−1, 1}, each belonging to one of two linearly separable classes C1 and C2. Let each vector represent a document or an image i,
Parameter estimation
Given the set of vectors representing a sample from our mixture model, an important problem is the estimation of the parameters Θ and the selection of the optimal number of clusters M. Estimating the parameters of a mixture model is a well-known missing data problem, generally solved using the EM algorithm (Dempster, Laird, & Rubin, 1977). In the case of mixture models, the missing data are the membership vectors, where Zij = 1 if j is the mixture component from which
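The role of the missing memberships Zij can be illustrated with a minimal sketch of an E-step, which replaces each Zij by its posterior probability. The version below includes the temperature parameter β used in standard deterministic annealing EM: β = 1 gives the ordinary EM posterior, and smaller values flatten it. This is a generic sketch, not the modified DAEM algorithm of the paper; the function name `tempered_posteriors` and its input convention (the log of mixing weight times component likelihood, per component) are assumptions.

```python
import math


def tempered_posteriors(log_weighted_likelihoods, beta=1.0):
    """Posterior membership probabilities for one data vector.

    log_weighted_likelihoods[j] is assumed to hold
    log(p_j * p(x | theta_j)) for component j.
    The result is proportional to (p_j * p(x | theta_j)) ** beta.
    """
    scaled = [beta * v for v in log_weighted_likelihoods]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example with two components.
posterior = tempered_posteriors([math.log(0.2), math.log(0.6)], beta=1.0)
```

In an annealing schedule, β would typically start near 0 and be increased towards 1 between EM runs, so that early iterations are less sensitive to initialization.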
Experimental design and comparison with other kernels
In this section, we present our results on two applications: spam filtering and hierarchical classification of vacation images. Through these two applications, we compare the effectiveness of our MDM kernel with that of several other kernels: a Fisher kernel based on a finite multinomial mixture (MM), the polynomial kernel, the Gaussian kernel and a generalized form of RBF kernels
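For reference, two of the classic baseline kernels named above can be sketched as follows. The exact formulas and parameter values used in the paper are not available in this excerpt, so the common textbook forms are assumed here: K(x, y) = (x·y + c)^d for the polynomial kernel and K(x, y) = exp(−‖x − y‖² / (2σ²)) for the Gaussian kernel. The parameter names `degree`, `coef0` and `sigma` and their default values are illustrative only.

```python
import math


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Textbook polynomial kernel (x . y + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """Textbook Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical count vectors, e.g. word frequencies in two e-mails.
k_poly = polynomial_kernel([1, 2], [3, 4])
k_gauss = gaussian_kernel([1, 2], [1, 2])
```

Unlike a mixture-based kernel, neither function uses any information about how the count data are distributed; their behaviour is fixed by the chosen parameters alone.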
Conclusions
In this paper, we have proposed a classification scheme for count data that incorporates both finite multinomial Dirichlet mixture models and SVMs. Through applications to spam filtering and image database categorization, kernels generated from the mixture model are shown to outperform traditional kernels. It is well known, however, that the nature of spam changes over time and that image databases are in general dynamic. Future work can therefore be devoted to the development of an online
Acknowledgements
The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC), a NATEQ Nouveaux Chercheurs Grant, and a start-up grant from Concordia University.
References (54)
- et al. Unsupervised learning of a finite discrete mixture: Applications to texture modeling and image databases summarization. Journal of Visual Communication and Image Representation (2007).
- et al. Learning to Classify E-Mail. Information Sciences (2007).
- et al. Deterministic annealing EM algorithm. Neural Networks (1998).
- et al. On image classification: City images vs landscapes. Pattern Recognition (1998).
- Natural gradient works efficiently in learning. Neural Computation (1998).
- et al. Latent Dirichlet allocation. Journal of Machine Learning Research (2003).
- Boser, B.E., Guyon, I.M., & Vapnik, V.N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of...
- Bouguila, N. (2007). Spatial color image databases summarization. In Proceedings of the IEEE international conference...
- Clustering of count data using generalized Dirichlet multinomial distributions. IEEE Transactions on Knowledge and Data Engineering (2008).
- Bouguila, N., & Ziou, D. (2004). Improving content based image retrieval systems using finite multinomial Dirichlet...
- Unsupervised selection of a finite Dirichlet mixture model: An MML-based approach. IEEE Transactions on Knowledge and Data Engineering.
- Novel mixtures based on the Dirichlet distribution: Application to data and image classification.
- A Graphical Model for Context-Aware Visual Content Recommendation. IEEE Transactions on Multimedia.
- A component-wise EM algorithm for mixtures. Journal of Computational and Graphical Statistics.
- Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks.
- Spam! Communications of the ACM.
- An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing.
- Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B.
- Support vector machines for spam categorization. IEEE Transactions on Neural Networks.
- Using SVMs for text categorization. IEEE Intelligent Systems.
- Space-alternating generalized expectation-maximization algorithm. IEEE Transactions on Signal Processing.