A discrete mixture-based kernel for SVMs: Application to spam and image categorization

https://doi.org/10.1016/j.ipm.2009.05.005

Abstract

In this paper, we investigate the problem of training support vector machines (SVMs) on count data. Multinomial Dirichlet mixture models allow us to model count data efficiently. SVMs, on the other hand, permit good discrimination. We therefore propose a hybrid model that combines the advantages of both. Finite mixture models are introduced as an SVM kernel to incorporate prior knowledge about the nature of the data involved in the problem at hand. To learn our mixture model, we propose a deterministic annealing component-wise EM algorithm combined with a minimum description length type criterion. In the context of this model, we compare different kernels. Through applications involving spam and image database categorization, we find that our data-driven kernel performs better.

Introduction

The technological developments of the last few decades have increased the volume of information (text, images and videos) on the Internet and intranets. Different approaches have been proposed to manage, filter and retrieve this information. They fall into two main categories: model-based approaches and discriminative classifiers. Model-based approaches rely on generative probabilistic models, whereas discriminative classifiers allow the construction of flexible decision boundaries. SVMs are a well-known example of discriminative classifiers (Boser et al., 1992; Vapnik, 1999). Because they are theoretically well founded, protect against over-fitting independently of the number of features, and remain effective on sparse data, SVMs have been widely used in many applications. Finite mixture models, on the other hand, provide a principled and effective way of clustering (McLachlan & Peel, 2000). The majority of the work done with both techniques has focused on continuous data. This paper, however, concerns the modeling and classification of count data, which are an important component in many applications and information management tasks (Bouguila, 2008; Bouguila & Ziou, 2007). To reach this goal, we use both mixture models and SVMs in a way that combines their respective advantages. Indeed, combining model-based and discriminative approaches has been shown to be effective in different applications (Jaakkola & Haussler, 1999). We therefore propose a mixture model-based kernel for SVMs to classify count data. The proposed kernel assumes that the data follow a multinomial Dirichlet mixture (MDM) distribution (Bouguila & Ziou, 2007), letting the data tell us as much as possible about their structure. The MDM model is learned using a modified deterministic annealing expectation maximization (DAEM) algorithm and a minimum description length (MDL) type criterion. The advantages of using a data-driven kernel, instead of “blindly” choosing classic kernels, are shown through experiments involving spam filtering and image database categorization.

The remainder of this paper is organized as follows. In Section 2, we propose our data-driven kernel. Section 3 outlines the proposed mixture model estimation and selection. Experimental results are presented in Section 4. Finally, we conclude the paper in Section 5.

Section snippets

SVMs

SVM is basically a learning machine for two-group classification problems (Boser et al., 1992). Assume that we have a data set of $N$ $V$-dimensional vectors $X = (X_1, \ldots, X_N)$ with labels $y_i \in \{-1, 1\}$, belonging to either of two linearly separable classes $C_1$ and $C_2$. Let each $X_i = (X_{i1}, \ldots, X_{iV})$, $i = 1, \ldots, N$, be a vector representing a document or an image $i$,
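As an illustration of the two-class setting described above, the following minimal sketch evaluates a linear decision rule and checks the standard hard-margin condition $y_i (w \cdot X_i + b) \geq 1$ on a toy data set. The weight vector `w`, the offset `b`, the toy count vectors and the function names are hypothetical values chosen for this example; they are not taken from the paper, and no training is performed here.

```python
def score(w, b, x):
    """Signed score w.x + b of a vector x with respect to the hyperplane (w, b)."""
    return sum(wv * xv for wv, xv in zip(w, x)) + b


def decision(w, b, x):
    """Predicted label in {-1, +1}: the side of the hyperplane on which x falls."""
    return 1 if score(w, b, x) >= 0 else -1


def satisfies_margin(w, b, X, y):
    """True if every labelled vector satisfies y_i * (w.X_i + b) >= 1."""
    return all(yi * score(w, b, xi) >= 1 for xi, yi in zip(X, y))


# Toy count vectors (V = 2) with labels in {-1, +1}; values are illustrative only.
X = [[3, 0], [4, 1], [0, 3], [1, 4]]
y = [1, 1, -1, -1]

# Hypothetical separating hyperplane, picked by hand for this toy set.
w, b = [0.5, -0.5], 0.0

print(satisfies_margin(w, b, X, y))
print([decision(w, b, x) for x in X])
```

A trained SVM would choose `w` and `b` to maximize the margin among all hyperplanes that satisfy this condition; the sketch only verifies the condition for a given candidate.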

Parameters estimation

Given the set of vectors $X$ representing a sample from our mixture model, an important problem is the estimation of the parameters $\Theta$ and the selection of the optimal number of clusters $M$. The estimation of the parameters of a mixture model is a well-known missing data problem, generally solved using the EM algorithm (Dempster, Laird, & Rubin, 1977). In the case of mixture models, the missing data are the vectors $Z_i = (Z_{i1}, \ldots, Z_{iM})$, $i = 1, \ldots, N$, where $Z_{ij} = 1$ if $j$ is the mixture component from which $X_i$
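To make the role of the missing membership vectors $Z_i$ more concrete, here is a minimal sketch of a plain EM E-step for a mixture of multinomial Dirichlet (Dirichlet compound multinomial) components. It computes the posterior probability that each count vector came from each component, which is the expected value of $Z_{ij}$. This is a generic E-step under stated assumptions, not the paper's modified DAEM procedure: it includes no annealing temperature, no component-wise update and no MDL-based selection. The names `mdm_log_pmf`, `e_step`, `weights` and `alphas` are hypothetical.

```python
import math


def mdm_log_pmf(x, alpha):
    """Log-probability of a count vector x under a multinomial Dirichlet
    (Dirichlet compound multinomial) distribution with parameters alpha."""
    n = sum(x)
    a = sum(alpha)
    # Multinomial coefficient n! / prod(x_v!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(xv + 1) for xv in x)
    # Gamma(a) / Gamma(n + a)
    log_ratio = math.lgamma(a) - math.lgamma(n + a)
    # prod_v Gamma(x_v + alpha_v) / Gamma(alpha_v)
    log_prod = sum(math.lgamma(xv + av) - math.lgamma(av)
                   for xv, av in zip(x, alpha))
    return log_coef + log_ratio + log_prod


def e_step(X, weights, alphas):
    """Posterior membership probabilities E[Z_ij] for each vector and component."""
    responsibilities = []
    for x in X:
        logs = [math.log(p) + mdm_log_pmf(x, a)
                for p, a in zip(weights, alphas)]
        m = max(logs)  # subtract the max for numerical stability
        unnorm = [math.exp(l - m) for l in logs]
        total = sum(unnorm)
        responsibilities.append([u / total for u in unnorm])
    return responsibilities


# Toy example with M = 2 components and V = 3 features (illustrative values).
X = [[9, 1, 0], [0, 1, 9]]
weights = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
resp = e_step(X, weights, alphas)
```

In a full EM loop, these probabilities would feed an M-step that re-estimates the mixing weights and the Dirichlet parameters of each component.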

Experiments design and comparison with other kernels

In this section, we present our results on two applications: spam filtering and hierarchical classification of vacation images. Through these two applications, we compare the effectiveness of our MDM kernel with several other kernels: a Fisher kernel based on a finite multinomial mixture (MM), the polynomial kernel $K_{poly}(X_i, X_j) = (X_i \cdot X_j + 1)^p$, the Gaussian kernel $K_{Gaussian}(X_i, X_j) = e^{-\rho \|X_i - X_j\|^2}$, the sigmoid kernel $K_{Sigmoid}(X_i, X_j) = \tanh(X_i \cdot X_j + 1)$, and a generalized form of RBF kernels $K_{d\text{-}RBF}(X_i, X_j)$
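The three classic baseline kernels whose formulas are given above can be sketched as follows. The function names and the default values of `p` and `rho` are assumptions made for this example; the paper's parameter settings are not shown in this excerpt. The Fisher, MDM and generalized RBF kernels are not covered, since their definitions do not appear in the text above.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, p=2):
    """Polynomial kernel (x.y + 1)^p."""
    return (dot(x, y) + 1) ** p


def gaussian_kernel(x, y, rho=0.1):
    """Gaussian kernel exp(-rho * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-rho * sq_dist)


def sigmoid_kernel(x, y):
    """Sigmoid kernel tanh(x.y + 1)."""
    return math.tanh(dot(x, y) + 1)


def gram_matrix(X, kernel):
    """Matrix of pairwise kernel values, as consumed by a kernel SVM solver."""
    return [[kernel(a, b) for b in X] for a in X]


# Toy count vectors (illustrative values only).
X = [[1, 2], [3, 0], [0, 0]]
K = gram_matrix(X, poly_kernel)
```

A Gram matrix computed this way can be passed to any SVM solver that accepts precomputed kernels, which makes it straightforward to swap one kernel for another in a comparison.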

Conclusions

In this paper, we have proposed a classification scheme for count data that incorporates both finite multinomial Dirichlet mixture models and SVMs. Through applications concerning spam filtering and image database categorization, kernels generated from the model are shown to outperform traditional kernels. It is well known, however, that the nature of spam changes over time and that image databases are in general dynamic. Future work can therefore be devoted to the development of an online

Acknowledgements

The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC), a NATEQ Nouveaux Chercheurs Grant, and a start-up grant from Concordia University.

References (54)

  • Bouguila, N., et al. (2006). Unsupervised selection of a finite Dirichlet mixture model: An MML-based approach. IEEE Transactions on Knowledge and Data Engineering.
  • Bouguila, N., et al. Novel mixtures based on the Dirichlet distribution: Application to data and image classification.
  • Boutemedjet, S., et al. (2008). A Graphical Model for Context-Aware Visual Content Recommendation. IEEE Transactions on Multimedia.
  • Celeux, G., et al. (2001). A component-wise EM algorithm for mixtures. Journal of Computational and Graphical Statistics.
  • Chandalia, G., & Beal, M. (2006). Using Fisher Kernels from topic models for dimensionality reduction. In Proceedings...
  • Chapelle, O., et al. (1999). Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks.
  • Cohen, W.W. (1996). Learning rules that classify e-mail. In Proceedings of the AAAI spring symposium on machine...
  • Cormack, G. V. (2006). Harnessing unlabeled examples through iterative application of dynamic markov modeling. In S....
  • Cranor, L.F., et al. (1998). Spam! Communications of the ACM.
  • Cranor, L.F., et al. (2004). An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing.
  • Dempster, A.P., et al. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B.
  • Drucker, H., et al. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks.
  • Dumais, S. (1998). Using SVMs for text categorization. IEEE Intelligent Systems.
  • Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text...
  • Elkan, C. (2005). Deriving TF-IDF as a Fisher Kernel. In Proceedings of the 12th international conference on string...
  • Elkan, C. (2006). Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial...
  • Fessler, J.A., et al. (1994). Space-alternating generalized expectation-maximization algorithm. IEEE Transactions on Signal Processing.