Elsevier

Pattern Recognition Letters

Volume 36, 15 January 2014, Pages 107-116

Geometrical and computational aspects of Spectral Support Estimation for novelty detection

https://doi.org/10.1016/j.patrec.2013.09.025

Highlights

  • We discuss separating kernels, providing a geometrical interpretation of the property.

  • We review the SSE algorithm, emphasizing its geometrical and computational aspects.

  • We discuss the SSE dependence on its hyper-parameters.

  • We show the effectiveness of the SSE for novelty detection on real benchmark datasets.

Abstract

In this paper we discuss the Spectral Support Estimation algorithm (De Vito et al., 2010) by analyzing its geometrical and computational properties. The estimator is non-parametric and the model selection depends on three parameters whose role is clarified by simulations on a two-dimensional space. The performance of the algorithm for novelty detection is tested and compared with its main competitors on a collection of real benchmark datasets of different sizes and types.

Introduction

Support estimation emerged in the sixties in statistics with the seminal works of Rényi and Sulanke (1963) and Geffroy (1964), and in the last decades it has become crucial in different fields of machine learning and pattern recognition such as, to mention a few, one-class estimation (Schölkopf et al., 2001) and novelty and anomaly detection (Markou and Singh, 2003, Chandola et al., 2009). These problems find applications in domains where it is difficult to gather negative examples (as often happens in biological and biomedical problems) or where the negative class is not well defined (as in object detection problems in computer vision).

Support estimation deals with the following setting. The population data are represented by $d$-dimensional column vectors of features, but they live in a proper subset $C \subset \mathbb{R}^d$ distributed according to some probability distribution $p(x)\,dv(x)$, where $dv$ is a suitable infinitesimal volume element of $C$. For example, $C$ could be a curve in $\mathbb{R}^d$, $dv$ the arc length, and $p(x)$ the density distribution of the data on the curve. Both the set $C$ and the distribution $p(x)\,dv(x)$ are known only through a training set $\{x_1,\dots,x_n\}$ of examples drawn independently from the population according to $p(x)\,dv(x)$. The aim of support estimation is to find a subset $C_n \subset \mathbb{R}^d$ such that $C_n$ is similar to $C$, if $n$ is large enough.

In this paper, we assume the set $C$ is the support of the probability distribution $\rho$ according to which the examples are drawn. Then $C$ is defined as the smallest closed subset of $\mathbb{R}^d$ with the property that $\rho(C)=1$.

To this purpose we review the Spectral Support Estimation algorithm introduced in De Vito et al. (2010) with an emphasis on its geometrical and computational properties and on its applicability to real novelty detection problems.

To have good estimators some geometrical a priori assumption on $C$ is needed. For example, if $C$ is convex, a choice for $C_n$ is the convex hull of the training set, as first proposed in Dümbgen and Walther (1996). If $C$ is an arbitrary set with non-zero $d$-dimensional Lebesgue measure, Devroye and Wise (1980) define $C_n$ as the union of the balls of center $x_i$ and radius $\varepsilon_n$, with $\varepsilon_n$ going to 0 as the number of data increases. A different point of view is taken by the so-called plug-in estimators. In such an approach one first provides an estimator of the probability density and then defines $C_n$ as the region with high density (Cuevas and Fraiman, 1997).

However, in many applications the data approximately live on a low-dimensional submanifold, whose Lebesgue measure is clearly zero, and one may take advantage of this a priori information by using recent ideas about dimensionality reduction, for example manifold learning algorithms (Donoho and Grimes, 2003, Belkin et al., 2006, and references therein) and kernel Principal Component Analysis (Schölkopf et al., 1998). Based on this idea, Hoffmann (2007) proposes a new algorithm for novelty detection, which can be seen as a support estimation problem. This point of view is further developed in De Vito et al. (2010), where a new class of consistent estimators, called Spectral Support Estimators (SSE), is proposed.

The contribution of this paper is threefold. First, we review the SSE algorithm, emphasizing its geometrical and computational aspects (we refer the reader interested in its statistical properties to De Vito et al. (2010)). Second, we discuss the dependence of the algorithm on its hyper-parameters with the help of a thorough qualitative analysis on synthetic data. This analysis also allows us to show the quality of the estimated support, which adapts nicely and smoothly to the training data, similarly to kernel PCA (Hoffmann, 2007). Third, we show the appropriateness of the algorithm on a large choice of real data and compare its performance against well-known competitors, namely K-Nearest Neighbours, Parzen windows (Tarassenko et al., 1995), one-class Support Vector Machines (Schölkopf et al., 2001), and kernel PCA for novelty detection (Hoffmann, 2007). To make the comparison fair, for each algorithm we select the optimal choice of the hyper-parameters following the procedure developed in Rudi et al. (2012).

To gain an intuition of the SSE algorithm, suppose $C$ is an $r$-dimensional linear subspace of $\mathbb{R}^d$. Consider the $d\times d$ matrix
$$T=\int_C x\,x^{\top}\,p(x)\,dx,$$
where the volume element $dv$ of $C$ is simply the $r$-dimensional Lebesgue measure $dx$. It is easy to check that the null space of $T$ is the orthogonal complement of $C$ in $\mathbb{R}^d$, that is, $C$ is the linear span of all the eigenvectors of $T$ with non-zero eigenvalues. Since a consistent estimator of $T$ is the empirical matrix $T_n=\frac{1}{n}\sum_{i=1}^{n}x_i x_i^{\top}$, one can define $C_n$ as the linear span of the eigenvectors of $T_n$ whose eigenvalue is bigger than a threshold $\lambda$. As in supervised learning, the thresholding ensures a stable solution with respect to the noise. Now, if $\lambda$ goes to zero as $n$ increases, $C_n$ becomes closer and closer to $C$, providing us with a consistent estimator. Furthermore, to test whether a new point $x$ of $\mathbb{R}^d$ belongs to $C$ or not, a simple decision rule is given by $F_n(x)=\sum_{\ell=1}^{r}(u_\ell^{\top}x)^2$, where $u_1,\dots,u_r$ are the eigenvectors spanning $C_n$. Indeed, $0\le F_n(x)\le x^{\top}x$ for all $x\in\mathbb{R}^d$, and $F_n(x)$ is close to $x^{\top}x$ (that is, the norm of $x$ is close to the norm of the projection of $x$ onto $C_n$) if and only if $x$ is near to $C$. Note that if $T_n$ is replaced by the covariance matrix, then $C_n$ is nothing else than principal component analysis.
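As a concrete illustration of this linear case, the following sketch (a minimal numpy example; the subspace, the data, the threshold $\lambda$ and the acceptance tolerance $\tau$ are illustrative choices, not quantities taken from the paper) builds $T_n$, keeps the eigenvectors above the threshold, and applies the decision rule $F_n$.

```python
import numpy as np

def linear_sse_fit(X, lam):
    """Span of the eigenvectors of T_n = (1/n) sum_i x_i x_i^T with eigenvalue > lam."""
    n = X.shape[0]
    Tn = X.T @ X / n                          # empirical second-moment matrix
    evals, evecs = np.linalg.eigh(Tn)         # eigenvalues in ascending order
    return evecs[:, evals > lam]              # columns span the estimated subspace C_n

def linear_sse_score(U, x):
    """F_n(x) = sum_j (u_j . x)^2, squared norm of the projection of x onto C_n."""
    return float(np.sum((U.T @ x) ** 2))

# Toy example: training data lying on a 2-dimensional subspace of R^5.
rng = np.random.default_rng(0)
B, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # orthonormal basis of the true subspace
X = rng.normal(size=(200, 2)) @ B.T           # n = 200 points on the subspace

U = linear_sse_fit(X, lam=1e-3)

x_in = X[0]                                   # a point of the support
v = rng.normal(size=5)
v_perp = v - B @ (B.T @ v)                    # direction orthogonal to the subspace
x_out = x_in + 2.0 * v_perp / np.linalg.norm(v_perp)

tau = 1e-6                                    # illustrative acceptance tolerance
for x in (x_in, x_out):
    gap = x @ x - linear_sse_score(U, x)      # squared distance of x from C_n
    print("inlier" if gap <= tau else "novelty", gap)
```

The gap $x^{\top}x-F_n(x)$ is exactly the squared distance of $x$ from the estimated subspace, which is why thresholding it with a small tolerance gives a natural accept/reject rule.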

More generally, if $C$ is not a linear subspace the above algorithm does not work, as happens in binary classification problems with linear Support Vector Machines when the two classes are not linearly separable. This suggests the use of the kernel trick, which requires a feature map $\Phi$, mapping the input space $\mathbb{R}^d$ into the feature space $\mathcal{H}$, such that $\Phi(C)$ is a linear subspace of $\mathcal{H}$. This strong condition is satisfied by the separating reproducing kernels introduced in De Vito et al. (2010).

The paper is organized as follows. Section 2 introduces the separating kernels, emphasizing their geometrical properties. In Section 3 we review the SSE algorithm and in Section 4 we discuss how the algorithm is influenced by the choice of the parameters, supporting our theoretical analysis with simulations on synthetic data. In Section 5 we compare SSE with other methods from the literature on a vast selection of real datasets. Section 6 is devoted to a final discussion.

Section snippets

Separating kernels

We now set the mathematical framework, discuss the role of separating kernels, and give some examples of them.

A spectral algorithm for support estimation

In this section we introduce the algorithm by discussing its geometrical interpretation. We then derive a simple implementation of the algorithm and observe how it can be instantiated into different methods according to a specific choice of filter (similarly to what was done in Lo Gerfo et al. (2008) for the supervised case). We conclude the section with an analysis of the algorithm's computational cost.
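To make the computational discussion concrete, here is a rough sketch (not the authors' implementation) of the kernelized estimator computed entirely from the Gram matrix of a Laplacian kernel. The spectral filter applied to the eigenvalues can be a hard truncation at $\lambda$ or the Tikhonov weighting $\sigma/(\sigma+\lambda)$, in the spirit of the filter families of Lo Gerfo et al. (2008); the kernel width $\gamma$, the regularization $\lambda$ and the tolerance $\tau$ below are illustrative values.

```python
import numpy as np

def laplacian_kernel(A, B, gamma):
    """K(x, y) = exp(-||x - y|| / gamma)."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-dists / gamma)

def sse_fit(X, gamma, lam, filt="truncation"):
    """Eigendecompose (1/n) K and attach a spectral-filter weight to each component."""
    n = X.shape[0]
    K = laplacian_kernel(X, X, gamma)
    evals, evecs = np.linalg.eigh(K / n)        # eigenpairs of the normalized Gram matrix
    evals = np.clip(evals, 0.0, None)           # guard against tiny negative eigenvalues
    if filt == "truncation":                    # hard thresholding of the spectrum
        w = (evals > lam).astype(float)
    elif filt == "tikhonov":                    # smooth weighting sigma / (sigma + lambda)
        w = evals / (evals + lam)
    else:
        raise ValueError(filt)
    keep = evals > 1e-12                        # components represented in the data span
    return dict(X=X, gamma=gamma, V=evecs[:, keep], s=evals[keep], w=w[keep])

def sse_score(model, Xtest):
    """F_n(x) = sum_a w(sigma_a) <u_a, Phi(x)>^2: filtered squared norm of the
    projection of Phi(x) onto the subspace estimated in the feature space."""
    n = model["X"].shape[0]
    Kx = laplacian_kernel(Xtest, model["X"], model["gamma"])   # cross-kernel, shape (m, n)
    coords = (Kx @ model["V"]) / np.sqrt(n * model["s"])       # coordinates <u_a, Phi(x)>
    return (coords ** 2) @ model["w"]

# Illustrative use: accept x when K(x, x) - F_n(x) <= tau (K(x, x) = 1 for this kernel).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
model = sse_fit(X, gamma=0.5, lam=1e-3)
gaps = 1.0 - sse_score(model, np.array([[0.0, 0.0], [10.0, 10.0]]))
print(gaps)   # the far-away point should receive the larger gap, i.e. be flagged as novel
```

The cost is dominated by the $n\times n$ eigendecomposition, $O(n^3)$ in the worst case; randomized low-rank decompositions (Halko et al., 2011) can be used to reduce it when only the leading components matter.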

The parameters choice

In this section we give a qualitative evaluation, on synthetic data, of the effect of the parameter choice on the SSE algorithm.

The SSE algorithm described in the previous section depends on two parameters: the regularization parameter $\lambda$ and the threshold $\tau$. In addition there are the parameters of the kernel, for instance the width $\gamma$ of the Laplacian kernel. If $\lambda=\tau=0$ the separability property implies that $C_n=\{x_1,\dots,x_n\}$, which over-fits the data, and if $\lambda$ goes to infinity, $F_n(x)$ becomes equal to the trivial
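To give a feeling for the interplay between $\lambda$ and the kernel width $\gamma$, the small sketch below (arbitrary values on synthetic data, not the experimental grid used in the paper) counts how many spectral components of the normalized Gram matrix survive the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=300)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(300, 2))  # noisy circle

def n_components(X, gamma, lam):
    """Number of eigenvalues of (1/n) K above the regularization threshold lambda."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    evals = np.linalg.eigvalsh(np.exp(-dists / gamma) / X.shape[0])
    return int(np.sum(evals > lam))

for gamma in (0.1, 0.5, 2.0):
    print(gamma, [n_components(X, gamma, lam) for lam in (1e-5, 1e-3, 1e-1)])
```

Small values of $\lambda$ retain many components, pushing the estimate towards the over-fitting regime $C_n=\{x_1,\dots,x_n\}$; as $\lambda$ grows fewer and fewer components survive, until $F_n$ is identically zero and the decision rule becomes trivial.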

Real datasets

In this section we carry out a thorough experimental analysis on a selection of benchmark datasets of different size ($n$), dimension of the ambient space ($d$), and nature. Lacking specific benchmarks for support estimation, we consider the closely related novelty detection problem, and start from benchmark multi-class datasets, learning one class at a time.
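The protocol can be sketched as follows: for each class, the detector is trained on that class alone and then asked to separate the test points of the same class (nominal) from all the other classes (novelties). A minimal sketch, assuming a generic detector exposed through hypothetical fit/score callables (the model selection procedure of Rudi et al. (2012) is not reproduced here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def one_class_benchmark(X_train, y_train, X_test, y_test, fit, score):
    """Train on one class at a time and measure how well the novelty score separates
    that class from all the others in the test set (area under the ROC curve)."""
    aucs = {}
    for c in np.unique(y_train):
        model = fit(X_train[y_train == c])            # train on class c only
        s = score(model, X_test)                      # higher score = more novel (assumed)
        aucs[c] = roc_auc_score((y_test != c).astype(int), s)
    return aucs
```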

Table 1 summarizes the characteristics of each dataset and highlights the variety of the different application settings considered: the MNIST

Discussion

The paper presented an extensive discussion of the computational and geometrical properties of a recently proposed algorithm for support estimation, the so-called Spectral Support Estimation (SSE) algorithm. The main aim of the paper was to review the estimator, providing a geometrical interpretation and highlighting the properties of the kernel functions to be adopted. A further aim of the paper was to gain a deeper understanding of its computational aspects, deriving a simple implementation

References (24)

  • H. Hoffmann, Kernel PCA for novelty detection, Pattern Recognition (2007)
  • M. Markou et al., Novelty detection: a review – part 1: statistical approaches, Signal Processing (2003)
  • K. Bache, M. Lichman, UCI Machine Learning Repository (2013). URL...
  • M. Belkin et al., Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research (2006)
  • V. Chandola et al., Anomaly detection: a survey, ACM Computing Surveys (CSUR) (2009)
  • A. Cuevas et al., A plug-in approach to support estimation, The Annals of Statistics (1997)
  • E. De Vito et al., Spectral regularization for support estimation, Advances in Neural Information Processing Systems, NIPS Foundation (2010)
  • L. Devroye et al., Detection of abnormal behavior via nonparametric estimation of the support, SIAM Journal on Applied Mathematics (1980)
  • D.L. Donoho et al., Hessian eigenmaps: locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences of the United States of America (2003)
  • L. Dümbgen et al., Rates of convergence for random approximations of convex sets, Advances in Applied Probability (1996)
  • J. Geffroy, Sur un problème d'estimation géométrique, Publications of the Institute of Statistics of the University of Paris (1964)
  • N. Halko et al., Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions, SIAM Review (2011)