Geometrical and computational aspects of Spectral Support Estimation for novelty detection
Introduction
Support estimation emerged in the sixties in statistics with the seminal works of Rényi and Sulanke (1963) and Geffroy (1964), and in the last decades has become crucial in different fields of machine learning and pattern recognition such as, to mention a few, one-class estimation (Schölkopf et al., 2001) and novelty and anomaly detection (Markou and Singh, 2003; Chandola et al., 2009). These problems find applications in domains where it is difficult to gather negative examples (as often happens in biological and biomedical problems) or where the negative class is not well defined (as in object detection problems in computer vision).
Support estimation deals with the following setting. The population data are represented by d-dimensional column vectors of features, but they live in a proper subset $C \subset \mathbb{R}^d$ and are distributed according to some probability distribution $\rho(x)\,dv(x)$, where $dv$ is a suitable infinitesimal volume element of C. For example, C could be a curve in $\mathbb{R}^2$, $dv$ the arc length, and ρ the density distribution of the data on the curve. Both the set C and the distribution ρ are known only through a training set $X_n=\{x_1,\dots,x_n\}$ of examples drawn independently from the population according to ρ. The aim of support estimation is to find a subset $\hat C_n \subset \mathbb{R}^d$ such that $\hat C_n$ is similar to C, if n is large enough.
In this paper, we assume the set C is the support of the probability distribution according to which the examples are drawn. Then C is defined as the smallest closed subset of $\mathbb{R}^d$ having probability one with respect to the distribution of the data.
To this end, we review the Spectral Support Estimation algorithm introduced in De Vito et al. (2010), with an emphasis on its geometrical and computational properties and on its applicability to real novelty detection problems.
To obtain good estimators, some geometrical a priori assumption on C is needed. For example, if C is convex, a natural choice for $\hat C_n$ is the convex hull of the training set, as first proposed in Dümbgen and Walther (1996). If C is an arbitrary set with non-zero d-dimensional Lebesgue measure, Devroye and Wise (1980) define $\hat C_n$ as the union of the balls of center $x_i$ and radius ε, with ε going to 0 as the number of data increases. A different point of view is taken by the so-called plug-in estimators: one first provides an estimator of the probability density and then $\hat C_n$ is defined as the region of high density (Cuevas and Fraiman, 1997).
However, in many applications the data approximatively live on a low dimensional submanifold, whose Lebesgue measure is clearly zero, and one may take advantage of this a priori information by using some recent ideas about dimensionality reduction, as for example, manifold learning algorithms (Donoho and Grimes, 2003, Belkin et al., 2006, and references therein) and kernel Principal Component Analysis (Schölkopf et al., 1998). Based on this idea, Hoffmann (2007) proposes a new algorithm for novelty detection, which can be seen as a support estimation problem. This point of view is further developed in De Vito et al. (2010), where a new class of consistent estimators, called Spectral Support Estimators (SSE), is proposed.
The contribution of this paper is threefold. First, we review the SSE algorithm, emphasizing its geometrical and computational aspects (we refer the reader interested in its statistical properties to De Vito et al. (2010)). Second, we discuss the dependence of the algorithm on its hyper-parameters with the help of a thorough qualitative analysis on synthetic data. This analysis also allows us to show the quality of the estimated support, which adapts nicely and smoothly to the training data, similarly to kernel PCA (Hoffmann, 2007). Third, we show the appropriateness of the algorithm on a large selection of real data and compare its performance against well-known competitors, namely K-Nearest Neighbours, Parzen windows (Tarassenko et al., 1995), one-class Support Vector Machines (Schölkopf et al., 2001), and kernel PCA for novelty detection (Hoffmann, 2007). To make the comparison fair, for each algorithm we select the optimal hyper-parameters following a procedure developed in Rudi et al. (2012).
To gain an intuition of the SSE algorithm, suppose C is an r-dimensional linear subspace of $\mathbb{R}^d$. Consider the $d\times d$ matrix $T=\int_C x\,x^{\top}\rho(x)\,dv(x)$, where the volume element $dv$ of C is simply the r-dimensional Lebesgue measure $dx$. It is easy to check that the null space of T is the orthogonal complement of C in $\mathbb{R}^d$, that is, C is the linear span of all the eigenvectors of T with non-zero eigenvalues. Since a consistent estimator of T is the empirical matrix $\hat T_n=\frac{1}{n}\sum_{i=1}^{n}x_i x_i^{\top}$, one can define $\hat C_n$ as the linear span of the eigenvectors of $\hat T_n$ whose eigenvalue is bigger than a threshold λ. As in supervised learning, the thresholding ensures a solution that is stable with respect to noise. Now, if λ goes to zero as n increases, $\hat C_n$ becomes closer and closer to C, providing us with a consistent estimator. Furthermore, to test whether a new point x of $\mathbb{R}^d$ belongs to C or not, a simple decision rule is given by comparing $\sum_{j}\langle x,u_j\rangle^{2}$ with $\|x\|^{2}$, where $u_1,\dots,u_k$ are the eigenvectors spanning $\hat C_n$. Indeed, $\sum_{j}\langle x,u_j\rangle^{2}\le\|x\|^{2}$ for all $x\in\mathbb{R}^d$, but it is close to $\|x\|^{2}$ (that is, the norm of x is near to the norm of the projection of x onto $\hat C_n$) if and only if x is near to C. Note that if $\hat T_n$ is replaced by the empirical covariance matrix, then $\hat C_n$ is nothing else than principal component analysis.
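The following is a minimal NumPy sketch of this linear intuition. It is only illustrative: the helper names fit_linear_support and is_in_support and the tolerance tol are our own, not taken from the paper. The subspace is estimated from the eigenvectors of $\hat T_n$ with eigenvalue above λ, and a point is accepted when the squared norm of its projection is close to its squared norm.

```python
import numpy as np

def fit_linear_support(X, lam):
    """X: (n, d) training set; lam: eigenvalue threshold."""
    n = X.shape[0]
    T_n = X.T @ X / n                       # empirical second-moment matrix
    eigvals, eigvecs = np.linalg.eigh(T_n)  # eigenvalues in ascending order
    return eigvecs[:, eigvals > lam]        # eigenvectors spanning the estimated subspace

def is_in_support(x, U, tol=1e-2):
    """Accept x when the squared norm of its projection onto span(U)
    is close to its squared norm."""
    proj_sq = np.sum((U.T @ x) ** 2)
    return proj_sq >= (1.0 - tol) * np.dot(x, x)

# toy usage: data lying on a one-dimensional subspace of R^3
rng = np.random.default_rng(0)
X = np.outer(rng.standard_normal(200), [1.0, 2.0, 0.0])
U = fit_linear_support(X, lam=1e-6)
print(is_in_support(np.array([1.0, 2.0, 0.0]), U))  # True: on the subspace
print(is_in_support(np.array([0.0, 0.0, 1.0]), U))  # False: orthogonal to it
```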
More generally, if C is not a linear subspace the above algorithm does not work, just as happens in binary classification with linear Support Vector Machines when the two classes are not linearly separable. This suggests the use of the kernel trick, which requires a feature map $\Phi:\mathbb{R}^d\to\mathcal{H}$ mapping the input space $\mathbb{R}^d$ into the feature space $\mathcal{H}$, with $\Phi(C)$ a linear subspace of $\mathcal{H}$. This strong condition is satisfied by the separating reproducing kernels introduced in De Vito et al. (2010).
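The sketch below gives a hedged picture of this kernelized intuition: a kernel-PCA-style projection test built from the eigendecomposition of the Gram matrix. It is meant to illustrate the feature-space picture, not to reproduce the exact SSE estimator of Section 3; the Laplacian kernel width gamma, the helper names fit_kernel_support and score, and the thresholding are illustrative assumptions.

```python
import numpy as np

def laplacian_kernel(A, B, gamma):
    """k(x, y) = exp(-gamma * ||x - y||), so k(x, x) = 1."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return np.exp(-gamma * dists)

def fit_kernel_support(X, gamma, lam):
    """Eigendecompose the Gram matrix and keep the components whose
    eigenvalue of the empirical operator (sigma / n) exceeds lam."""
    n = X.shape[0]
    K = laplacian_kernel(X, X, gamma)
    sigma, V = np.linalg.eigh(K)
    keep = sigma / n > lam
    return X, V[:, keep], sigma[keep], gamma

def score(x, model):
    """Squared norm of the projection of Phi(x) onto the estimated
    subspace; since k(x, x) = 1, values close to 1 mean x is close to C."""
    X, V, sigma, gamma = model
    k_x = laplacian_kernel(X, x[None, :], gamma)[:, 0]
    coeffs = V.T @ k_x            # V[:, j] . k_x = sqrt(sigma_j) * <Phi(x), u_j>
    return np.sum(coeffs ** 2 / sigma)

# a point x is accepted when score(x, model) >= 1 - tau for a threshold tau
```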
The paper is organized as follows. Section 2 introduces the separating kernels by emphasizing their geometrical properties. In Section 3 we review the SSE algorithm and in Section 4 we discuss how the algorithm is influenced by the choice of the parameters, supporting our theoretical analysis with simulations on synthetic data. In Section 5 we compare SSE with other methods from the literature on a vast selection of real datasets. Section 6 is left to a final discussion.
Separating kernels
We now set up the mathematical framework, discuss the role of separating kernels, and give some examples of them.
A spectral algorithm for support estimation
In this section we introduce the algorithm by discussing its geometrical interpretation. We then derive a simple implementation and observe how the algorithm can be instantiated in different ways according to the specific choice of a spectral filter (similarly to what was done in Lo Gerfo et al. (2008) for the supervised case). We conclude the section with an analysis of the computational cost of the algorithm.
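As an illustration of the role of the filter, the snippet below writes three standard spectral filters from the regularization literature (Lo Gerfo et al., 2008) as the function $r_\lambda(\sigma)=\sigma\, g_\lambda(\sigma)$, which stably approximates the projection onto the eigenspaces with non-zero eigenvalue. How these filters enter the SSE decision function is described in the paper; the code only sketches the filter functions themselves, and the names and the toy eigenvalue grid are ours.

```python
import numpy as np

def cutoff_filter(sigma, lam):
    """Truncated eigendecomposition: hard thresholding of the eigenvalues."""
    return (sigma >= lam).astype(float)

def tikhonov_filter(sigma, lam):
    """Tikhonov regularization: smooth approximation of the projection."""
    return sigma / (sigma + lam)

def landweber_filter(sigma, lam, step=1.0):
    """Landweber iteration with roughly 1/lam steps
    (assumes eigenvalues rescaled to [0, 1] when step = 1)."""
    t = max(int(1.0 / lam), 1)
    return 1.0 - (1.0 - step * sigma) ** t

sigma = np.linspace(0.0, 1.0, 5)
print(cutoff_filter(sigma, 0.1))
print(tikhonov_filter(sigma, 0.1))
print(landweber_filter(sigma, 0.1))
```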
The choice of the parameters
In this section we give a qualitative evaluation, on synthetic data, of the effect of the choice of the parameters on the SSE algorithm.
The SSE algorithm described in the previous section depends on two parameters: the regularization parameter λ and the threshold τ. In addition, there are the parameters of the kernel, for instance the width γ of the Laplacian kernel. If λ is close to zero, the separability property implies that $\hat C_n$ over-fits the data, while if λ goes to infinity, $\hat C_n$ becomes equal to the trivial estimator.
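A small experiment sketch of this behaviour follows, reusing the hypothetical fit_kernel_support and score helpers from the Introduction sketch (so it is not self-contained, and the grid of λ values, the kernel width and the threshold τ are illustrative): as λ grows, fewer spectral components are kept and the estimated support moves from an over-fitted region toward an overly coarse one.

```python
import numpy as np

# assumes fit_kernel_support() and score() from the earlier sketch are in scope
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)]          # data on the unit circle
X_train, X_val = X[:200], X[200:]

gamma, tau = 2.0, 0.05
for lam in (1e-6, 1e-3, 1e-1):
    model = fit_kernel_support(X_train, gamma, lam)
    on = np.mean([score(x, model) >= 1 - tau for x in X_val])
    off = np.mean([score(s * np.array([2.0, 2.0]), model) >= 1 - tau
                   for s in np.linspace(0.5, 2.0, 10)])
    print(f"lambda={lam:g}: on-support acceptance {on:.2f}, "
          f"off-support acceptance {off:.2f}")
```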
Real datasets
In this section we carry out a thorough experimental analysis on a selection of benchmark datasets of different size (n), dimension of the ambient space (d) and nature. Lacking specific benchmarks for support estimation, we consider the closely related novelty detection problem, and start from benchmark multi-class datasets, learning one class at a time (a minimal version of this protocol is sketched below).
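A minimal sketch of the "one class at a time" protocol, using scikit-learn's one-class SVM (one of the competitors listed in the Introduction) as a placeholder model; the split, the hyper-parameters and the use of the AUC as a score are illustrative and do not reproduce the exact experimental pipeline of this section.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

def one_class_benchmark(X, y, target_class, train_frac=0.7, seed=0):
    """Train on a subset of one class only, test against all classes."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == target_class)
    rng.shuffle(pos)
    n_train = int(train_frac * len(pos))
    train_idx = pos[:n_train]                        # positives only
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)

    model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    model.fit(X[train_idx])
    scores = model.decision_function(X[test_idx])    # higher = more "normal"
    labels = (y[test_idx] == target_class).astype(int)
    return roc_auc_score(labels, scores)
```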
Table 1 summarizes the characteristics of each dataset and highlights the variety of the different application settings considered: the MNIST
Discussion
The paper presented an extensive discussion of the computational and geometrical properties of a recently proposed algorithm for support estimation, the so-called Spectral Support Estimation (SSE) algorithm. The main aim of the paper was to review the estimator, providing a geometrical interpretation and highlighting the properties of the kernel functions to be adopted. A further aim was to gain a deeper understanding of its computational aspects, deriving a simple implementation.
References

Hoffmann, H., 2007. Kernel PCA for novelty detection. Pattern Recognition.

Markou, M., Singh, S., 2003. Novelty detection: a review – part 1: statistical approaches. Signal Processing.

Bache, K., Lichman, M., 2013. UCI Machine Learning Repository. URL...

Belkin, M., et al., 2006. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research.

Chandola, V., et al., 2009. Anomaly detection: a survey. ACM Computing Surveys (CSUR).

Cuevas, A., Fraiman, R., 1997. A plug-in approach to support estimation. The Annals of Statistics.

De Vito, E., et al., 2010. Spectral regularization for support estimation. Advances in Neural Information Processing Systems, NIPS Foundation.

Devroye, L., Wise, G.L., 1980. Detection of abnormal behavior via nonparametric estimation of the support. SIAM Journal on Applied Mathematics.

Donoho, D.L., Grimes, C., 2003. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences of the United States of America.

Dümbgen, L., Walther, G., 1996. Rates of convergence for random approximations of convex sets. Advances in Applied Probability.

Geffroy, J., 1964. Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris.

Halko, N., Martinsson, P.G., Tropp, J.A., 2011. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review.