
Pattern Recognition

Volume 46, Issue 11, November 2013, Pages 3140-3147

On some convergence properties of the subspace constrained mean shift

https://doi.org/10.1016/j.patcog.2013.04.014

Highlights

  • We investigate convergence properties of the MS and SCMS algorithms.

  • The SCMS is shown to inherit important convergence properties of the MS algorithm.

  • Theoretical guarantees for stopping criteria are provided.

  • Three variations of the SCMS are proposed and tested.

Abstract

Subspace constrained mean shift (SCMS) is a non-parametric, iterative algorithm that has recently been proposed to find principal curves and surfaces based on a new definition involving the gradient and Hessian of a kernel probability density estimate. Although simulation results using synthetic and real data have demonstrated the usefulness of the SCMS algorithm, a rigorous study of its convergence is still missing. This paper aims to take initial steps in this direction by showing that the SCMS algorithm inherits some important convergence properties of the mean shift (MS) algorithm. In particular, the monotonicity and convergence of the density estimate values along the sequence of output values of the algorithm are shown. Also, it is shown that the distance between consecutive points of the output sequence converges to zero, as does the projection of the gradient vector onto the subspace spanned by the $D-d$ eigenvectors corresponding to the $D-d$ largest eigenvalues of the local inverse covariance matrix. These last two properties provide theoretical guarantees for stopping criteria. By modifying the projection step, three variations of the SCMS algorithm are proposed, and the running times and performance of the resulting algorithms are compared.

Introduction

Dimensionality reduction and manifold-learning techniques provide compact and meaningful representations which facilitate compression, classification, and visualization of high dimensional data. In many applications it is a realistic assumption that the observed high dimensional data have an intrinsically low dimensional structure, so that the data points lie on or near a low dimensional manifold, embedded in the high dimensional space. A multitude of different algorithms have been introduced to find or approximate such a low-dimensional manifold; see, e.g. [1] for an overview.

In some situations, the observed data can be modeled as low-dimensional “clean” data corrupted by high-dimensional noise. In this case, applying common linear or nonlinear dimensionality reduction techniques to the noisy observations may not lead to a meaningful low dimensional representation. Partly to overcome this problem, nonlinear generalizations of principal components, called principal curves (and surfaces), have been proposed. The first formal definition of a principal curve was given by Hastie and Stuetzle [2]. According to their definition, a principal curve is a smooth (one-dimensional) curve that passes through the “middle of a data set.” More formally, a smooth, parameterized curve that does not intersect itself and has finite length inside any bounded ball is a principal curve of a probability distribution if each of its points is the (conditional) mean of the distribution given the set of points that project to it.

Several definitions of principal curves and algorithms to construct them have been proposed based on, or inspired by, Hastie and Stuetzle's original definition (see [3], [4], [5], [6], [7], [8] among others). The aim of these new definitions and algorithms was to address some of the shortcomings of the original (and subsequent) definition(s) and to extend the range of potential applications. Recently, an interesting new definition of principal curves and surfaces has been proposed by Ozertem and Erdogmus [9]. According to this definition, given a smooth (at least twice continuously differentiable) probability density function (pdf) $f$ on $\mathbb{R}^D$, a $d$-dimensional principal surface ($d<D$) is the collection of all points where the gradient of $f$ is orthogonal to exactly $D-d$ eigenvectors of the Hessian of $f$, and the eigenvalues corresponding to these eigenvectors are negative. Thus each point on the principal surface is a local maximum of the pdf in a $(D-d)$-dimensional affine subspace, and the principal surface is a $d$-dimensional ridge of the pdf. An attractive property of this new definition is that the smoothness of principal curves and surfaces is not stipulated by their definition, but rather it is inherited from the smoothness of the underlying pdf or its estimate.

To estimate principal curves/surfaces based on the new definition, Ozertem and Erdogmus [9] proposed the so-called subspace constrained mean shift (SCMS) algorithm. It is a generalization of the well-known mean shift (MS) algorithm [10], [11], [12] that iteratively tries to find modes of a pdf (estimated from data samples) in a local subspace. On synthetic data sets the performance of the SCMS algorithm is comparable to (and in some situations better than) the principal curve algorithms of Hastie and Stuetzle [2] and Kégl et al. [7], and it is computationally less demanding. Moreover, in contrast to most previous principal curve algorithms, the SCMS algorithm can naturally handle loops and self-intersections, and it easily generalizes from principal curves to surfaces. Applications to time-series denoising and independent component analysis (among others) were also presented in [9]. Recently, the present authors have successfully applied a version of the SCMS algorithm to vector quantization of noisy sources [13].

Based on an assertion in [12] that the MS algorithm converges, Ozertem and Erdogmus claimed that their SCMS algorithm converges to a principal curve/surface. However, Li et al. [14] pointed out a seemingly fundamental mistake in the proof of the convergence of the MS algorithm in [12]. Thus it seems that, similar to most previous principal curve algorithms (with the exception of [7], [8]), no optimality properties for the SCMS algorithm have been proved.

The purpose of this paper is to investigate some convergence properties of the SCMS algorithm. While we cannot prove that the sequence produced by the algorithm converges (let alone to a principal curve/surface), we show a convergence result concerning the estimated pdf values along the output sequence, which is indicative of the ridge property of the newly defined principal curves. We also show that the two stopping criteria proposed in [9] indeed ensure that the algorithm stops after a finite number of steps. Since these criteria are based on the fact that any point on the principal curve/surface is a fixed point of the SCMS algorithm, these results can be considered as steps toward proving the optimality of the SCMS algorithm, or an improved version of it. In addition, we introduce three variations of the SCMS algorithm for which our convergence results also apply. The performance of these algorithms is compared through simulations.


Locally defined principal curves and surfaces

Let $f$ be a pdf on $\mathbb{R}^D$ that is at least twice continuously differentiable, with gradient $\nabla f$ and Hessian $H$. For $d \in \{0,1,\dots,D-1\}$, Ozertem and Erdogmus defined the $d$-dimensional principal surfaces associated with the pdf $f$ as follows:

Definition 1

Ozertem and Erdogmus [9]

The $d$-dimensional principal surface $\mathcal{P}_d$ associated with the pdf $f$ is the collection of all points $x \in \mathbb{R}^D$ such that the gradient $\nabla f(x)$ is orthogonal to exactly $D-d$ eigenvectors of the Hessian $H(x)$, and the eigenvalues of $H(x)$ corresponding to these $D-d$ orthogonal eigenvectors are negative.
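In symbols (our restatement of the definition, with $v_1(x),\dots,v_D(x)$ and $\lambda_1(x),\dots,\lambda_D(x)$ denoting the eigenvectors and eigenvalues of $H(x)$; the index-set notation is ours):

```latex
\mathcal{P}_d = \Bigl\{ x \in \mathbb{R}^D : \exists\, I \subset \{1,\dots,D\},\ |I| = D-d,
  \ \nabla f(x)^{\mathsf T} v_i(x) = 0 \ \text{and}\ \lambda_i(x) < 0 \ \text{for all } i \in I,
  \ \nabla f(x)^{\mathsf T} v_i(x) \neq 0 \ \text{for all } i \notin I \Bigr\}.
```

For $d=0$ this requires $\nabla f(x)$ to be orthogonal to all $D$ eigenvectors, i.e., $\nabla f(x)=0$ with all eigenvalues negative, recovering the local maxima of $f$.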

The mean shift algorithm

The MS algorithm is a non-parametric, iterative technique for locating modes of a pdf obtained via a kernel density estimate (see, e.g. [15]) from a given data set. These modes play an important role in many machine learning applications, such as classification [12], image segmentation [16], and object tracking [17].

The MS algorithm iteratively updates its mode estimate to a weighted average of the neighboring data points to find a stationary point of the estimated pdf [12]. Specifically, let $\{x_i\}_{i=1}^n \subset \mathbb{R}^D$ denote the data samples from which the kernel density estimate $\hat f$ is formed.
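As a concrete illustration of this update, here is a minimal sketch of the MS iteration with a Gaussian kernel (the function name, bandwidth $h$, iteration cap, and tolerance are our choices, not from the paper):

```python
import numpy as np

def mean_shift(x0, data, h=1.0, max_iter=500, tol=1e-7):
    """Iterate the MS update: move the current estimate to the
    kernel-weighted average of the data, stopping when the shift
    between consecutive iterates falls below tol."""
    y = np.asarray(x0, dtype=float)
    data = np.asarray(data, dtype=float)
    for _ in range(max_iter):
        # Gaussian kernel weights exp(-||y - x_i||^2 / (2 h^2))
        w = np.exp(-np.sum((data - y) ** 2, axis=1) / (2.0 * h ** 2))
        y_next = w @ data / w.sum()           # weighted average of the data
        if np.linalg.norm(y_next - y) < tol:  # consecutive shifts vanish
            return y_next
        y = y_next
    return y
```

The stopping rule is justified by the property, established in this paper for the SCMS as well, that the distance between consecutive points of the output sequence converges to zero.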

Subspace constrained mean shift algorithms

Under some regularity conditions, the set of local maxima of a pdf is exactly the zero-dimensional principal manifold $\mathcal{P}_0$ resulting from Definition 1 for $d=0$. The SCMS algorithm [9] generalizes the MS algorithm to estimate higher order principal curves and surfaces ($d \ge 1$). Similar to the MS algorithm, the SCMS algorithm starts from a finite data set sampled from the probability distribution, forms a kernel density estimate $\hat f$ based on the data, and in each iteration it evaluates the MS vector.
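To fix ideas, here is a minimal sketch of one SCMS iteration for a Gaussian kernel density estimate. It uses the Gaussian-kernel identity that the local inverse covariance $\hat\Sigma^{-1}(y)$ equals the negative Hessian of $\log \hat f(y)$; the function name, bandwidth $h$, and default $d$ are our choices:

```python
import numpy as np

def scms_step(y, data, h=1.0, d=1):
    """One subspace constrained mean shift iteration (sketch).
    y: current point (D,); data: samples (n, D); d: surface dimension."""
    D = data.shape[1]
    diff = (data - y) / h ** 2                 # (x_i - y) / h^2
    w = np.exp(-np.sum((data - y) ** 2, axis=1) / (2.0 * h ** 2))
    f = w.sum()                                # KDE value, up to a constant
    grad = diff.T @ w                          # KDE gradient, same constant
    hess = (diff * w[:, None]).T @ diff - (f / h ** 2) * np.eye(D)
    # local inverse covariance = -Hessian of log f_hat (constants cancel)
    sigma_inv = -hess / f + np.outer(grad, grad) / f ** 2
    # V: eigenvectors of the D-d largest eigenvalues (eigh sorts ascending)
    _, eigvec = np.linalg.eigh(sigma_inv)
    V = eigvec[:, d:]
    m = w @ data / f - y                       # mean shift vector
    return y + V @ (V.T @ m)                   # MS step constrained to span(V)
```

Iterating scms_step until either the distance between consecutive iterates or the projection of the gradient onto the span of $V_j$ falls below a threshold corresponds to the two stopping criteria whose validity is established in this paper.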

Simulation results

An inspection of the proof of Proposition 2 shows that all three statements remain valid if $V_j$, $j=1,2,\dots$, is an arbitrary sequence of $D \times (D-d)$ matrices having orthonormal columns. Thus, for the convergence results to hold, $V_j$ does not have to be the matrix whose columns are the $D-d$ orthonormal eigenvectors corresponding to the $D-d$ largest eigenvalues of $\hat\Sigma^{-1}(y_j)$.

Of course, for the outputs of the algorithm to be meaningful, the columns of $V_j$ should be (nearly) orthogonal to the gradient of $\hat f$ at points …
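To make the orthonormality requirement concrete: any $D \times (D-d)$ matrix can be given orthonormal columns via a QR decomposition, and the resulting projected step never exceeds the unconstrained MS step in length. The dimensions and the random $V$ below are purely illustrative, not one of the variations proposed in the paper:

```python
import numpy as np

D, d = 3, 1
# Orthonormalize a random D x (D-d) matrix; reduced QR gives Q with
# orthonormal columns. (A meaningful V_j should additionally be nearly
# orthogonal to the gradient of the density estimate, as noted above.)
V, _ = np.linalg.qr(np.random.randn(D, D - d))
assert np.allclose(V.T @ V, np.eye(D - d))

m = np.random.randn(D)                 # a stand-in mean shift vector
step = V @ (V.T @ m)                   # subspace constrained step
assert np.linalg.norm(step) <= np.linalg.norm(m) + 1e-12  # projection shrinks
```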

Discussion

We studied the SCMS algorithm for finding principal curves and proved convergence results indicating that it inherits some important convergence properties of the MS algorithm. The more challenging problem of proving the convergence of the sequence generated by the SCMS algorithm is the subject of future research. Further along this line, the study of the optimality of the SCMS algorithm (i.e., its convergence to a principal curve/surface) seems to necessitate a more careful study of the …

Conflict of interest statement

None declared.

Acknowledgments

This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.


References

  • P. Delicado, Another look at principal curves and surfaces, Journal of Multivariate Analysis (2001).

  • X. Li et al., A note on the convergence of the mean shift, Pattern Recognition (2007).

  • J.A. Lee et al., Nonlinear Dimensionality Reduction (2007).

  • T. Hastie et al., Principal curves, Journal of the American Statistical Association (1989).

  • J.D. Banfield et al., Ice floe identification in satellite images using mathematical morphology and clustering about principal curves, Journal of the American Statistical Association (1992).

  • R. Tibshirani, Principal curves revisited, Statistics and Computing (1992).

  • K.Y. Chang et al., A unified model for probabilistic principal surfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001).

  • B. Kégl et al., Learning and design of principal curves, IEEE Transactions on Pattern Analysis and Machine Intelligence (2000).

  • S. Sandilya et al., Principal curves with bounded turn, IEEE Transactions on Information Theory (2002).

  • U. Ozertem et al., Locally defined principal curves and surfaces, Journal of Machine Learning Research (2011).

  • K. Fukunaga et al., Estimation of the gradient of a density function, with applications in pattern recognition, IEEE Transactions on Information Theory (1975).

  • Y. Cheng, Mean shift, mode seeking and clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence (1995).

Youness Aliyari Ghassabeh received the B.Sc. degree from University of Tehran in 2004 and the M.Sc. degree from K.N. Toosi University of Technology, Tehran, Iran in 2006, both in electrical engineering. He is a Ph.D. student in the Department of Mathematics and Statistics, Queen's University, Kingston, Canada. His research interests include machine learning, statistical pattern recognition, image processing, source coding, and information theory.

Tamás Linder received the M.S. degree from the Technical University of Budapest in 1988 and the Ph.D. degree from the Hungarian Academy of Sciences in 1992, both in electrical engineering. He is now a Professor in the Department of Mathematics and Statistics at Queen's University, Canada. His research interests include communications and information theory, source coding and vector quantization, machine learning, and statistical pattern recognition.

Glen Takahara received the M.S. and Ph.D. degrees from Carnegie Mellon University in 1990 and 1994, respectively, both in statistics. He is an Associate Professor in the Department of Mathematics and Statistics at Queen's University, Canada. His interests include Bayesian statistics, statistical algorithms, machine learning, and probability modeling.
