On some convergence properties of the subspace constrained mean shift
Introduction
Dimensionality reduction and manifold-learning techniques provide compact and meaningful representations which facilitate compression, classification, and visualization of high-dimensional data. In many applications it is a realistic assumption that the observed high-dimensional data have an intrinsically low-dimensional structure, so that the data points lie on or near a low-dimensional manifold embedded in the high-dimensional space. A multitude of different algorithms have been introduced to find or approximate such a low-dimensional manifold; see, e.g., [1] for an overview.
In some situations, the observed data can be modeled as low-dimensional “clean” data corrupted by high-dimensional noise. In this case, applying common linear or nonlinear dimensionality reduction techniques to the noisy observations may not lead to a meaningful low-dimensional representation. Partly to overcome this problem, nonlinear generalizations of principal components, called principal curves (and surfaces), have been proposed. The first formal definition of a principal curve was given by Hastie and Stuetzle [2]. According to their definition, a principal curve is a smooth (one-dimensional) curve that passes through the “middle of a data set.” More formally, a smooth, parameterized curve that does not intersect itself and has finite length inside any bounded ball is a principal curve of a probability distribution if each of its points is the (conditional) mean of the distribution given the set of points that project to it.
Several definitions of principal curves and algorithms to construct them have been proposed based on, or inspired by, Hastie and Stuetzle's original definition (see [3], [4], [5], [6], [7], [8] among others). The aim of these new definitions and algorithms was to address some of the shortcomings of the original (and subsequent) definition(s) and to extend the range of potential applications. Recently, an interesting new definition of principal curves and surfaces has been proposed by Ozertem and Erdogmus [9]. According to this definition, given a smooth (at least twice continuously differentiable) probability density function (pdf) f on ℝ^n, the d-dimensional principal surface (0 ≤ d < n) is the collection of all points where the gradient of f is orthogonal to exactly n−d eigenvectors of the Hessian of f, and the eigenvalues corresponding to these eigenvectors are negative. Thus each point on the principal surface is a local maximum of the pdf in an (n−d)-dimensional affine subspace, and the principal surface is a d-dimensional ridge of the pdf. An attractive property of this new definition is that the smoothness of principal curves and surfaces is not stipulated by their definition, but rather it is inherited from the smoothness of the underlying pdf or its estimate.
To estimate principal curves/surfaces based on the new definition, Ozertem and Erdogmus [9] proposed the so-called subspace constrained mean shift (SCMS) algorithm. It is a generalization of the well-known mean shift (MS) algorithm [10], [11], [12] that iteratively tries to find modes of a pdf (estimated from data samples) in a local subspace. On synthetic data sets the performance of the SCMS algorithm is comparable to (and in some situations better than) the principal curve algorithms of Hastie and Stuetzle [2] and Kégl et al. [7], and it is computationally less demanding. Moreover, in contrast to most previous principal curve algorithms, the SCMS algorithm can naturally handle loops and self-intersections, and it easily generalizes from principal curves to surfaces. Applications to time-series denoising and independent component analysis (among others) were also presented in [9]. Recently, the present authors have successfully applied a version of the SCMS algorithm to vector quantization of noisy sources [13].
Based on an assertion in [12] that the MS algorithm converges, Ozertem and Erdogmus claimed that their SCMS algorithm converges to a principal curve/surface. However, Li et al. [14] pointed out a seemingly fundamental mistake in the proof of the convergence of the MS algorithm in [12]. Thus it seems that, similar to most previous principal curve algorithms (with the exception of [7], [8]), no optimality properties for the SCMS algorithm have been proved.
The purpose of this paper is to investigate some convergence properties of the SCMS algorithm. While we cannot prove that the sequence produced by the algorithm converges (let alone to a principal curve/surface), we show a convergence result concerning the estimated pdf values along the output sequence, which is indicative of the ridge property of the newly defined principal curves. We also show that the two stopping criteria proposed in [9] indeed ensure that the algorithm stops after a finite number of steps. Since these criteria are based on the fact that any point on the principal curve/surface is a fixed point of the SCMS algorithm, these results can be considered as steps toward proving the optimality of the SCMS algorithm, or an improved version of it. In addition, we introduce three variations of the SCMS algorithm for which our convergence results also apply. The performance of these algorithms is compared through simulations.
Locally defined principal curves and surfaces
Let f be a pdf on ℝ^n that is at least twice continuously differentiable with gradient ∇f and Hessian ∇²f. For 0 ≤ d < n, Ozertem and Erdogmus defined the d-dimensional principal surfaces associated with the pdf f as follows:
Definition 1 (Ozertem and Erdogmus [9])
The d-dimensional principal surface associated with the pdf f is the collection of all points x such that the gradient ∇f(x) is orthogonal to exactly n−d eigenvectors of the Hessian ∇²f(x), and the eigenvalues of ∇²f(x) corresponding to these orthogonal eigenvectors are negative.
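As an illustration (ours, not from the paper), Definition 1 can be checked numerically at a point where the gradient and Hessian are available. The helper below and its parameter names (`grad`, `hess`, `d`, and a tolerance `tol` for numerical orthogonality) are hypothetical:

```python
import numpy as np

def on_principal_surface(grad, hess, d, tol=1e-6):
    """Numerically check Definition 1 at a point: the gradient must be
    orthogonal to exactly n-d eigenvectors of the Hessian, and the
    eigenvalues of those eigenvectors must be negative."""
    n = grad.shape[0]
    eigvals, eigvecs = np.linalg.eigh(hess)  # eigenvalues in ascending order
    # indices of eigenvectors (numerically) orthogonal to the gradient
    orth = [i for i in range(n) if abs(eigvecs[:, i] @ grad) < tol]
    return len(orth) == n - d and all(eigvals[i] < 0 for i in orth)
```

For a zero-mean Gaussian density with a diagonal covariance, points on the major axis satisfy the d=1 condition, while generic off-axis points do not; the check above reproduces this behavior.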
The mean shift algorithm
The MS algorithm is a non-parametric, iterative technique for locating modes of a pdf obtained via a kernel density estimate (see, e.g. [15]) from a given data set. These modes play an important role in many machine learning applications, such as classification [12], image segmentation [16], and object tracking [17].
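For concreteness, a Gaussian kernel density estimate of the kind used here is f̂(x) = (1/N) Σ_i K_h(x − x_i) with an isotropic Gaussian kernel of bandwidth h. A minimal sketch (our illustration, not code from the paper):

```python
import numpy as np

def kde(x, data, h=1.0):
    """Gaussian kernel density estimate at x from samples `data` (N x n)."""
    n = data.shape[1]
    sq = np.sum((data - x) ** 2, axis=1)      # squared distances to x
    norm = (2 * np.pi) ** (n / 2) * h ** n    # Gaussian kernel normalizer
    return np.exp(-0.5 * sq / h ** 2).sum() / (len(data) * norm)
```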
The MS algorithm iteratively updates its mode estimate to a weighted average of the neighboring data points to find a stationary point of the estimated pdf [12]. Specifically, let
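Concretely, with a Gaussian kernel each update replaces the current estimate by the kernel-weighted average of the samples. The sketch below is our illustration (function and parameter names are ours), iterating x ← m(x) until the step size falls below a tolerance:

```python
import numpy as np

def mean_shift(x, data, h=1.0, max_iter=500, tol=1e-7):
    """One mode search: repeatedly move x to the Gaussian-kernel-weighted
    average m(x) of the data points until the update is negligible."""
    for _ in range(max_iter):
        w = np.exp(-0.5 * np.sum((data - x) ** 2, axis=1) / h ** 2)
        m = w @ data / w.sum()          # weighted average of the samples
        if np.linalg.norm(m - x) < tol:
            break
        x = m
    return x
```

On a toy sample that is symmetric about the origin with a large bandwidth, the estimated density is unimodal at the centroid and the iteration converges there from any nearby start.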
Subspace constrained mean shift algorithms
Under some regularity conditions, the set of local maxima of a pdf is exactly the zero-dimensional principal manifold resulting from Definition 1 for d=0. The SCMS algorithm [9] generalizes the MS algorithm to estimate higher order principal curves and surfaces (d ≥ 1). Similar to the MS algorithm, the SCMS algorithm starts from a finite data set sampled from the probability distribution, forms a kernel density estimate based on the data, and in each iteration it evaluates the MS vector.
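A single SCMS iteration can be sketched as follows (our illustration with a Gaussian KDE; the paper works with an equivalent local covariance formulation). The MS vector is projected onto the span of the n−d Hessian eigenvectors with smallest eigenvalues, so the point climbs the density only orthogonally to the ridge:

```python
import numpy as np

def scms_step(x, data, h=1.0, d=1):
    """One subspace constrained mean shift iteration (sketch)."""
    diff = data - x
    w = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / h ** 2)
    ms = w @ diff / w.sum()             # mean shift vector m(x) - x
    # Hessian of the Gaussian KDE at x, up to a positive constant factor
    H = (diff * w[:, None]).T @ diff / h ** 4 - np.eye(x.size) * w.sum() / h ** 2
    eigvals, eigvecs = np.linalg.eigh(H)        # ascending eigenvalue order
    V = eigvecs[:, : x.size - d]                # n-d smallest eigenvalues
    return x + V @ (V.T @ ms)                   # constrained update
```

For samples lying on the x-axis, a point started slightly off the axis is pulled almost straight down onto the ridge while barely moving along it.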
Simulation results
An inspection of the proof of Proposition 2 shows that all three statements remain valid if V_k, k = 0, 1, 2, …, is an arbitrary sequence of n × (n−d) matrices having orthonormal columns. Thus, for the convergence results to hold, V_k does not have to be the matrix whose columns are the orthonormal eigenvectors corresponding to the largest eigenvalues of the local inverse covariance estimate.
Of course, for the outputs of the algorithm to be meaningful, the columns of V_k should be (nearly) orthogonal to the gradient of the estimated pdf at the iterates x_k.
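This can be illustrated numerically: even when the eigenvector matrix is replaced by a randomly drawn orthonormal column, the estimated density values along the iterates are nondecreasing. The toy setup below (all names and data ours) tracks the unnormalized Gaussian KDE value after each constrained step:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))   # toy 2-D sample (illustration only)
h = 1.0

def kde_value(x):
    """Gaussian KDE at x, up to a constant positive factor."""
    return np.exp(-0.5 * np.sum((data - x) ** 2, axis=1) / h ** 2).sum()

x = np.array([2.0, -1.5])
values = [kde_value(x)]
for _ in range(50):
    w = np.exp(-0.5 * np.sum((data - x) ** 2, axis=1) / h ** 2)
    ms = w @ (data - x) / w.sum()                  # mean shift vector
    Q, _ = np.linalg.qr(rng.normal(size=(2, 1)))   # random orthonormal column
    x = x + Q @ (Q.T @ ms)                         # constrained update
    values.append(kde_value(x))
```

The monotonicity follows from the convexity of the Gaussian kernel profile together with the fact that the projection matrix QQ^T is symmetric and idempotent, so the projected step never decreases the density estimate.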
Discussion
We studied the SCMS algorithm for finding principal curves and proved a convergence result indicating that it inherits some important convergence properties of the MS algorithm. The more challenging problem of proving the convergence of the sequence generated by the SCMS algorithm is the subject of future research. Further along this line, the study of the optimality of the SCMS algorithm (i.e., its convergence to a principal curve/surface) seems to necessitate a more careful study of the
Conflict of interest statement
None declared.
Acknowledgments
This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
References (25)
P. Delicado, Another look at principal curves and surfaces, Journal of Multivariate Analysis (2001)
X. Li et al., A note on the convergence of the mean shift, Pattern Recognition (2007)
J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction (2007)
T. Hastie, W. Stuetzle, Principal curves, Journal of the American Statistical Association (1989)
J.D. Banfield, A.E. Raftery, Ice floe identification in satellite images using mathematical morphology and clustering about principal curves, Journal of the American Statistical Association (1992)
R. Tibshirani, Principal curves revisited, Statistics and Computing (1992)
K. Chang, J. Ghosh, A unified model for probabilistic principal surfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001)
B. Kégl et al., Learning and design of principal curves, IEEE Transactions on Pattern Analysis and Machine Intelligence (2000)
S. Sandilya, S.R. Kulkarni, Principal curves with bounded turn, IEEE Transactions on Information Theory (2002)
U. Ozertem, D. Erdogmus, Locally defined principal curves and surfaces, Journal of Machine Learning Research (2011)
K. Fukunaga, L. Hostetler, Estimation of the gradient of a density function, with applications in pattern recognition, IEEE Transactions on Information Theory (1975)
Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence (1995)
Youness Aliyari Ghassabeh received the B.Sc. degree from University of Tehran in 2004 and the M.Sc. degree from K.N. Toosi University of Technology, Tehran, Iran in 2006, both in electrical engineering. He is a Ph.D. student in the Department of Mathematics and Statistics, Queens University, Kingston, Canada. His research interests include machine learning, statistical pattern recognition, image processing, source coding, and information theory.
Tamás Linder received the M.S. degree from the Technical University of Budapest in 1988 and the Ph.D. degree from the Hungarian Academy of Sciences in 1992, both in electrical engineering. He is now a Professor in the Department of Mathematics and Statistics at Queen's University, Canada. His research interests include communications and information theory, source coding and vector quantization, machine learning, and statistical pattern recognition.
Glen Takahara received the M.S. and Ph.D. degrees from Carnegie Mellon University in 1990 and 1994, respectively, both in statistics. He is an Associate Professor in the Department of Mathematics and Statistics at Queen's University, Canada. His interests include Bayesian statistics, statistical algorithms, machine learning, and probability modeling.