The effect of finite sample size on on-line K-means
Introduction
On-line algorithms are among the simplest optimization processes in the learning phase of artificial adaptive systems. This feature makes them attractive for handling large (e.g. real-world) training sets using moderate computational effort. In these cases, since the number of examples (N) is usually fixed, the learning algorithm stores the complete training data in memory and passes (e.g. cyclically) through them over and over until a stopping criterion is met.
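As a concrete picture of this scenario, the following minimal sketch (ours, not the paper's code; the update rule, step-size schedule and stopping test are generic placeholders) cycles through a stored training set until convergence:

```python
import numpy as np

def cyclic_online_training(data, params, update, step_size, max_epochs=100, tol=1e-6):
    """Pass cyclically through a fixed training set of N examples,
    applying an on-line update after every single example."""
    t = 0
    for epoch in range(max_epochs):
        old = params.copy()
        for x in data:                            # one cyclic pass over the N stored examples
            params = update(params, x, step_size(t))
            t += 1
        if np.linalg.norm(params - old) < tol:    # stopping criterion
            break
    return params
```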
In some cases this learning scenario is unavoidable, since the derivation of the algorithm is intrinsically on-line (e.g. supervised LVQ algorithms [12]). In other cases, where the algorithm is related to a particular cost function, its use is supported by the empirical evidence that convergence is faster than for batch versions when the data are redundant, a typical situation in real-life data [3], [4]. Another typical argument in favor of on-line versions is that they can be considered 'noisy' versions of batch algorithms, and so may escape from local minima more easily.
Although the use of on-line learning algorithms is common practice when there is cyclic or random access to a fixed set of examples, there is no guarantee of convergence. From a theoretical point of view, convergence is only guaranteed when N tends to infinity [15]. (This holds for both constant and decreasing step sizes [3].) Moreover, the theory that studies these systems does not usually provide any hints about the practical convergence rate. Hence, to get more insight into the finite-sample properties of these algorithms, theoretical analysis must always be complemented by simulation studies.
This paper addresses the study of the finite-sample convergence of the popular K-means clustering algorithm [16]. This algorithm is widely employed in vector quantization [10], as an initialization for other, more powerful learning systems (like radial basis function networks [17] or Kohonen's LVQ algorithms [12]), and also as a learning algorithm for data-dependent partitioning classifiers [9]. We study the on-line version under cyclic or any other presentation of the training data, with a constant or variable step size. Emphasis is placed on comparing the cyclic on-line version with the batch and infinite-sample on-line versions. For the study of the learning algorithm we make a non-statistical analysis, based mainly on discrete-time dynamical systems theory [8], [18], [19].
The organization of this paper is as follows. Section 2 reviews on-line and batch versions of the K-means algorithm. In Section 3, we present our study of the finite-sample convergence introducing an asymptotic first-order model of the on-line K-means. To give a complete view, we also include convergence studies of the other versions. Section 4 briefly describes the generalization performance of these algorithms. Next, experimental results are presented in order to validate the proposed model of the finite-sample on-line K-means. Finally, we include some discussion and conclusions.
Section snippets
Optimal K-means
We wish to design a codebook (or set of prototypes of size K) for a vector quantizer VQ. A VQ of dimension p and size K is defined as a mapping from a p-dimensional Euclidean space, $\mathbb{R}^p$, into a set or codebook $C = \{m_1, \ldots, m_K\}$, $m_j \in \mathbb{R}^p$. Associated with every code vector $m_j$ there is a region of influence $R_j$ where VQ maps any input vector that falls into it to $m_j$. Since we use a nearest neighbor quantizer, $R_j$ is defined by
$$R_j = \{x \in \mathbb{R}^p : \lVert x - m_j \rVert \le \lVert x - m_i \rVert,\ \forall i \neq j\}.$$
Thus, VQ can be expressed as
$$\mathrm{VQ}(x) = m_j \quad \text{if } x \in R_j.$$
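A minimal sketch of this nearest-neighbor quantizer (our Python, with hypothetical names; ties are broken towards the lowest index):

```python
import numpy as np

def vq(x, codebook):
    """Nearest-neighbor quantizer: map x in R^p to the closest of the
    K codevectors m_1..m_K, i.e. the m_j whose region R_j contains x."""
    j = np.argmin(np.linalg.norm(codebook - x, axis=1))  # ties -> lowest index
    return codebook[j]
```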
Study of convergence
K-means works in input regions of high probability (i.e. regions with a high density of input patterns), placing codevectors so as to approximate discretely the probability density or, if the real density is unknown, the empirical density of the samples observed in the training set.
In this section, we study the convergence of the three versions of K-means presented in Section 2 and the relations between their three solutions. More precisely, we will present a new study of the convergence of
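As a concrete reference for these versions, here is a minimal sketch (ours, not the paper's code) of the two finite-sample updates: the batch step, which moves each codevector to the empirical mean of its cell, and the on-line step, which nudges only the winning codevector towards the current example.

```python
import numpy as np

def batch_kmeans_step(codebook, data):
    """One batch step: assign every sample to its nearest codevector,
    then move each codevector to the empirical mean of its cell."""
    assign = np.argmin(((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1), axis=1)
    new = codebook.copy()
    for j in range(len(codebook)):
        cell = data[assign == j]
        if len(cell) > 0:           # leave empty cells untouched
            new[j] = cell.mean(axis=0)
    return new

def online_kmeans_step(codebook, x, alpha):
    """One on-line step: move only the winning codevector towards x."""
    j = np.argmin(((codebook - x) ** 2).sum(-1))
    codebook[j] += alpha * (x - codebook[j])
    return codebook
```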
Generalization properties
The K-means learning model approximates locally the random variable $X$ by
$$\hat{X} = \mathrm{VQ}(X) = m_{j(X)},$$
where $m_{j(X)}$ is the nearest codevector to $X$.
The expected quantization error will measure how well we approximate $X$ for all possible cases,
$$D(C) = \mathbb{E}\,\lVert X - \mathrm{VQ}(X) \rVert^2.$$
It is easy to show that this error function can be decomposed as the sum of an approximation error and an estimation error,
$$D(\hat{C}_K) = D(C^*_K) + \bigl(D(\hat{C}_K) - D(C^*_K)\bigr),$$
where $C^*_K$ is the optimal codebook of size $K$ and $\hat{C}_K$ the codebook learned from the training set. The approximation error is the error induced by the kind
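For intuition, the expected quantization error of a given codebook can be estimated by Monte Carlo over a sample; a small sketch (ours):

```python
import numpy as np

def quantization_error(codebook, samples):
    """Monte Carlo estimate of E||X - VQ(X)||^2: average squared distance
    of each sample to its nearest codevector."""
    d2 = ((samples[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()
```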
Experimental results
In the experimental part of our work, we study the real convergence properties of the on-line K-means algorithm using artificial data, in order to see how good the linearized model of K-means is near the basin of attraction. Since the simplest and most intuitive expressions have been derived for constant step size and cyclic sampling of the training set, we only perform simulations in this particular case.
Training data were sampled from a 2-dimensional normal distribution with mean and the
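A hedged reconstruction of this experiment (the Gaussian parameters below are placeholders; the snippet omits the actual mean and covariance): sample the training set once, then run on-line K-means with cyclic presentation and constant step size, and compare the resulting fixed points with the empirical means of the final cells.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder parameters: the paper's actual mean/covariance are not in the snippet.
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2), size=500)

K, alpha, epochs = 4, 0.05, 200
codebook = data[rng.choice(len(data), K, replace=False)].copy()

for _ in range(epochs):                       # cyclic presentation, constant step size
    for x in data:                            # fixed order of the N training points
        j = np.argmin(((codebook - x) ** 2).sum(-1))
        codebook[j] += alpha * (x - codebook[j])

print(codebook)   # compare against the empirical means of the final cells
```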
On-line K-means for constant step size and cyclic sampling
The step size α affects the fixed points and their stability considerably. But it is the relation between α and Nj (the number of training points used to compute the fixed point) that determines the fixed point's behavior and value. If this quantity is small enough (e.g. <2), the fixed points are stable and their values tend to the empirical estimators of their optimal counterparts. Otherwise, the fixed points tend to move slightly away from these estimators. Each training example
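The closed formula mentioned in the Conclusions can be sketched as follows (our derivation, assuming the linear model in which the assignment of the $N_j$ cell points $x_1, \ldots, x_{N_j}$ to codevector $j$ is frozen near the attractor). One cyclic epoch of updates $m_j \leftarrow (1-\alpha)\, m_j + \alpha\, x_i$ composes to
$$m_j(t+1) = (1-\alpha)^{N_j}\, m_j(t) + \alpha \sum_{i=1}^{N_j} (1-\alpha)^{N_j - i}\, x_i ,$$
whose fixed point is the weighted conditioned mean
$$m_j^{*} = \frac{\alpha \sum_{i=1}^{N_j} (1-\alpha)^{N_j - i}\, x_i}{1 - (1-\alpha)^{N_j}} .$$
As $\alpha \to 0$ the weights flatten and $m_j^{*}$ tends to the empirical mean $\tfrac{1}{N_j}\sum_i x_i$; the epoch map is stable whenever $|1-\alpha|^{N_j} < 1$, i.e. $0 < \alpha < 2$.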
Conclusions
General expressions for the finite-sample convergence of the on-line K-means algorithm have been presented, where the fixed points of K-means are weighted conditioned means that depend on the training data, the step size function and the method of sampling. In particular, we have derived a closed formula for cyclic presentation and constant step size using a linear model which is valid near the basin of attraction of the non-linear discrete-time dynamical system. In fact, we have
Acknowledgements
The authors acknowledge the valuable comments of the reviewers on a previous version of this paper, which helped to improve the presentation of the mathematics. This research was supported in part by Spanish CICYT action TIC96-0889.
References (19)
- S. Bermejo et al., Finite-sample convergence of on-line LVQ1 and the BLVQ1 algorithm, Neural Process. Lett. (2001)
- Y. Bengio, Artificial neural networks and their application to sequence recognition, Ph.D. Thesis, Department of...
- A. Benveniste et al., Adaptive Algorithms and Stochastic Approximations (1990)
- C.M. Bishop, Neural Networks for Pattern Recognition (1995)
- L. Bottou et al., Convergence properties of k-means
- L. Bottou, Online Learning and Stochastic Approximations
- J.E. Dennis et al., A view of unconstrained optimization
- R.L. Devaney, An Introduction to Chaotic Dynamical Systems (1989)
- L. Devroye et al., A Probabilistic Theory of Pattern Recognition (1996)
Cited by (11)
The incremental online k-means clustering algorithm and its application to color quantization
2022, Expert Systems with Applications
Citation Excerpt: Note that, unlike BKM, OKM traverses the data points in random order, which aims to reduce OKM’s sensitivity to the order in which the data points are processed. Studies have shown that for online learning algorithms like OKM, random traversal is preferable to cyclical traversal, which is used in BKM (Bermejo & Cabestany, 2002). This is because cyclical presentation may bias an online learning algorithm, especially when dealing with redundant data sets such as image data.
Smart motion detection sensor based on video processing using self-organizing maps
2016, Expert Systems with Applications
Citation Excerpt: In particular, it has been applied to several areas of computer vision, such as color quantization (Dekker, 1994; Palomo & Domínguez, 2014; Papamarkos, 1999; Xiao, Leung, Lam, & Ho, 2012), and image segmentation (Bhandarkar, Koh, & Suk, 1997; Dong & Xie, 2005; Lacerda & Mello, 2013; Maddalena & Petrosino, 2008a). The SOM is based on an incremental (online) learning process, which has better ability to escape from local minima than batch learning (Bermejo & Cabestany, 2002) and consumes less computational time in color quantization problems (Chang, Pengfei, Xiao, & Srikanthan, 2005). Moreover, it has been employed previously to detect foreground objects in video sequences (López-Rubio, Luque-Baena, & Domínguez, 2011; Maddalena & Petrosino, 2008a).
Neural networks: An overview of early research, current frameworks and new challenges
2016, Neurocomputing
Citation Excerpt: Several excellent books dedicated to neural networks and machine learning were published in this period, such as those by Haykin [82] and Luo and Unbehauen [83]. In the fourth and last period, which began in approximately 2000 and continues until now, no models have become so popular and aroused such interest as those produced in previous phases; nevertheless the theoretical study of previous models has notably deepened, with exhaustive studies into topics such as convergence analysis, statistical equilibrium, stability [84–88], estimation of states and control of synchronization, aiming to optimize and improve the models [89–95]. The quantitative analysis of neural networks with discontinuous activation functions was also a hot topic in this period [96–99].
Sample-size adaptive self-organization map for color images quantization
2007, Pattern Recognition Letters

Forty years of color quantization: a modern, algorithmic survey
2023, Artificial Intelligence Review

Fast color quantization using MacQueen’s k-means algorithm
2020, Journal of Real-Time Image Processing
Sergio Bermejo received his M.Sc. and Ph.D. degrees in Telecommunications Engineering in 1996 and 2000, respectively, from the Universitat Politècnica de Catalunya (UPC). He holds an Assistant Professor position at UPC's Department of Electronic Engineering. His research interests include Statistical Pattern Recognition and Machine Learning.
Joan Cabestany currently holds a Professor position at the Department of Electronic Engineering of the Universitat Politècnica de Catalunya (UPC). He obtained his M.Sc. and Ph.D. degrees in Telecommunications Engineering in 1976 and 1982, respectively, both from the Universitat Politècnica de Catalunya. His research interests include Analog and Digital Electronic Systems Design, Configurable and Programmable Electronic Systems, and Neural Network Models and applications.