
Neurocomputing

Volume 450, 25 August 2021, Pages 230-241

A local search algorithm for k-means with outliers

https://doi.org/10.1016/j.neucom.2021.04.028

Abstract

k-Means is a well-studied clustering problem that finds applications in many fields related to unsupervised learning. It is known that k-means clustering is highly sensitive to isolated points (outliers). Such outliers can significantly influence the final cluster configuration and should be removed to obtain quality solutions. In this paper, we study the k-means with outliers problem. Given a set of data points, the problem is to remove a set of outliers such that the k-means clustering cost of the remaining points is minimized. Designing efficient algorithms for this problem remains an active area of research due to its important role in dealing with noisy data. We consider a relaxed objective function and propose a local search algorithm for k-means with outliers. We show that the algorithm has better performance guarantees than previously implemented methods; in particular, it yields a constant-factor bi-criteria approximation to the problem. Moreover, we show experimentally that the algorithm performs much better than its provable guarantee and dominates other state-of-the-art methods for the problem.

Introduction

k-means is one of the most fundamental problems in computer science. The problem considers a set $P \subseteq \mathbb{R}^d$ of data points and an integer $k>0$. The goal is to find a set $S \subset \mathbb{R}^d$ of $k$ points (called clustering centers) such that the objective function $\sum_{j\in P}\min_{i\in S}\Delta(i,j)$ is minimized, where $\Delta(i,j)$ denotes the squared distance between $i$ and $j$. This problem has received considerable attention from both theoretical and practical points of view [1], [2], [3], [4], [5], [6], [7].

Although the k-means problem has been well studied, algorithms developed for it can deteriorate significantly in performance when applied to real-world data. One reason is that the problem formulation is not robust to noisy data; that is, a few isolated points can have a great influence on the clustering cost [8]. A hypothetical scenario where the clustering result is disproportionately affected by isolated points is shown in Fig. 1. In many clustering applications, we can obtain better solutions if such isolated points are identified and allowed to be ignored. One example is gene clustering analysis, where algorithms that allow scattered genes to remain unclustered appear more effective [9], [10].

To deal with noisy data, Charikar et al. [11] introduced the problem of clustering with outliers. It is a variant of the standard clustering problem in which a set of up to z points is allowed to be removed, where z>0 is a given integer. The removed points are labeled as outliers and ignored in the objective function. By discarding the set of outliers, the clustering cost can be dramatically reduced, which results in improved clustering quality. The problem is also interesting from the outlier detection point of view [8], [12]. Compared with methods that remove outliers and cluster separately, algorithms developed for the clustering with outliers problem remove more interpretable outliers that can be contextualized by the clusters, which in turn helps yield more compact clusters [13]. Indeed, it has been observed that such a joint view of outlier detection and clustering performs better even on the single task of outlier removal [8], [12].

In this paper, we consider the k-means with outliers problem, which can be defined as follows.

Definition 1

Given a set $P$ of $n$ points in $\mathbb{R}^d$ and two positive integers $k$ and $z$, the goal is to identify a set $S \subset \mathbb{R}^d$ of no more than $k$ centers and a set $Z \subseteq P$ of no more than $z$ outliers, such that the objective function $\sum_{j\in P\setminus Z}\Delta(j,S)$ is minimized, where $\Delta(j,S)$ denotes the squared distance from $j$ to its nearest center in $S$.
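
As a concrete illustration of this objective, the following minimal Python sketch evaluates the cost of a candidate solution; the helper name kmeans_outliers_cost and the toy data are our own illustrative choices, not part of the paper.

```python
import numpy as np

def kmeans_outliers_cost(P, S, Z):
    """Cost of a candidate solution to k-means with outliers.

    P : (n, d) array of data points
    S : (k, d) array of chosen centers
    Z : index set of points labeled as outliers (|Z| <= z)
    Returns the sum, over the remaining points, of the squared distance
    to the nearest center, i.e. sum over j in P \ Z of Delta(j, S).
    """
    keep = np.setdiff1d(np.arange(len(P)), np.asarray(list(Z)))
    # squared distances from every kept point to every center
    d2 = ((P[keep, None, :] - S[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# toy usage: two tight clusters plus one far-away point treated as the outlier
P = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [100.0, 100.0]])
S = np.array([[0.05, 0.0], [5.05, 5.0]])
print(kmeans_outliers_cost(P, S, Z={4}))  # small cost once the far point is removed
```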

During the past decades, the k-means with outliers problem has found wide applications in many fields involving noisy data, such as microarray analysis [9], [10], image processing [14], and text classification [15]. Krishnaswamy et al. [16] gave a $(53.002+\epsilon)$-approximation (for any $\epsilon>0$) for the k-means with outliers problem based on an iterative LP-rounding technique, which is the current best approximation bound for the problem. The k-means with outliers problem also admits polynomial-time approximation schemes for fixed $k$ [17] or fixed dimensionality [6], [18]. Despite their theoretical significance, these algorithms are fairly complicated and thus hard to implement for practical purposes. On the other hand, practical algorithms have been proposed for k-means with outliers, such as the linear programming-based algorithm given in [13], which optimizes a facility location objective using a subgradient approach, and the k-means-- algorithm introduced in [8]. Starting with a set of randomly selected clustering centers, the k-means-- algorithm iteratively computes the outliers defined by the centers and updates the centers based on the newly identified outliers. However, these algorithms are at best heuristics, and no performance guarantee is known for them.

In practice, a commonly used way of relaxing the k-means with outliers problem is to allow the number of removed outliers to be slightly more than the desired number. This is feasible in applications where the clusters are tolerant to small disturbances. Under this assumption, practical algorithms with provable performance guarantees have been obtained. Bhaskara et al. [19] showed that a variant of the k-means++ algorithm (introduced by Arthur and Vassilvitskii [3], which selects a set of points as the clustering centers based on a randomized sampling method) yields an $O(\log k)$-approximation for the problem, which violates the number of outliers by a factor of $O(\log k)$. The algorithm in [19] is similar to k-means++, except that the sampling probabilities are modified so that points far away from the clustering centers are selected with lower probabilities, making it robust to outliers. Gupta et al. [12] gave a local search algorithm for k-means with outliers. To deal with the challenges caused by the outliers, the algorithm in [12] was designed to be quite conservative when identifying outliers: in each local search step, it discards $z$ additional outliers if doing so significantly reduces the cost of the solution. It was shown that this local search algorithm yields an $O(1)$-approximation for k-means with outliers, but identifies $O(zk\log \mathrm{cost}_0)$ points as outliers, where $\mathrm{cost}_0$ denotes the cost of the randomly selected initial solution. Im et al. [20] gave a density-based preprocessing method for identifying outliers. For each to-be-clustered point, this method counts the number of points located in a closed ball centered at it. A point is labeled as a light point if relatively few points are around it, and as an outlier if it has only light points around it. It was shown that the method achieves an $O(1)$-approximation while removing at most $O(zk)$ outliers, and that in the case where each cluster in an optimal solution contains more than $O(z)$ points, it removes no more than $O(z)$ outliers.
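
The density-based labeling rule of [20] described above can be sketched as follows; the ball radius r and the lightness threshold min_neighbors are hypothetical illustration parameters, whereas the actual thresholds in [20] are derived from the instance.

```python
import numpy as np

def label_outliers_by_density(P, r, min_neighbors):
    """Sketch of the density-based labeling idea described above.

    A point is 'light' if fewer than min_neighbors points lie in the closed
    ball of radius r around it; it is labeled an outlier if every point in
    that ball is light.  Both r and min_neighbors are illustration
    parameters, not the values used in [20].
    """
    d2 = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
    in_ball = d2 <= r * r                        # each ball includes the point itself
    light = in_ball.sum(axis=1) < min_neighbors  # too few points nearby
    outlier = np.array([light[in_ball[i]].all() for i in range(len(P))])
    return light, outlier
```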

In this paper, we ask: can we solve the k-means with outliers problem better if slightly more than $z$ points can be labeled as outliers and removed? It is known that if $O(zk)$ points are labeled as outliers, then we can obtain $O(1)$-approximation solutions to the problem using practical algorithms [12], [20]; but what can we do if the number of outliers is only allowed to be violated by a constant factor?

Our main result is a positive one in this direction. We give an efficient single-swap local search algorithm for k-means with outliers, which yields an O(1)-approximation for the problem and identifies O(z) points as outliers.

Theorem 1

Given an instance (P,k,z) of the k-means with outliers problem, there is a single-swap local search algorithm that identifies a set $S \subset \mathbb{R}^d$ of $k$ centers and a set $Z \subseteq P$ of outliers, such that $\sum_{j\in P\setminus Z}\Delta(j,S)=O(\mathrm{opt})$ and $|Z|=O(z)$, where $\mathrm{opt}$ denotes the cost of an optimal solution.

When compared to the sampling-based algorithm given in [19] and the density-based algorithm given in [20], our algorithm differs in that it considers a relaxed objective function, which is optimized by a local search approach. In the relaxed objective function, the constraint on the number of outliers is dropped, but a penalty for its violation is taken into account. Our algorithm starts with a set of $k$ randomly selected centers and then tries to swap a center with a non-center point. It terminates if no such swap yields a significantly improved solution; otherwise, it iterates with the improved solution. This local search approach is similar to the one used in [12], but quite different in how it identifies outliers. The algorithm in [12] removes additional outliers in each iteration whenever doing so is beneficial. This provides a clear way of bounding the clustering cost, but, from a theoretical point of view, labels a large number of "inliers" as outliers (i.e., it removes $O(zk\log \mathrm{cost}_0)$ outliers in total). Based on the relaxed objective function, we show that removing $O(z)$ outliers is sufficient to ensure a constant-factor approximation. A minimal sketch of this relaxed-objective local search scheme is given below.
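
The following Python sketch illustrates the relaxed-objective single-swap scheme described above. The penalty weight lam, the uniform random seeding, and the acceptance threshold are assumptions made for illustration; the actual penalty and termination rule of Algorithm 1 are specified in Section 2.1 of the paper.

```python
import numpy as np

def relaxed_cost(P, S, lam):
    """Relaxed objective: the hard bound |Z| <= z is dropped and each removed
    point is charged a penalty lam instead.  For fixed centers S, the best
    choice of outliers is every point whose squared distance to S exceeds lam,
    so the objective becomes sum over j of min(Delta(j, S), lam)."""
    d2 = ((P[:, None, :] - S[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return np.minimum(d2, lam).sum(), set(np.where(d2 > lam)[0])

def single_swap_local_search(P, k, lam, eps=1e-2, rng=np.random.default_rng(0)):
    """Minimal sketch of single-swap local search on the relaxed objective."""
    S = P[rng.choice(len(P), size=k, replace=False)].astype(float)
    cost, Z = relaxed_cost(P, S, lam)
    improved = True
    while improved:
        improved = False
        for i in range(k):                      # try closing center i ...
            for j in range(len(P)):             # ... and opening point j instead
                T = S.copy()
                T[i] = P[j]
                new_cost, new_Z = relaxed_cost(P, T, lam)
                if new_cost < (1 - eps / k) * cost:  # accept only significant improvements
                    S, cost, Z, improved = T, new_cost, new_Z, True
                    break
            if improved:
                break
    return S, Z
```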

Our algorithm is also similar to the local search algorithm for the standard k-means problem [2], [21], but proving that it has performance guarantees is much more difficult due to the existence of outliers. We make use of the fact that the cost of the local optimum given by the local search algorithm cannot be significantly decreased by swapping a center with a non-center point. The performance guarantee is obtained by analysing the cost changes induced by a set of such swaps. The main difficulty in extending the analysis for the standard k-means problem [2], [21] to k-means with outliers lies in the difference between the sets of points removed by the optimal solution and by the local optimum. This difference means that we have to upper-bound the clustering cost induced by some of the points removed in an optimal solution. We give a new approach for constructing swap pairs and analysing the cost changes induced by the swaps to deal with this obstacle, which is the crucial step in obtaining the bi-criteria constant-factor approximation.

In the experiments, we show that our algorithm performs much better than what Theorem 1 suggests, even in the case where the removed outliers are not allowed to exceed the specified number. We experimentally compare our algorithm with the local search algorithm given in [12], and show that our algorithm outperforms the latter in terms of both running time and solution quality. Similar results are found when our algorithm is compared against other state-of-the-art methods, such as the recent variant of the k-means++ algorithm given in [19] and the density-based algorithm given in [20].

Given two points $i,j \in \mathbb{R}^d$, let $\delta(i,j)$ and $\Delta(i,j)$ denote the distance and squared distance from $i$ to $j$, respectively. For any $A \subseteq \mathbb{R}^d$ and $j \in \mathbb{R}^d$, define $\Delta(j,A)=\min_{i\in A}\Delta(j,i)$, and let $\Gamma(A)=\arg\min_{i\in\mathbb{R}^d}\sum_{j\in A}\Delta(i,j)$ be the center minimizing the 1-mean clustering cost of $A$.

The following result is useful in bounding the clustering cost of the points.

Lemma 1

For any $\gamma>0$ and $i,j,l \in \mathbb{R}^d$, we have $\Delta(i,j) \le (1+\gamma)\Delta(i,l) + \left(1+\frac{1}{\gamma}\right)\Delta(l,j)$.

Proof

Observe that $\Delta(i,j) \le (\delta(i,l)+\delta(l,j))^2 = \Delta(i,l)+\Delta(l,j)+2\delta(i,l)\delta(l,j) \le \Delta(i,l)+\Delta(l,j)+\gamma\Delta(i,l)+\frac{1}{\gamma}\Delta(l,j)$, where the first step follows from the triangle inequality and the last step follows from the AM–GM inequality, which gives $2\delta(i,l)\delta(l,j) \le \gamma\Delta(i,l)+\frac{1}{\gamma}\Delta(l,j)$.

We will also use the following well known property of the squared Euclidean metric.

Lemma 2

(Kanungo et al. [2]) For any $i \in \mathbb{R}^d$ and $A \subseteq \mathbb{R}^d$, we have $\sum_{j\in A}\Delta(i,j) = |A|\cdot\Delta(i,\Gamma(A)) + \sum_{j\in A}\Delta(\Gamma(A),j)$.

Let (P,k,z) be an instance of the k-means with outliers problem. Let $(S,Z)$ denote a local optimum generated by our local search algorithm and $(S^{*},Z^{*})$ be an optimal solution. We need to show that $(S,Z)$ is a bi-criteria constant-factor approximation solution to the problem (i.e., $\sum_{j\in P\setminus Z}\Delta(j,S)=O(\sum_{j\in P\setminus Z^{*}}\Delta(j,S^{*}))$ and $|Z|=O(z)$). We use the technique of Lagrangian relaxation and consider a relaxed objective function, which eliminates the constraint on the number of removed outliers but pays a penalty for its violation (see Section 2.1). In Section 3.2, we show that minimizing this relaxed objective function is sufficient to obtain the desired approximation solution. Given a solution $(S,Z)$ to the problem, let $\Phi(S,Z)$ denote its value under the relaxed objective function. Our main task is to prove that $\Phi(S,Z)=O(\Phi(S^{*},Z^{*}))$. We carefully select a set of potential swaps between the centers opened by the local optimum and those opened by the optimal solution. The change in cost induced by these swaps is bounded by $u\Phi(S^{*},Z^{*})-v\Phi(S,Z)$ for two constants $u$ and $v$ (see Lemma 6 and Lemma 7). Given that no swap from this set can significantly reduce the cost of $(S,Z)$ (since $(S,Z)$ is a local optimum), we have $-\epsilon\Phi(S,Z) \le u\Phi(S^{*},Z^{*})-v\Phi(S,Z)$ for some small constant $\epsilon$, which implies that $\Phi(S,Z)$ is within a constant factor of $\Phi(S^{*},Z^{*})$. These ideas lead to the proof of Theorem 1.
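
To make this outline concrete, the display below sketches the relaxed objective and the inequality chain it leads to. The per-outlier penalty weight $\lambda$ is shown only as an assumed form; the exact relaxed objective used by the algorithm is defined in Section 2.1.

```latex
% Assumed form of the Lagrangian-relaxed objective, with a penalty \lambda per removed point:
\Phi(S,Z) \;=\; \sum_{j \in P \setminus Z} \Delta(j,S) \;+\; \lambda\,|Z|.
% Since (S,Z) is a local optimum, no selected swap reduces \Phi(S,Z) significantly;
% summing the per-swap bounds of Lemmas 6 and 7 gives
-\,\epsilon\,\Phi(S,Z) \;\le\; u\,\Phi(S^{*},Z^{*}) \;-\; v\,\Phi(S,Z)
\quad\Longrightarrow\quad
\Phi(S,Z) \;\le\; \frac{u}{v-\epsilon}\,\Phi(S^{*},Z^{*}).
```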

The clustering problem is to partition a given set of points into several clusters such that points in the same cluster are more similar to each other than to points in other clusters. Considerable effort has been devoted to developing clustering methods, including partitional methods [22], [23], [24], message-passing methods [25], density-based methods such as DBSCAN [26] and the density peak search algorithm introduced in [27], hierarchical methods such as BIRCH [28] and ROCK [29], probability model-based methods [30], [31], and local learning-based methods [32], [33]. Algorithms have also been developed for outlier detection [34], [35], [36], [37], [38], [39], [40]. These algorithms identify outliers by measuring how isolated each point is with respect to its surrounding neighborhood. In this paper, we deal with the problems of clustering and outlier detection in an integrated way, as done in [8], [12], [13], [19], [20].

Depending on the objective function, the clustering problem has many different variants, among which the k-means problem is one of the most extensively studied. It is known that the k-means problem is NP-hard even in the plane [41], and even for $k=2$ [42]. This has led to considerable effort devoted to obtaining approximation algorithms for it. The first constant-factor approximation for the k-means problem was given by Jain and Vazirani [43], who introduced a $(108+\epsilon)$-approximation using the techniques of primal-dual and Lagrangian relaxation. Kanungo et al. [2] later showed that a simple local search algorithm yields a $(9+\epsilon)$-approximation. Most recently, Ahmadian et al. [7] improved the approximation ratio to $6.357+\epsilon$ based on the framework outlined in [43].

Section snippets

The algorithm

In this section we give our algorithm for k-means with outliers. The algorithm considers an instance (P,k,z) of the problem. It outputs a set $S$ of clustering centers and a set $Z$ of outliers. Let $\tilde{S}$ and $\tilde{Z}$ be the sets of centers and outliers in an optimal solution to (P,k,z), respectively. Let $\mathrm{opt}=\sum_{j\in P\setminus\tilde{Z}}\Delta(j,\tilde{S})$ denote the cost of an optimal solution to (P,k,z).

Running time

We now show the running time of Algorithm 1. Let S0 denote the initial set of centers of the algorithm, and S be the set of the returned centers. We have the following result.

Lemma 5

Algorithm 1 runs in time $O\!\left(n(k+z)\left(d+\frac{1}{\epsilon}k^{2}\log\frac{\Phi(S_0)}{\Phi(S)}\right)\right)$.

Proof

For each $j\in P$, Algorithm 1 computes its distance to each $i\in F$, which takes $O(nd(k+z))$ time. To efficiently find the nearest center to each $j\in P$, we sort the centers of the initial set $S_0$ by increasing distance to $j$. The sorting takes $O(nk\log k)$ time in total. These sequences…
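
The preprocessing step described in this (truncated) proof can be sketched as follows; treating the candidate set $F$ as a plain array and consuming the sorted sequences via a linear scan are our own simplifying assumptions for illustration.

```python
import numpy as np

def sorted_center_sequences(P, F):
    """For each point j, precompute the indices of the candidate centers F
    sorted by increasing distance to j.  Later, j's nearest *open* center can
    be found by scanning this sequence and stopping at the first open entry,
    instead of recomputing all distances after every swap."""
    d2 = ((P[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)   # O(n |F| d) time
    return np.argsort(d2, axis=1)                             # O(n |F| log |F|) time

def nearest_open_center(seq_j, is_open):
    """Scan point j's sorted sequence and return the first open center index."""
    for idx in seq_j:
        if is_open[idx]:
            return idx
    return None
```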

Experiments

The performance of our algorithm is evaluated in this section. We refer to our algorithm as relaxation-based local search (RLS). We use $\epsilon=10^{-2}$ in the experiments.

Baselines. We compare RLS with the following state-of-the-art methods for clustering noisy data.

  • k-means-- from [8]. It is an extension of the heuristic algorithm given by Lloyd [22]. The initial $k$ centers are selected using the k-means++ algorithm [3]. A minimal sketch of this baseline is given after this list.

  • T-k-means++ from [19]. It is a variant of the k-means++ algorithm [3], which yields…
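
For reference, the k-means-- baseline from [8] mentioned above can be sketched as follows. Uniform random seeding replaces the k-means++ initialization here for brevity, and the iteration count is a hypothetical parameter, so this is only an approximate illustration of the baseline rather than the implementation used in the experiments.

```python
import numpy as np

def kmeans_minus_minus(P, k, z, iters=100, rng=np.random.default_rng(0)):
    """Sketch of the k-means-- style iteration described above: in each round,
    the z points farthest from the current centers are set aside as outliers
    and the centers are recomputed as the means of the remaining points."""
    S = P[rng.choice(len(P), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d2 = ((P[:, None, :] - S[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        dist = d2.min(axis=1)
        outliers = np.argsort(dist)[-z:]                 # z farthest points
        keep = np.setdiff1d(np.arange(len(P)), outliers)
        for i in range(k):                               # Lloyd-style update on inliers
            members = keep[nearest[keep] == i]
            if len(members) > 0:
                S[i] = P[members].mean(axis=0)
    return S, set(outliers)
```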

Conclusions

In this paper we give a local search algorithm for the k-means with outliers problem. We show that the algorithm has the guarantee of yielding a constant-factor bi-criteria approximation for the problem. The algorithm is easy to implement. The experimental results indicate that the algorithm outperforms other state-of-the-art algorithms on real datasets.

CRediT authorship contribution statement

Zhen Zhang: Writing - original draft, Writing - review & editing, Formal analysis. Qilong Feng: Conceptualization, Supervision. Junyu Huang: Software. Jinhui Xu: Validation, Data curation. Jianxin Wang: Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (46)

  • S. Chawla et al., k-means--: a unified approach to clustering and outlier detection.
  • A. Thalamuthu et al., Evaluation and comparison of gene clustering methods in microarray analysis, Bioinformatics (2006).
  • G.C. Tseng, Penalized and weighted k-means for clustering with scattered objects and prior information in high-throughput biological data, Bioinformatics (2007).
  • M. Charikar et al., Algorithms for facility location problems with outliers.
  • S. Gupta et al., Local search methods for k-means with outliers, PVLDB (2017).
  • L. Ott et al., On integrated clustering and outlier detection.
  • R.T. Ionescu et al., Detecting abnormal events in video using narrowed normality clusters, IEEE Winter Conference on Applications of Computer Vision (WACV) (2019).
  • M. Imran et al., A robust framework for classifying evolving document streams in an expert-machine-crowd setting.
  • R. Krishnaswamy et al., Constant approximation for k-median and k-means with outliers via iterative rounding.
  • Q. Feng, Z. Zhang, Z. Huang, J. Xu, J. Wang, Improved algorithms for clustering with outliers, in: Proc. 30th...
  • Z. Friggstad, K. Khodamoradi, M. Rezapour, M.R. Salavatipour, Approximation schemes for clustering with outliers, ACM...
  • A. Bhaskara et al., Greedy sampling for approximate clustering in the presence of outliers.
  • S. Im et al., Fast noise removal for k-means clustering.


    Zhen Zhang received the B.E. degree in computer science and technology from Central South University, China, where he is currently pursuing the Ph.D. degree in computer science and technology. His research interests include combinatorial optimization, approximation algorithms, and machine learning.

    Qilong Feng received the Ph.D. degree in computer science from Central South University, China, in 2010. Currently, he is a professor in School of Computer Science, Central South University, China. His research interests include algorithm analysis and optimization, clustering algorithms, and machine learning.

    Junyu Huang received the B.S. degree in biomedical engineering from Central South University, China, in 2017, and is currently working toward the Ph.D. degree in School of Computer Science, Central South University, China. His research interests include machine learning and approximation algorithms.

    Yutian Guo received the B.S. degree from Xi’an University of Finance and Economics, China, in 2018, and is currently working toward the M.S. degree in School of Computer Science, Central South University, China. His research interests include machine learning and approximation algorithms.

    Jinhui Xu received the B.S. and M.S. degrees in computer science from the University of Science and Technology of China in 1992 and 1995, respectively, and the PhD degree in computer science and engineering from the University of Notre Dame, in 2000. He is currently a professor of computer science and engineering with the State University of New York at Buffalo. His research interests include algorithms, computational geometry, machine learning, optimization, and their applications in medicine, biology, networking, and 3D printing.

    Jianxin Wang received the B.E. and M.E. degrees in computer science from Central South University of Technology, Changsha, China, and the Ph.D. degree in computer science from Central South University, Changsha, China. He is currently the Dean and also a Professor in the School of Computer Science and Engineering, Central South University, Changsha, China. He has published more than 200 papers in various international journals and refereed conferences. His current research interests include algorithm analysis and optimization, bioinformatics, and computer network. He is the Chair of the ACM Sigbio China.

    This work was supported by National Natural Science Foundation of China (61872450 and 71631008), and Hunan Provincial Science and Technology Program (2018WK4001).
