Elsevier

Expert Systems with Applications

Volume 67, January 2017, Pages 228-238
An evolutionary algorithm for clustering data streams with a variable number of clusters

https://doi.org/10.1016/j.eswa.2016.09.020

Highlights

  • An evolutionary algorithm for clustering data streams is proposed.

  • Our algorithm allows estimating k automatically from the data in an online fashion.

  • It monitors any degradation in the quality of the induced clusters.

  • Results show our algorithm correctly detects, and reacts to, changes in a data stream.

  • The proposed method is very competitive in terms of accuracy and processing time.

Abstract

Several algorithms for clustering data streams based on k-Means have been proposed in the literature. However, most of them assume that the number of clusters, k, is known a priori by the user and can be kept fixed throughout the data analysis process. Besides the difficulty of choosing k, data stream clustering poses further challenges, such as handling non-stationary, unbounded data that arrive in an online fashion. In this paper, we propose a Fast Evolutionary Algorithm for Clustering data streams (FEAC-Stream) that allows estimating k automatically from data in an online fashion. FEAC-Stream uses the Page–Hinkley Test to detect any degradation in the quality of the induced clusters, thereby triggering an evolutionary algorithm that re-estimates k accordingly. FEAC-Stream relies on the assumption that clusters of (partially unknown) data can provide useful information about the dynamics of the data stream. We illustrate the potential of FEAC-Stream in a set of experiments using both synthetic and real-world data streams, comparing it to four related algorithms, namely: CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk and StreamKM++-BkM. The obtained results show that FEAC-Stream provides good data partitions and that it can detect, and accordingly react to, data changes.

Introduction

Advances in both hardware and software have enabled large-scale data acquisition. Currently, enormous amounts of data are being collected in dynamic environments, at high speeds. Such data are usually referred to as data streams. A data stream is an unbounded, ordered sequence of objects that must be accessed in order and that can be read only once or a small number of times (Guha, Meyerson, Mishra, Motwani, & O’Callaghan, 2003). In recent years, data streams have attracted significant attention because of relevant applications, e.g., see Gama (2010); Lughofer, Macian, Guardiola, and Klement (2010); Mouchawe (2010); Zhang, Zhu, Shi, Guo, and Wu (2011).

Data streams must be intelligently transformed into meaningful and actionable information, which can then be used to enable more effective decision-making. To accomplish that goal, machine learning algorithms that are capable of continuous learning over time play a pivotal role. Specifically, data streams require learning algorithms that can adapt their models and forget data samples once they become obsolete. In this context, incremental algorithms are of great relevance because they can avoid the computationally intensive task of re-training the whole model while accounting for dynamic patterns in the data that change over time. Additionally, the data stream must be processed in a single-pass-like manner, i.e., the data stream cannot be read again due to storage limitations. Usually, the data objects are discarded after being processed.
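The single-pass processing and forgetting described above can be illustrated with a running centroid that absorbs each object in constant time and discounts older objects. This is only a minimal sketch: the class name `FadingCentroid`, the `decay` parameter, and the exponential forgetting rule are illustrative assumptions and are not taken from FEAC-Stream.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class FadingCentroid:
    """Running centroid with exponential forgetting (hypothetical helper)."""

    decay: float = 0.9  # assumed forgetting factor in (0, 1]; 1.0 means no forgetting
    weight: float = 0.0  # decayed count of the objects absorbed so far
    linear_sum: List[float] = field(default_factory=list)  # decayed per-dimension sum

    def update(self, x: Sequence[float]) -> None:
        """Absorb one object in O(d) time; the object can then be discarded."""
        if not self.linear_sum:
            self.linear_sum = [0.0] * len(x)
        # Older contributions shrink by `decay` before the new object is added.
        self.weight = self.decay * self.weight + 1.0
        self.linear_sum = [self.decay * s + xi for s, xi in zip(self.linear_sum, x)]

    @property
    def center(self) -> List[float]:
        """Current (decayed) mean of the absorbed objects."""
        if self.weight == 0.0:
            raise ValueError("no objects have been absorbed yet")
        return [s / self.weight for s in self.linear_sum]
```

Only the decayed sum and weight are stored, so memory use does not grow with the length of the stream. With `decay=1.0` the centroid reduces to the ordinary mean of all objects seen.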

A useful way to analyze data streams is clustering (Aggarwal, Han, Wang, & Yu, 2004; Ailon, Jaiswal, & Monteleoni, 2009; Gama, 2010; Shindler, Wong, & Meyerson, 2011; Silva, de Faria, Barros, Hruschka, de Carvalho, & Gama). The literature on clustering is very large. One of the many available algorithms is k-Means, which is very popular for data mining due to its simplicity, scalability, and empirical success in many real-world applications (Jain, 2009; Wu, Kumar, Ross Quinlan, Ghosh, Yang, Motoda, McLachlan, Ng, Liu, Yu, Zhou, Steinbach, Hand, & Steinberg, 2007). Several k-Means variants have been proposed to address data streams, e.g., see Ackermann et al. (2012); Aggarwal, Han, Wang, and Yu (2003); Guha et al. (2003); O’Callaghan, Meyerson, Motwani, Mishra, and Guha (2002). Despite the successful application of these algorithms to many real-world problems, they have a major limitation: the number of clusters, k, must be defined a priori.

From an optimization perspective, clustering can be formally considered to be a specific type of NP-hard grouping problem (Falkenauer, 1998). Evolutionary algorithms are meta-heuristics that are widely believed to produce good, if sub-optimal, solutions to NP-hard problems in a reasonable amount of time. Under this assumption, a large number of evolutionary algorithms for solving clustering problems have been proposed in the literature (see Hruschka, Campello, Freitas, and de Carvalho (2009) for an overview). More specifically, the Fast Evolutionary Algorithm for Clustering (FEAC) (Alves, Campello, & Hruschka, 2006) has been shown to be especially efficient for automatically estimating k from data (Naldi, Campello, Hruschka, & Carvalho, 2011). However, this algorithm was not designed to address data streams. To circumvent this limitation, we extend FEAC so that it can handle data streams. The resulting algorithm is called FEAC-Stream. To the best of our knowledge, it is the first evolutionary algorithm for clustering data streams that estimates k from the data.

In data stream scenarios, ideally the clustering algorithms should be able to update the data partition in an online fashion (Silva et al., 2012). This alternative can save computational resources when clusters do not change significantly over time. In order to determine if there is a change in the data partition, it is necessary to perform a change detection test. Among the alternatives in the literature, the Page–Hinkley (PH) Test (Mouss, Mouss, Mouss, & Sefouhi, 2004) is an efficient method to detect changes in the normal behavior of a process (Gama, Žliobaitė, Bifet, Pechenizkiy, & Bouchachia, 2014). Bearing this property in mind, we propose a change detection procedure that is based on the PH Test (Mouss et al., 2004). Specifically, the PH Test was adapted to detect whether the assignment of an object to the closest cluster increases the intra-cluster distances significantly.
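To make the change detection idea concrete, the textbook form of the Page–Hinkley test for detecting an increase in a monitored signal can be sketched as follows. This is a minimal sketch and not the adapted procedure used by FEAC-Stream: the class name, the parameter names `delta` and `threshold`, their default values, and the helper `first_alarm` are illustrative assumptions. Here the monitored value would be the distance from each arriving object to its closest cluster centroid.

```python
from typing import Iterable, Optional


class PageHinkley:
    """Textbook Page-Hinkley test for an upward shift in a signal's mean."""

    def __init__(self, delta: float = 0.005, threshold: float = 50.0) -> None:
        self.delta = delta          # assumed tolerance for small fluctuations
        self.threshold = threshold  # assumed alarm threshold (often written lambda)
        self.reset()

    def reset(self) -> None:
        """Forget all history, e.g. after the clustering has been rebuilt."""
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0      # cumulative deviation m_T
        self.min_cum = 0.0  # smallest cumulative deviation seen so far, M_T

    def update(self, x: float) -> bool:
        """Absorb one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n  # incremental mean
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


def first_alarm(values: Iterable[float], detector: PageHinkley) -> Optional[int]:
    """Return the index of the first observation that raises an alarm, if any."""
    for i, value in enumerate(values):
        if detector.update(value):
            return i
    return None
```

For example, feeding the detector a run of small distances followed by a run of clearly larger ones should raise an alarm shortly after the shift, at which point a clustering algorithm could re-estimate k and call `reset()`. Larger `threshold` values reduce false alarms at the cost of slower detection.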

The potential of the proposed FEAC-Stream is illustrated by comparing it to the framework proposed in de Andrade Silva and Hruschka (2011), which is based on three state-of-the-art algorithms for clustering data streams, namely, Stream LSearch (O’Callaghan et al., 2002), CluStream (Aggarwal et al., 2003), and StreamKM++ (Ackermann et al., 2012), combined with two algorithms for estimating the number of clusters, which are Ordered Multiple Runs of k-Means (OMRk) (Naldi et al., 2011) and Bisecting k-Means (BkM) (Steinbach, Karypis, & Kumar, 2000).
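The idea of estimating k by running k-Means for several candidate values and keeping the best partition, which underlies OMRk, can be sketched as below. This is a simplified illustration under stated assumptions rather than the procedure of Naldi et al. (2011): it performs a single deterministic run per k with farthest-first seeding (instead of multiple random restarts), and it scores partitions with a centroid-based simplified silhouette. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def farthest_first(points: Sequence[Point], k: int) -> List[List[float]]:
    """Deterministic seeding: start at the first point, then add the farthest one."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(farthest))
    return centers


def kmeans(points: Sequence[Point], k: int, iters: int = 50) -> Tuple[List[List[float]], List[int]]:
    """Plain k-Means (Lloyd's iterations) from the deterministic seeds."""
    centers = farthest_first(points, k)
    labels: List[int] = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def simplified_silhouette(points: Sequence[Point], centers: List[List[float]], labels: List[int]) -> float:
    """Mean of (b - a) / max(a, b), using distances to centroids only."""
    total = 0.0
    for p, lab in zip(points, labels):
        a = math.dist(p, centers[lab])
        b = min(math.dist(p, c) for j, c in enumerate(centers) if j != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def estimate_k(points: Sequence[Point], k_max: int) -> int:
    """Try k = 2..k_max in order and return the k with the best score."""
    scores: Dict[int, float] = {}
    for k in range(2, k_max + 1):
        centers, labels = kmeans(points, k)
        scores[k] = simplified_silhouette(points, centers, labels)
    return max(scores, key=scores.get)
```

Each candidate k costs a full k-Means run, so sweeping a range of values is expensive on large data. This cost is one motivation for search strategies that explore values of k more selectively.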

The remainder of this paper is organized as follows. In Section 2, we briefly review related approaches. Section 3 presents the proposed evolutionary algorithm for clustering data streams (FEAC-Stream). Experimental results are reported in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Related Work

In general, the data stream clustering problem is defined as continuously maintaining a consistently good clustering of the objects processed so far, using a small amount of memory and time (Guha et al., 2003). Ideally, the algorithms should incrementally process the data objects, rapidly detect and react to cluster evolution, provide a model representation that does not grow with the number of objects processed, and handle outliers (Silva et al., 2012). Bearing these issues in mind, several clustering

Evolutionary algorithm for clustering data streams

In this section, we present our evolutionary algorithm for clustering data streams. Evolutionary algorithms are based on the optimization of some objective function that guides the evolutionary search (Hruschka et al., 2009). The Fast Evolutionary Algorithm for Clustering (FEAC) (Alves et al., 2006) tends to perform a computationally more efficient search as compared to multiple, systematic executions of k-Means, e.g., see Naldi et al. (2011). However, the FEAC algorithm cannot handle data

Experiments

We empirically evaluated the FEAC-Stream algorithm by comparing it with two algorithms for clustering data streams, CluStream (Aggarwal et al., 2003) and StreamKM++ (Ackermann et al., 2012), and we combined them with two algorithms for estimating the number of clusters, OMRk (Naldi et al., 2011) and BkM (Steinbach et al., 2000) (for details, see Section 2). Thus, four algorithmic instantiations were compared: CluStream-OMRk (CLS-OMRk), StreamKM++-OMRk (SKM-OMRk), CluStream-BkM (CLS-BkM), and

Final remarks

Many clustering algorithms based on k-Means for processing data streams have been studied. Most of them assume that the number of clusters, k, is known and fixed a priori by the user. Aiming at relaxing this assumption, which is often unrealistic in practical applications, we proposed a Fast Evolutionary Algorithm for Clustering Data Streams (FEAC-Stream), which allows estimating k automatically from the data in an online fashion. Based on a detection point algorithm, FEAC-Stream monitors

Acknowledgments

The authors would like to thank CAPES, CNPq, FAPESP grant #2010/15049-7, and also acknowledge the support of the European Commission through the project MAESTRA (Grant Number ICT-2013-612944).

References (52)

  • M.K. Albertini et al. Data stream dynamic clustering supported by Markov chain isomorphisms. Intelligent Data Analysis (2013)
  • V. Alves et al. Towards a fast evolutionary algorithm for clustering. IEEE Congress on Evolutionary Computation (CEC’06) (2006)
  • M.R. Anderberg. Cluster analysis for applications (1973)
  • J. de Andrade Silva et al. Extending k-means-based algorithms for evolving data streams with variable number of clusters. Fourth International Conference on Machine Learning and Applications - ICMLA’11 (2011)
  • D. Arthur et al. k-means++: The advantages of careful seeding. Proceedings of the SODA’07 (2007)
  • M. Bādoiu et al. Approximate clustering via core-sets. Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (2002)
  • A. Bifet et al. MOA: Massive online analysis. Journal of Machine Learning Research (2010)
  • A. Broder et al. Scalable k-means by ranked retrieval. Proceedings of the 7th ACM International Conference on Web Search and Data Mining (2014)
  • Y. Chen et al. Density-based clustering for real-time stream data. KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007)
  • H. Cui et al. A collaborative divide-and-conquer k-means clustering algorithm for processing large data. Proceedings of the 11th ACM Conference on Computing Frontiers (2014)
  • A.E. Eiben et al. Introduction to evolutionary computing (Natural Computing Series) (2003)
  • B.S. Everitt et al. Cluster analysis (2001)
  • E. Falkenauer. Genetic algorithms and grouping problems (1998)
  • J. Gama. Knowledge discovery from data streams (2010)
  • J. Gama et al. A survey on concept drift adaptation. ACM Computing Surveys (2014)
  • S. Guha et al. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering (2003)
1 Present Address: Campus of Ponta Porã, The University of Mato Grosso do Sul (UFMS-CPPP)

2 Present Address: Department of Computer Science, The University of São Paulo (USP) at São Carlos, São Paulo, Brazil

3 Present Address: Laboratory of Artificial Intelligence and Decision Support, University of Porto (UP), Porto, Portugal
