An evolutionary algorithm for clustering data streams with a variable number of clusters
Introduction
Advances in both hardware and software have enabled large-scale data acquisition. Currently, enormous amounts of data are being collected in dynamic environments, at high speeds. Such data are usually referred to as data streams. A data stream is an unbounded, ordered sequence of objects that must be accessed in order and that can be read only once or a small number of times (Guha, Meyerson, Mishra, Motwani, & O’Callaghan, 2003). In recent years, data streams have attracted significant attention because of relevant applications, e.g., see Gama (2010); Lughofer, Macian, Guardiola, and Klement (2010); Mouchawe (2010); Zhang, Zhu, Shi, Guo, and Wu (2011).
Data streams must be intelligently transformed into meaningful and actionable information, which can then be used to enable more effective decision-making. To accomplish that goal, machine learning algorithms that are capable of continuous learning over time play a pivotal role. Specifically, data streams require learning algorithms that can adapt models, eventually forgetting data samples that become obsolete. In this context, incremental algorithms are of great relevance because they can avoid the computationally intensive task of re-training the whole model while accounting for dynamic patterns in the data that change over time. Additionally, the data stream must be processed in a single-pass-like manner, i.e., the data stream cannot be read again due to storage limitations. Usually, the data objects are discarded after being processed.
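As a minimal illustration of single-pass processing with forgetting, the sketch below maintains an exponentially weighted mean of a numeric stream. It is not part of the algorithm discussed in this paper; the function name `fading_mean` and the `decay` parameter are assumptions chosen for the example. Each object is read once, folded into two running sums, and then discarded.

```python
from typing import Iterable, Iterator


def fading_mean(stream: Iterable[float], decay: float = 0.9) -> Iterator[float]:
    """Yield an exponentially weighted mean after each object, reading the stream once.

    `decay` (hypothetical parameter, 0 < decay <= 1) controls forgetting:
    values below 1 shrink the influence of older objects at every step,
    while decay == 1 reduces to the ordinary running mean.
    """
    weighted_sum = 0.0  # decayed sum of the objects seen so far
    weight = 0.0        # decayed count of the objects seen so far
    for x in stream:
        weighted_sum = decay * weighted_sum + x
        weight = decay * weight + 1.0
        yield weighted_sum / weight  # x itself is not stored
```

With `decay` below 1 the estimate follows recent objects more closely than the plain mean does, which is the behavior wanted when older samples become obsolete.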
A useful way of analyzing data streams is clustering (Aggarwal, Han, Wang, & Yu, 2004; Ailon, Jaiswal, & Monteleoni, 2009; Gama, 2010; Shindler, Wong, & Meyerson, 2011; Silva, de Faria, Barros, Hruschka, de Carvalho, & Gama). The literature on clustering is very large. Among the many algorithms available, k-Means is very popular in data mining due to its simplicity, scalability, and empirical success in many real-world applications (Jain, 2009; Wu, Kumar, Ross Quinlan, Ghosh, Yang, Motoda, McLachlan, Ng, Liu, Yu, Zhou, Steinbach, Hand, & Steinberg, 2007). Several k-Means variants have been proposed to address data streams, e.g., see Ackermann et al. (2012); Aggarwal, Han, Wang, and Yu (2003); Guha et al. (2003); O’Callaghan, Meyerson, Motwani, Mishra, and Guha (2002). Despite the successful application of these algorithms to many real-world problems, they have a major limitation: the number of clusters, k, must be defined a priori.
From an optimization perspective, clustering can be formally considered to be a specific type of NP-hard grouping problem (Falkenauer, 1998). Evolutionary algorithms are meta-heuristics that are widely believed to be able to effectively produce sub-optimal solutions to NP-hard problems in a reasonable amount of time. Under this assumption, a large number of evolutionary algorithms for solving clustering problems have been proposed in the literature (see Hruschka, Campello, Freitas, and de Carvalho (2009) for an overview). More specifically, the Fast Evolutionary Algorithm for Clustering (FEAC) (Alves, Campello, & Hruschka, 2006) has been shown to be especially efficient for automatically estimating k from data (Naldi, Campello, Hruschka, & Carvalho, 2011). However, this algorithm was not designed to address data streams. To overcome this limitation, we extend FEAC so that it can process data streams. The resulting algorithm is called FEAC-Stream. To the best of our knowledge, this method is the first evolutionary algorithm for clustering data streams that addresses the estimation of k from the data.
In data stream scenarios, ideally the clustering algorithms should be able to update the data partition in an online fashion (Silva et al., 2012). This alternative can save computational resources when clusters do not change significantly over time. In order to determine if there is a change in the data partition, it is necessary to perform a change detection test. Among the alternatives in the literature, the Page–Hinkley (PH) Test (Mouss, Mouss, Mouss, & Sefouhi, 2004) is an efficient method to detect changes in the normal behavior of a process (Gama, Žliobaitė, Bifet, Pechenizkiy, & Bouchachia, 2014). Bearing this property in mind, we propose a change detection procedure that is based on the PH Test (Mouss et al., 2004). Specifically, the PH Test was adapted to detect whether the assignment of an object to the closest cluster increases the intra-cluster distances significantly.
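To make the change detection idea concrete, the following is a minimal sketch of the standard Page–Hinkley test for detecting an increase in the mean of a monitored signal. It is not the adapted procedure proposed in this paper, whose details are not given here. The class name, the parameter names `delta` and `threshold`, and their default values are assumptions for illustration. In the clustering setting, the monitored value `x` could be, for example, the distance from an incoming object to its closest cluster.

```python
from typing import Iterable, Optional


class PageHinkley:
    """Standard Page-Hinkley test for an upward shift in the mean of a signal."""

    def __init__(self, delta: float = 0.005, threshold: float = 10.0) -> None:
        self.delta = delta          # tolerated magnitude of change (assumed default)
        self.threshold = threshold  # alarm threshold, often written lambda (assumed default)
        self.n = 0                  # number of observations seen
        self.mean = 0.0             # running mean of the observations
        self.cumulative = 0.0       # m_t: cumulative deviation from the running mean
        self.minimum = 0.0          # M_t: smallest cumulative deviation seen so far

    def update(self, x: float) -> bool:
        """Consume one observation and return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


def first_alarm(values: Iterable[float], detector: PageHinkley) -> Optional[int]:
    """Return the 0-based index of the first alarm, or None if no alarm is raised."""
    for index, x in enumerate(values):
        if detector.update(x):
            return index
    return None
```

A larger `threshold` lowers the false alarm rate at the cost of a longer detection delay, so in practice both parameters would need tuning for the signal being monitored.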
The potential of the proposed FEAC-Stream is illustrated by comparing it to the framework proposed in de Andrade Silva and Hruschka (2011), which is based on three state-of-the-art algorithms for clustering data streams, namely, Stream LSearch (O’Callaghan et al., 2002), CluStream (Aggarwal et al., 2003), and StreamKM++ (Ackermann et al., 2012), combined with two algorithms for estimating the number of clusters, which are Ordered Multiple Runs of k-Means (OMRk) (Naldi et al., 2011) and Bisecting k-Means (BkM) (Steinbach, Karypis, & Kumar, 2000).
The remainder of this paper is organized as follows. In Section 2, we briefly review related approaches. Section 3 presents the proposed evolutionary algorithm for clustering data streams (FEAC-Stream). Experimental results are reported in Section 4. Finally, Section 5 concludes the paper.
Related Work
In general, the data stream clustering problem is defined as continuously maintaining a consistently good clustering of the objects processed so far, using a small amount of memory and time (Guha et al., 2003). Ideally, the algorithms should incrementally process the data objects, rapidly detect and react to cluster evolution, provide a model representation that does not grow with the number of objects processed, and handle outliers (Silva et al., 2012). Bearing these issues in mind, several clustering
Evolutionary algorithm for clustering data streams
In this section, we present our evolutionary algorithm for clustering data streams. Evolutionary algorithms are based on the optimization of some objective function that guides the evolutionary search (Hruschka et al., 2009). The Fast Evolutionary Algorithm for Clustering (FEAC) (Alves et al., 2006) tends to perform a computationally more efficient search as compared to multiple, systematic executions of k-Means, e.g., see Naldi et al. (2011). However, the FEAC algorithm cannot handle data
Experiments
We empirically evaluated the FEAC-Stream algorithm by comparing it with two algorithms for clustering data streams, CluStream (Aggarwal et al., 2003) and StreamKM++ (Ackermann et al., 2012), each combined with two algorithms for estimating the number of clusters, OMRk (Naldi et al., 2011) and BkM (Steinbach et al., 2000) (for details, see Section 2). Thus, four algorithmic instantiations were compared: CluStream-OMRk (CLS-OMRk), StreamKM++-OMRk (SKM-OMRk), CluStream-BkM (CLS-BkM), and
Final remarks
Many clustering algorithms based on k-Means for processing data streams have been studied. Most of them assume that the number of clusters, k, is known and fixed a priori by the user. Aiming at relaxing this assumption, which is often unrealistic in practical applications, we proposed a Fast Evolutionary Algorithm for Clustering Data Streams (FEAC-Stream), which allows estimating k automatically from the data in an online fashion. Based on a detection point algorithm, FEAC-Stream monitors
Acknowledgments
The authors would like to thank CAPES, CNPq, FAPESP grant #2010/15049-7, and also acknowledge the support of the European Commission through the project MAESTRA (Grant Number ICT-750 2013-612944).
References
- et al. (2003). A framework for clustering evolving data streams. Proceedings of the VLDB.
- et al. (2004). A framework for projected clustering of high dimensional data streams. Proceedings of the VLDB.
- et al. (2006). Online clustering of parallel data streams. Data and Knowledge Engineering.
- et al. (2006). Evolving clusters in gene-expression data. Information Sciences.
- et al. (2011). Efficiency issues of evolutionary k-means. Applied Soft Computing.
- et al. (2011). Robust ensemble learning for mining noisy data streams. Decision Support Systems.
- et al. (2012). StreamKM++: A clustering algorithm for data streams. ACM Journal of Experimental Algorithmics.
- et al. (2004). Approximating extent measures of points. Journal of the ACM.
- et al. (2009). Streaming k-means approximation. Advances in Neural Information Processing Systems 22.
- et al. (2006). On similarity indices and correction for chance agreement. Journal of Classification.
- Data stream dynamic clustering supported by Markov chain isomorphisms. Intelligent Data Analysis.
- Towards a fast evolutionary algorithm for clustering. IEEE Congress on Evolutionary Computation (CEC’06).
- Cluster analysis for applications.
- Extending k-means-based algorithms for evolving data streams with variable number of clusters. Fourth International Conference on Machine Learning and Applications - ICMLA’11.
- k-means++: The advantages of careful seeding. Proceedings of the SODA’07.
- Approximate clustering via core-sets. Proceedings of the thirty-fourth annual ACM Symposium on Theory of Computing.
- MOA: Massive online analysis. Journal of Machine Learning Research.
- Scalable k-means by ranked retrieval. Proceedings of the 7th ACM International Conference on Web Search and Data Mining.
- Density-based clustering for real-time stream data. KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- A collaborative divide-and-conquer k-means clustering algorithm for processing large data. Proceedings of the 11th ACM Conference on Computing Frontiers.
- Introduction to evolutionary computing (Natural Computing Series).
- Cluster analysis.
- Genetic algorithms and grouping problems.
- Knowledge discovery from data streams.
- A survey on concept drift adaptation. ACM Computing Surveys.
- Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering.
- 1 Present Address: Campus of Ponta Porã, The University of Mato Grosso do Sul (UFMS-CPPP)
- 2 Present Address: Department of Computer Science, The University of São Paulo (USP) at São Carlos, São Paulo, Brazil
- 3 Present Address: Laboratory of Artificial Intelligence and Decision Support, University of Porto (UP), Porto, Portugal