A Support System for Clustering Data Streams with a Variable Number of Clusters

Authors:

Jonathan de Andrade Silva,

Eduardo Raul Hruschka

ACM Transactions on Autonomous and Adaptive Systems (TAAS), Volume 11, Issue 2

Article No.: 11, Pages 1 - 26

https://doi.org/10.1145/2932704

Published: 25 July 2016

Abstract

Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. To relax this assumption, which is often unrealistic in practical applications, we propose a support system that not only estimates the number of clusters automatically from data but also monitors the data-stream clustering process. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++-BkM. These algorithms combine three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters, namely, Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms using both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms based on OMRk yield the best data partitions, while the algorithms based on BkM are more computationally efficient. Additionally, StreamKM++-OMRk and Stream LSearch-IBkM provide the best tradeoff between accuracy and efficiency.
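The idea behind Ordered Multiple Runs of k-Means can be illustrated with a minimal batch sketch: run k-means several times for each k in an ordered range and keep the partition that scores best on a relative validity index. The abstract does not specify the index, the range of k, or the number of runs, so the simplified silhouette used here, the function names (`kmeans`, `simplified_silhouette`, `omrk`), and all default parameters are assumptions made for illustration. This is not the authors' implementation, and it does not cover the streaming components (Stream LSearch, CluStream, StreamKM++).

```python
import math
import random


def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, labels)."""
    centroids = rng.sample(points, k)  # k distinct input points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


def simplified_silhouette(points, centroids, labels):
    """Mean of (b - a) / max(a, b), with a and b measured to centroids (needs k >= 2)."""
    total = 0.0
    for p, lab in zip(points, labels):
        a = math.dist(p, centroids[lab])  # distance to own centroid
        b = min(math.dist(p, c) for j, c in enumerate(centroids) if j != lab)
        total += 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return total / len(points)


def omrk(points, k_min=2, k_max=6, runs=10, seed=0):
    """Assumed OMRk sketch: best-scoring partition over ordered k and repeated runs."""
    if k_min < 2:
        raise ValueError("the silhouette-based index needs at least two clusters")
    rng = random.Random(seed)
    best = None
    for k in range(k_min, k_max + 1):  # ordered values of k
        for _ in range(runs):  # multiple random initialisations per k
            centroids, labels = kmeans(points, k, rng)
            score = simplified_silhouette(points, centroids, labels)
            if best is None or score > best[0]:
                best = (score, k, centroids, labels)
    return best  # (score, estimated k, centroids, labels)
```

Calling `omrk(points)` on a list of equal-length float tuples returns the best score, the estimated k, and the corresponding centroids and labels. Every candidate partition costs a full k-means run, which is consistent with the abstract's observation that the OMRk-based algorithms are less computationally efficient than the BkM-based ones.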


Index Terms

Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning > Cluster analysis

Published In

ACM Transactions on Autonomous and Adaptive Systems Volume 11, Issue 2

Special Section on Best Papers from SASO 2014 and Regular Articles

July 2016

267 pages

ISSN:1556-4665

EISSN:1556-4703

DOI:10.1145/2952298

Editors:
Manish Parashar, Rutgers University, USA
Franco Zambonelli, University of Modena e Reggio Emilia, Italy

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 July 2016

Accepted: 01 June 2014

Revised: 01 February 2014

Received: 01 September 2013

Published in TAAS Volume 11, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

Fundação de Amparo à Pesquisa do Estado de São Paulo
