A Support System for Clustering Data Streams with a Variable Number of Clusters

Published: 25 July 2016

Abstract

Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. To relax this assumption, which is often unrealistic in practical applications, we propose a support system that not only estimates the number of clusters automatically from the data but also allows the data-stream clustering process to be monitored. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++-BkM. These algorithms combine three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters, namely, Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms using both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms based on OMRk yield the best data partitions, while the algorithms based on BkM are more computationally efficient. Additionally, StreamKM++-OMRk and Stream LSearch-IBkM provide the best tradeoff between accuracy and efficiency.
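As a rough illustration of the OMRk idea named in the abstract (run k-Means several times for each candidate k in increasing order, and keep the partition that a validity index rates best), the following is a minimal sketch and not the prototype's implementation. It makes several assumptions that the abstract does not state: it works on a single in-memory chunk of points rather than on stream summaries, it uses a simplified silhouette (centroid-based) as the validity index, and all function names, parameters, and the toy data generator are hypothetical.

```python
import math
import random


def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd-style k-means; initial centroids are k sampled points."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels


def simplified_silhouette(points, centroids, labels):
    """Mean of (b - a) / max(a, b), where a is the distance to the point's own
    centroid and b the distance to the nearest other non-empty cluster's centroid."""
    used = sorted(set(labels))
    if len(used) < 2:
        return -1.0  # index is undefined for a single cluster
    total = 0.0
    for p, lab in zip(points, labels):
        a = math.dist(p, centroids[lab])
        b = min(math.dist(p, centroids[j]) for j in used if j != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def omrk(points, k_min=2, k_max=6, n_runs=10, seed=0):
    """Assumed OMRk-style search: for k = k_min..k_max, run k-means n_runs
    times and keep the partition with the highest validity score.
    Returns (score, number of non-empty clusters, centroids, labels)."""
    rng = random.Random(seed)
    best = None
    for k in range(k_min, k_max + 1):
        for _ in range(n_runs):
            centroids, labels = kmeans(points, k, rng)
            score = simplified_silhouette(points, centroids, labels)
            if best is None or score > best[0]:
                best = (score, len(set(labels)), centroids, labels)
    return best


def toy_chunk(seed=1, per_blob=30):
    """Hypothetical test data: three well-separated 2-D blobs."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
            for cx, cy in centers for _ in range(per_blob)]
```

Calling `omrk(toy_chunk())` returns the best-scoring partition together with its estimated number of clusters. The cost grows with `(k_max - k_min + 1) * n_runs` k-means runs, which fits the abstract's observation that the OMRk-based variants favor partition quality over computational efficiency.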


    Published In

    ACM Transactions on Autonomous and Adaptive Systems, Volume 11, Issue 2
    Special Section on Best Papers from SASO 2014 and Regular Articles
    July 2016
    267 pages
    ISSN:1556-4665
    EISSN:1556-4703
    DOI:10.1145/2952298

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 July 2016
    Accepted: 01 June 2014
    Revised: 01 February 2014
    Received: 01 September 2013
    Published in TAAS Volume 11, Issue 2

    Author Tags

    1. Clustering
    2. data stream
    3. online clustering

    Qualifiers

    • Research-article
    • Research
    • Refereed
