An Efficient Set-Based Algorithm for Variable Streaming Clustering

Campos, Isaac; León, Jared; Campos, Fernando

doi:10.1007/978-3-030-46140-9_9

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1070)

Included in the following conference series:

Annual International Symposium on Information Management and Big Data

Abstract

This paper proposes SetClust, a new algorithm for data stream clustering. The data stream clustering model clusters data as it arrives, which is useful in many practical applications. Unlike other streaming clustering algorithms, SetClust is designed for cases in which no a priori information about the number of clusters is available; its second objective is to discover the best number of clusters needed to represent the points. The algorithm is based on structures for disjoint-set operations: a cluster is defined as the union of multiple well-formed sets, which allows the algorithm to recognize non-spherical patterns even in high-dimensional points. This yields a running time that is quadratic in the number of formed sets. The algorithm itself can be interpreted as an efficient data structure for streaming clustering. Experimental results show that the proposed algorithm achieves high clustering quality on well-spread data points.
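The abstract does not reproduce the data structure itself, so the following is only a minimal sketch of a generic disjoint-set (union-find) structure of the kind the abstract refers to, not the authors' implementation. The class name `DisjointSet` and its methods are hypothetical names chosen for illustration. Each "set" is an integer id, and a cluster is whatever group of set ids shares the same root.

```python
class DisjointSet:
    """Generic union-find with path compression and union by rank.

    Illustrative only: the names and layout are assumptions, not SetClust code.
    """

    def __init__(self):
        self.parent = []  # parent[i] == i means i is the root of its group
        self.rank = []    # upper bound on the height of the tree rooted at i

    def make_set(self):
        """Create a new singleton set and return its id."""
        idx = len(self.parent)
        self.parent.append(idx)
        self.rank.append(0)
        return idx

    def find(self, x):
        """Return the root of x's group, compressing the path on the way."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Merge the groups containing a and b; return the surviving root."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra  # attach the shallower tree under the deeper one
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra


# Usage: three formed sets, two of which are joined into one cluster.
ds = DisjointSet()
a, b, c = ds.make_set(), ds.make_set(), ds.make_set()
ds.union(a, b)
same_cluster = ds.find(a) == ds.find(b)
separate_cluster = ds.find(a) != ds.find(c)
```

With both heuristics, a sequence of `find` and `union` operations runs in near-constant amortized time per operation, which is why joining sets into clusters can stay cheap as the stream grows.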

Notes

1. These operations should be done online, without traversing all the elements.
2. The new mean, standard deviation, and the rest of the information must be calculated in constant time.
3. If we take advantage of the fact that only the last formed set can cause instability, we can achieve an overall worst-case running time of $O(rd)$.
4. As in the previous case, if we take advantage of the fact that only the last formed set can cause instability, we can perform this operation in worst-case running time $O(\alpha(r)rd)$, where $\alpha$ is the inverse Ackermann function. For any practical input size, $\alpha$ is never greater than 4.
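The notes require that a set's mean and standard deviation be updated in constant time, without revisiting earlier points. One standard way to do this, shown here as a hedged sketch and not necessarily the authors' method, is Welford's online update together with the pairwise merge formula for combining two summaries. The class `RunningStats` and its fields are hypothetical names. The sketch tracks a single coordinate; applying it to each of the $d$ coordinates would cost $O(d)$ per point.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-time summary of one coordinate (illustrative assumption)."""

    n: int = 0         # number of points seen
    mean: float = 0.0  # running mean
    m2: float = 0.0    # sum of squared deviations from the running mean

    def add(self, x: float) -> None:
        """Welford's update: fold one new value in O(1)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two summaries in O(1) without touching the raw points."""
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def std(self) -> float:
        """Population standard deviation (0.0 for an empty summary)."""
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0


# Usage: stream values into one summary, one at a time.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(value)
```

The `merge` method matters when two sets are joined: their summaries can be combined directly, so the cost does not depend on how many points each set already holds.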

Author information

Authors and Affiliations

Department of Informatics, Universidad Nacional de San Antonio Abad del Cusco, Cusco, Peru
Isaac Campos, Jared León & Fernando Campos

Corresponding author

Correspondence to Isaac Campos.

Editor information

Editors and Affiliations

Stanford University, Stanford, CA, USA
Juan Antonio Lossio-Ventura
University of A Coruña, A Coruña, Spain
Nelly Condori-Fernandez
Visibilia, São Paulo, Brazil
Jorge Carlos Valverde-Rebaza

About this paper

Cite this paper

Campos, I., León, J., Campos, F. (2020). An Efficient Set-Based Algorithm for Variable Streaming Clustering. In: Lossio-Ventura, J.A., Condori-Fernandez, N., Valverde-Rebaza, J.C. (eds) Information Management and Big Data. SIMBig 2019. Communications in Computer and Information Science, vol 1070. Springer, Cham. https://doi.org/10.1007/978-3-030-46140-9_9

DOI: https://doi.org/10.1007/978-3-030-46140-9_9
Published: 23 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46139-3
Online ISBN: 978-3-030-46140-9
eBook Packages: Computer Science; Computer Science (R0)
