Histogram-based clustering of multiple data streams

Balzanella, Antonio; Verde, Rosanna

doi:10.1007/s10115-019-01350-5

Histogram-based clustering of multiple data streams

Regular Paper
Published: 19 March 2019

Volume 62, pages 203–238, (2020)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

907 Accesses
3 Citations
Explore all metrics

Abstract

This paper introduces a strategy for clustering online multiple data streams. We assume that several sources are used for recording, over time, data about some physical phenomena. Each source provides repeated measurements at a very high frequency so that it is not possible to store the whole amount of data into some easy-to-access media, but data are available only in batches. Our aim is to discover a partition of the sources (e.g. sensors) into homogeneous clusters, analysing the incoming streams of data. The proposed strategy is based on processing the incoming data batches independently, through an initial summarization of the data batches by histograms and, then, by means of a local clustering performed on the histograms which provides a further data summarization. To keep track of the data proximities among the data streams over time, we use local clustering outputs for updating a proximity matrix. The final partitioning of the streams is obtained by a clustering based on such proximity matrix. Through an application on real and simulated data, we show the effectiveness of our strategy in finding homogeneous groups of sources of data streams.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Data Stream Summarization by Histograms Clustering

Clustering Data Streams by On-Line Proximity Updating

An Improved Visual Assessment with Data-Dependent Kernel for Stream Clustering

Notes

http://archive.ics.uci.edu/ml/.

References

Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) Streamkm++: a clustering algorithm for data streams. J Exp Algorithmics 17:2.4:2.1–2.4:2.30. https://doi.org/10.1145/2133803.2184450
Article MathSciNet MATH Google Scholar
Agrawal R, Faloutsos C, Swami A (1993) Efficient similarity search in sequence databases. In: Proceedings of the 4th conference on foundations of data organization and algorithms
Alseghayer R, Petrov D, Chrysanthis PK, Sharaf MA, Labrinidis A (2017) Detection of highly correlated live data streams. BIRTE 3(1–3):8
Google Scholar
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, vol 29. VLDB Endowment, VLDB’03, pp 81–92
Agueh M, Carlier G (2011) Barycenters in the Wasserstein space. Soc Ind Appl Math 43:904–924
MathSciNet MATH Google Scholar
Arroyo J, Maté C (2009) Forecasting histogram time series with k-nearest neighbours methods. Int J Forecast 25(1):192–207. https://doi.org/10.1016/j.ijforecast.2008.07.003
Article Google Scholar
Berckmoes B, Lowen R, Van Casteren J (2011) Distances on probability measures and random variables. J Math Anal Appl 374(2):412–428
Article MathSciNet Google Scholar
Beringer J, Hüllermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng 58(2):180–204. https://doi.org/10.1016/j.datak.2005.05.009
Article Google Scholar
Billard L, Diday E (2003) From the statistics of data to the statistic of knowledge: symbolic data analysis. JASA J Am Stat Assoc 98(462):470–487
Article MathSciNet Google Scholar
Caló DG, Montanari A, Viroli C (2014) A hierarchical modeling approach for clustering probability density functions. Comput Stat Data Anal 71:79–91
Article MathSciNet Google Scholar
Cao F, Estert M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise, pp 328–339. https://doi.org/10.1137/1.9781611972764.29
Chan K, Fu AW-C (1999) Efficient time series matching by wavelets. In: Proceedings 15th international conference on data engineering Sydney, NSW, Australia, pp 126–133. https://doi.org/10.1109/ICDE.1999.754915
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD’07, pp 133–142. https://doi.org/10.1145/1281192.1281210
Cormode G, Garofalakis M, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Found Trends Databases 4(1–3):1–294. https://doi.org/10.1561/1900000004
Article MATH Google Scholar
Dai BR, Huang JW, Yeh MY, Chen MS (2006) Adaptive clustering for multiple evolving streams. IEEE Trans Knowl Data Eng 18(9):1166–1180. https://doi.org/10.1109/TKDE.2006.137
Article Google Scholar
Diday E, Noirhomme-Fraiture M (eds) (2008) Symbolic data analysis and the SODAS software. Wiley, Hoboken
MATH Google Scholar
Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh EJ (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2):1542–1552
Google Scholar
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining, AAAI Press, KDD’96, pp 226–231
Fränti P (2018) Efficiency of random swap clustering. J Big Data 5(13):1–29
MathSciNet Google Scholar
Fränti P, Rezaei M, Zhao Q (2014) Centroid index: cluster level similarity measure. Pattern Recognit 47(9):3034–3045
Article Google Scholar
Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12):4743–4759
Article Google Scholar
Ganguly AR, Gama J, Omitaomu OA, Gaber M, Vatsavai RR (eds) (2008) Knowledge discovery from sensor data. CRC Press, Boca Raton
Google Scholar
Garofalakis M, Gehrke J, Rastogi R (eds) (2016) Data stream management. Data-centric systems and applications. Springer, Berlin
Google Scholar
Ghesmoune M, Azzag H, Lebbah M (2014) G-stream: growing neural gas over data stream. Springer, Cham, pp 207–214
Google Scholar
Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Int Stat Rev 7(3):419–435
Article Google Scholar
Gong S, Zhang Y, Yu G (2017) Clustering stream data by exploring the evolution of density mountain. Proc VLDB Endow 11(4):393–405. https://doi.org/10.1145/3186728.3164136
Article Google Scholar
González-Rivera G, Arroyo J (2012) Time series modeling of histogram-valued data: the daily histogram time series of S&P500 intradaily returns. Int J Forecast 28(1):20–33. https://doi.org/10.1016/j.ijforecast.2011.02.007
Article Google Scholar
Henderson K, Gallagher B, Eliassi-Rad T (2015) EP-MEANS: an efficient nonparametric clustering of empirical probability distributions. In: SAC ’15 proceedings of the 30th annual ACM symposium on applied computing, pp 893–900
Huang D, Zheng WS, Lai JH, Wang CD (2013) Svstream: a support vector-based algorithm for clustering data streams. IEEE Trans Knowl Data Eng 25:1410–1424. https://doi.org/10.1109/TKDE.2011.263
Article Google Scholar
Irpino A, Iacono M (2011) Improving the MHIST-p algorithm for multivariate histograms of continuous data. In: Classification and multivariate analysis for complex data structures. Springer, pp 155–164. ISBN: 978-3-642-13311-4
Irpino A, Verde R, De Carvalho FAT (2014) Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Syst Appl 41(7):3351–3366
Article Google Scholar
Irpino A, Verde R (2015) Basic statistics for distributional symbolic variables: a new metric-based approach. Adv Data Anal Classif 9(2):143–175. https://doi.org/10.1007/s11634-014-0176-4
Article MathSciNet MATH Google Scholar
Jiang B, Pei J, Tao Y, Lin X (2013) Clustering uncertain data based on probability distribution similarity. IEEE Trans Knowl Data Eng 25:751–763
Article Google Scholar
Kärkkäinen I, Fränti P (2007) Gradual model generator for single-pass clustering. Pattern Recognit 40(3):784–795. https://doi.org/10.1016/j.patcog.2006.06.023
Article MATH Google Scholar
Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001) Dimensionality reduction for fast similarity search in large time series databases. Knowl Inf Syst 3:263. https://doi.org/10.1007/PL00011669
Article MATH Google Scholar
Laurinec P, Luck M (2018) Interpretable multiple data streams clustering with clipped streams representation for the improvement of electricity consumption forecasting. Data Min Knowl Disc. https://doi.org/10.1007/s10618-018-0598-2
Article Google Scholar
MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley symposium on mathematical statistics and probability. 1. University of California Press, pp 281–297
Mallows CL (1972) A note on asymptotic joint normality. Ann Math Stat 43(2):508–515
Article MathSciNet Google Scholar
Panaretos VM, Zemel Y (2019) Statistical aspects of Wasserstein distances. Ann Rev Stat Appl 6:1
Article MathSciNet Google Scholar
Rezaei M, Fränti P (2016) Set matching measures for external cluster validity. IEEE Trans Knowl Data Eng 28(8):2173–2186. https://doi.org/10.1109/TKDE.2016.2551240
Article Google Scholar
Rodrigues PP, Gama J, Pedroso J (2008) Hierarchical clustering of time-series data streams. IEEE Trans Knowl Data Eng 20(5):615–627. https://doi.org/10.1109/TKDE.2007.190727
Article Google Scholar
Sato M, Ishii S (2000) On-line EM algorithm for the normalized Gaussian network. Neural Comput 12(2):407–432
Article Google Scholar
Sakurai Y, Papadimitriou S, Faloutsos C (2005) BRAID: stream mining through group lag correlations. ACM SIGMOD’05, pp 599–610
Sebastião R, Gama J (2007) Change detection in learning histograms from data streams. Prog Artif Intell 4874:112–123
Article Google Scholar
Shafer I, Ren K, Boddeti VN, Abe Y, Ganger GR, Faloutsos C (2012) RainMon: an integrated approach to mining bursty timeseries monitoring data. ACM KDD’12, pp 1158–1166
Silva JA, Faria ER, Barros RC, Hruschka ER, Carvalho ACPLFd, Ja Gama (2013) Data stream clustering: a survey. ACM Comput Surv 46(1):13:1–13:31. https://doi.org/10.1145/2522968.2522981
Article MATH Google Scholar
Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering. In: Proceedings of the 3rd international conference on advanced data mining and applications, Springer, Berlin, Heidelberg, ADMA’07, pp 605–615
Chapter Google Scholar
Verde R, Irpino A (2007) Dynamic clustering of histogram data: using the right metric. Springer, Berlin, pp 123–134
MATH Google Scholar
Villani C (2008) Optimal transport: old and new. Springer, Berlin
MATH Google Scholar
Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. SIGMOD Rec 25(2):103–114
Article Google Scholar
Zhu Y, Shasha D (2002) StatStream: statistical monitoring of thousands of data streams in real time. In: Proceedings of the 28th international conference on very large data bases. VLDB Endowment, p 358
Zhao J, Ishikawa Y, Xiao C, Sugiura K (2018) Histogram construction for difference analysis of spatio-temporal data on array DBMS. Databases Theory Appl. https://doi.org/10.1007/978-3-319-92013-94
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Mathematics and Physics, Università della Campania L. Vanvitelli, viale Lincoln, 5, Caserta, Italy
Antonio Balzanella & Rosanna Verde

Authors

Antonio Balzanella
View author publications
You can also search for this author in PubMed Google Scholar
Rosanna Verde
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Antonio Balzanella.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Balzanella, A., Verde, R. Histogram-based clustering of multiple data streams. Knowl Inf Syst 62, 203–238 (2020). https://doi.org/10.1007/s10115-019-01350-5

Download citation

Received: 09 May 2018
Revised: 04 March 2019
Accepted: 08 March 2019
Published: 19 March 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s10115-019-01350-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Histogram-based clustering of multiple data streams

Abstract

Access this article

Similar content being viewed by others

Data Stream Summarization by Histograms Clustering

Clustering Data Streams by On-Line Proximity Updating

An Improved Visual Assessment with Data-Dependent Kernel for Stream Clustering

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Histogram-based clustering of multiple data streams

Abstract

Access this article

Similar content being viewed by others

Data Stream Summarization by Histograms Clustering

Clustering Data Streams by On-Line Proximity Updating

An Improved Visual Assessment with Data-Dependent Kernel for Stream Clustering

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation