Skip to main content
Log in

Histogram-based clustering of multiple data streams

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

This paper introduces a strategy for clustering online multiple data streams. We assume that several sources are used for recording, over time, data about some physical phenomena. Each source provides repeated measurements at a very high frequency so that it is not possible to store the whole amount of data into some easy-to-access media, but data are available only in batches. Our aim is to discover a partition of the sources (e.g. sensors) into homogeneous clusters, analysing the incoming streams of data. The proposed strategy is based on processing the incoming data batches independently, through an initial summarization of the data batches by histograms and, then, by means of a local clustering performed on the histograms which provides a further data summarization. To keep track of the data proximities among the data streams over time, we use local clustering outputs for updating a proximity matrix. The final partitioning of the streams is obtained by a clustering based on such proximity matrix. Through an application on real and simulated data, we show the effectiveness of our strategy in finding homogeneous groups of sources of data streams.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Similar content being viewed by others

Notes

  1. http://archive.ics.uci.edu/ml/.

References

  1. Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) Streamkm++: a clustering algorithm for data streams. J Exp Algorithmics 17:2.4:2.1–2.4:2.30. https://doi.org/10.1145/2133803.2184450

    Article  MathSciNet  MATH  Google Scholar 

  2. Agrawal R, Faloutsos C, Swami A (1993) Efficient similarity search in sequence databases. In: Proceedings of the 4th conference on foundations of data organization and algorithms

  3. Alseghayer R, Petrov D, Chrysanthis PK, Sharaf MA, Labrinidis A (2017) Detection of highly correlated live data streams. BIRTE 3(1–3):8

    Google Scholar 

  4. Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, vol 29. VLDB Endowment, VLDB’03, pp 81–92

  5. Agueh M, Carlier G (2011) Barycenters in the Wasserstein space. Soc Ind Appl Math 43:904–924

    MathSciNet  MATH  Google Scholar 

  6. Arroyo J, Maté C (2009) Forecasting histogram time series with k-nearest neighbours methods. Int J Forecast 25(1):192–207. https://doi.org/10.1016/j.ijforecast.2008.07.003

    Article  Google Scholar 

  7. Berckmoes B, Lowen R, Van Casteren J (2011) Distances on probability measures and random variables. J Math Anal Appl 374(2):412–428

    Article  MathSciNet  Google Scholar 

  8. Beringer J, Hüllermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng 58(2):180–204. https://doi.org/10.1016/j.datak.2005.05.009

    Article  Google Scholar 

  9. Billard L, Diday E (2003) From the statistics of data to the statistic of knowledge: symbolic data analysis. JASA J Am Stat Assoc 98(462):470–487

    Article  MathSciNet  Google Scholar 

  10. Caló DG, Montanari A, Viroli C (2014) A hierarchical modeling approach for clustering probability density functions. Comput Stat Data Anal 71:79–91

    Article  MathSciNet  Google Scholar 

  11. Cao F, Estert M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise, pp 328–339. https://doi.org/10.1137/1.9781611972764.29

  12. Chan K, Fu AW-C (1999) Efficient time series matching by wavelets. In: Proceedings 15th international conference on data engineering Sydney, NSW, Australia, pp 126–133. https://doi.org/10.1109/ICDE.1999.754915

  13. Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD’07, pp 133–142. https://doi.org/10.1145/1281192.1281210

  14. Cormode G, Garofalakis M, Haas PJ, Jermaine C (2012) Synopses for massive data: samples, histograms, wavelets, sketches. Found Trends Databases 4(1–3):1–294. https://doi.org/10.1561/1900000004

    Article  MATH  Google Scholar 

  15. Dai BR, Huang JW, Yeh MY, Chen MS (2006) Adaptive clustering for multiple evolving streams. IEEE Trans Knowl Data Eng 18(9):1166–1180. https://doi.org/10.1109/TKDE.2006.137

    Article  Google Scholar 

  16. Diday E, Noirhomme-Fraiture M (eds) (2008) Symbolic data analysis and the SODAS software. Wiley, Hoboken

    MATH  Google Scholar 

  17. Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh EJ (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2):1542–1552

    Google Scholar 

  18. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters a density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining, AAAI Press, KDD’96, pp 226–231

  19. Fränti P (2018) Efficiency of random swap clustering. J Big Data 5(13):1–29

    MathSciNet  Google Scholar 

  20. Fränti P, Rezaei M, Zhao Q (2014) Centroid index: cluster level similarity measure. Pattern Recognit 47(9):3034–3045

    Article  Google Scholar 

  21. Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12):4743–4759

    Article  Google Scholar 

  22. Ganguly AR, Gama J, Omitaomu OA, Gaber M, Vatsavai RR (eds) (2008) Knowledge discovery from sensor data. CRC Press, Boca Raton

    Google Scholar 

  23. Garofalakis M, Gehrke J, Rastogi R (eds) (2016) Data stream management. Data-centric systems and applications. Springer, Berlin

    Google Scholar 

  24. Ghesmoune M, Azzag H, Lebbah M (2014) G-stream: growing neural gas over data stream. Springer, Cham, pp 207–214

    Google Scholar 

  25. Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. Int Stat Rev 7(3):419–435

    Article  Google Scholar 

  26. Gong S, Zhang Y, Yu G (2017) Clustering stream data by exploring the evolution of density mountain. Proc VLDB Endow 11(4):393–405. https://doi.org/10.1145/3186728.3164136

    Article  Google Scholar 

  27. González-Rivera G, Arroyo J (2012) Time series modeling of histogram-valued data: the daily histogram time series of S&P500 intradaily returns. Int J Forecast 28(1):20–33. https://doi.org/10.1016/j.ijforecast.2011.02.007

    Article  Google Scholar 

  28. Henderson K, Gallagher B, Eliassi-Rad T (2015) EP-MEANS: an efficient nonparametric clustering of empirical probability distributions. In: SAC ’15 proceedings of the 30th annual ACM symposium on applied computing, pp 893–900

  29. Huang D, Zheng WS, Lai JH, Wang CD (2013) Svstream: a support vector-based algorithm for clustering data streams. IEEE Trans Knowl Data Eng 25:1410–1424. https://doi.org/10.1109/TKDE.2011.263

    Article  Google Scholar 

  30. Irpino A, Iacono M (2011) Improving the MHIST-p algorithm for multivariate histograms of continuous data. In: Classification and multivariate analysis for complex data structures. Springer, pp 155–164. ISBN: 978-3-642-13311-4

  31. Irpino A, Verde R, De Carvalho FAT (2014) Dynamic clustering of histogram data based on adaptive squared Wasserstein distances. Expert Syst Appl 41(7):3351–3366

    Article  Google Scholar 

  32. Irpino A, Verde R (2015) Basic statistics for distributional symbolic variables: a new metric-based approach. Adv Data Anal Classif 9(2):143–175. https://doi.org/10.1007/s11634-014-0176-4

    Article  MathSciNet  MATH  Google Scholar 

  33. Jiang B, Pei J, Tao Y, Lin X (2013) Clustering uncertain data based on probability distribution similarity. IEEE Trans Knowl Data Eng 25:751–763

    Article  Google Scholar 

  34. Kärkkäinen I, Fränti P (2007) Gradual model generator for single-pass clustering. Pattern Recognit 40(3):784–795. https://doi.org/10.1016/j.patcog.2006.06.023

    Article  MATH  Google Scholar 

  35. Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001) Dimensionality reduction for fast similarity search in large time series databases. Knowl Inf Syst 3:263. https://doi.org/10.1007/PL00011669

    Article  MATH  Google Scholar 

  36. Laurinec P, Luck M (2018) Interpretable multiple data streams clustering with clipped streams representation for the improvement of electricity consumption forecasting. Data Min Knowl Disc. https://doi.org/10.1007/s10618-018-0598-2

    Article  Google Scholar 

  37. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley symposium on mathematical statistics and probability. 1. University of California Press, pp 281–297

  38. Mallows CL (1972) A note on asymptotic joint normality. Ann Math Stat 43(2):508–515

    Article  MathSciNet  Google Scholar 

  39. Panaretos VM, Zemel Y (2019) Statistical aspects of Wasserstein distances. Ann Rev Stat Appl 6:1

    Article  MathSciNet  Google Scholar 

  40. Rezaei M, Fränti P (2016) Set matching measures for external cluster validity. IEEE Trans Knowl Data Eng 28(8):2173–2186. https://doi.org/10.1109/TKDE.2016.2551240

    Article  Google Scholar 

  41. Rodrigues PP, Gama J, Pedroso J (2008) Hierarchical clustering of time-series data streams. IEEE Trans Knowl Data Eng 20(5):615–627. https://doi.org/10.1109/TKDE.2007.190727

    Article  Google Scholar 

  42. Sato M, Ishii S (2000) On-line EM algorithm for the normalized Gaussian network. Neural Comput 12(2):407–432

    Article  Google Scholar 

  43. Sakurai Y, Papadimitriou S, Faloutsos C (2005) BRAID: stream mining through group lag correlations. ACM SIGMOD’05, pp 599–610

  44. Sebastião R, Gama J (2007) Change detection in learning histograms from data streams. Prog Artif Intell 4874:112–123

    Article  Google Scholar 

  45. Shafer I, Ren K, Boddeti VN, Abe Y, Ganger GR, Faloutsos C (2012) RainMon: an integrated approach to mining bursty timeseries monitoring data. ACM KDD’12, pp 1158–1166

  46. Silva JA, Faria ER, Barros RC, Hruschka ER, Carvalho ACPLFd, Ja Gama (2013) Data stream clustering: a survey. ACM Comput Surv 46(1):13:1–13:31. https://doi.org/10.1145/2522968.2522981

    Article  MATH  Google Scholar 

  47. Udommanetanakit K, Rakthanmanon T, Waiyamai K (2007) E-stream: evolution-based technique for stream clustering. In: Proceedings of the 3rd international conference on advanced data mining and applications, Springer, Berlin, Heidelberg, ADMA’07, pp 605–615

    Chapter  Google Scholar 

  48. Verde R, Irpino A (2007) Dynamic clustering of histogram data: using the right metric. Springer, Berlin, pp 123–134

    MATH  Google Scholar 

  49. Villani C (2008) Optimal transport: old and new. Springer, Berlin

    MATH  Google Scholar 

  50. Zhang T, Ramakrishnan R, Livny M (1996) Birch: an efficient data clustering method for very large databases. SIGMOD Rec 25(2):103–114

    Article  Google Scholar 

  51. Zhu Y, Shasha D (2002) StatStream: statistical monitoring of thousands of data streams in real time. In: Proceedings of the 28th international conference on very large data bases. VLDB Endowment, p 358

  52. Zhao J, Ishikawa Y, Xiao C, Sugiura K (2018) Histogram construction for difference analysis of spatio-temporal data on array DBMS. Databases Theory Appl. https://doi.org/10.1007/978-3-319-92013-94

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Antonio Balzanella.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Balzanella, A., Verde, R. Histogram-based clustering of multiple data streams. Knowl Inf Syst 62, 203–238 (2020). https://doi.org/10.1007/s10115-019-01350-5

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-019-01350-5

Keywords

Navigation