Abstract
This paper presents ClipStream, a new interpretable approach to clustering multiple data streams in a smart grid, intended to improve the forecasting accuracy of aggregated electricity consumption and to support grid analysis. Consumers' time series streams are compressed and represented by interpretable features extracted from the clipped representation. The proposed representation has low computational complexity and is incremental in the sense of the windowing method. Outlier consumers can be detected simply and quickly from the extracted features. The clustering phase consists of three parts: clustering of the non-outlier representations, aggregation of consumption within clusters, and an unsupervised change detection procedure on windows of the aggregated time series streams. The behaviour of ClipStream and its improvement in forecasting accuracy were evaluated on four real datasets containing varied patterns of electricity consumption. The clustering accuracy of the proposed feature extraction method from the clipped representation was evaluated on 85 time series datasets from a large public repository. The experimental results demonstrated the stability of ClipStream in improving forecasting accuracy and showed the suitability of the proposed representation in many tested applications.
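The idea of extracting interpretable features from a clipped representation can be illustrated with a minimal sketch. It assumes that clipping means thresholding each window at its own mean (as in the bit-level representation literature) and that features are summaries of the runs of ones and zeros. The function names, the four features, and the sample `daily_load` values are hypothetical and chosen only for illustration; the abstract does not specify the actual feature set used by ClipStream.

```python
from itertools import groupby
from statistics import mean


def clip(window):
    """Clip a window to bits: 1 where the value exceeds the window mean, else 0.

    Thresholding at the window mean is an assumption of this sketch.
    """
    mu = mean(window)
    return [1 if x > mu else 0 for x in window]


def run_lengths(bits):
    """Run-length encode a bit sequence as (bit, length) pairs."""
    return [(bit, sum(1 for _ in group)) for bit, group in groupby(bits)]


def clipped_features(window):
    """Summarise one window by illustrative (hypothetical) run-based features."""
    bits = clip(window)
    runs = run_lengths(bits)
    ones = [n for b, n in runs if b == 1]
    zeros = [n for b, n in runs if b == 0]
    return {
        "longest_run_above": max(ones, default=0),   # longest stretch above the mean
        "longest_run_below": max(zeros, default=0),  # longest stretch at or below the mean
        "share_above": sum(ones) / len(bits),        # fraction of the window above the mean
        "crossings": len(runs) - 1,                  # number of switches between 0 and 1
    }


# Hypothetical consumption values for one window of a single consumer.
daily_load = [0.2, 0.2, 0.3, 0.9, 1.1, 1.0, 0.4, 0.3, 1.2, 1.3, 0.5, 0.2]
features = clipped_features(daily_load)
```

Each feature has a direct reading in terms of consumption behaviour, for example how long a consumer stays above their typical load. Because features are computed per window, a new window can be processed without revisiting earlier ones, which is the sense in which such a representation is incremental.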
Acknowledgements
This work was partially supported by the Slovak Research and Development Agency, Grant Nos. APVV-16-0484 and APVV-16-0213, and the Scientific Grant Agency of The Slovak Republic, Grant No. VG 1/0458/18.
Additional information
Responsible editor: Jesse Davis, Elisa Fromont, Derek Greene, Bjorn Bringmann.
About this article
Cite this article
Laurinec, P., Lucká, M. Interpretable multiple data streams clustering with clipped streams representation for the improvement of electricity consumption forecasting. Data Min Knowl Disc 33, 413–445 (2019). https://doi.org/10.1007/s10618-018-0598-2
DOI: https://doi.org/10.1007/s10618-018-0598-2