Distributed evidential clustering toward time series with big data issue
Introduction
A time series is a sequence of continuous elements arranged in chronological order (Guijo-Rubio, Durán-Rosal, Gutiérrez, Troncoso, & Hervás-Martínez, 2020). Time series analysis uses statistical techniques to model the data mathematically and uncover its patterns (Bagnall, Lines, Bostrom, Large, & Keogh, 2017). Its applications span a variety of scientific fields, such as anomaly detection (Choi, Lim, Choi, & Kim, 2020), signal analysis (Manolakis, Bosowski, & Ingle, 2019), environmental forecasting (Wen, Yang, Jiang, Song, & Wang, 2020), and electricity load profiling (Liang & Ma, 2020). As one of the most important unsupervised tasks, clustering also plays a key role in time series analysis, aiming to segment data objects into groups (called clusters) with homogeneous characteristics (Gong et al., 2020, Gong et al., 2021).
The past decade has witnessed considerable development in time series clustering, driven by emerging concepts such as big data (Aghabozorgi, Shirkhorshidi, & Wah, 2015). For example, the number of smart meters installed in the U.S. reached 86.8 million in 2018 (Eia, 2020), and every ten thousand meters store about 0.7 GB of hourly data over one year. Such high-resolution data of large volume bring the curse of big data, which involves two concrete problems. On the one hand, the sharp increase in available data hinders conventional clustering algorithms from detecting typical patterns in large time series datasets (Bendechache et al., 2016, Bi et al., 2016), due to a lack of computing power. On the other hand, dissimilarity measures between time series, such as Dynamic Time Warping (DTW) (Itakura, 1975), are computationally expensive (Lemire, 2009, Sarda-Espinosa, 2020), which further hinders the adoption of clustering algorithms even for medium-sized datasets. Thus, there is growing interest in decoding the hidden patterns of large time series datasets into useful information for tackling real-world applications.
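Since DTW's cost is central to the argument above, a minimal Python sketch of the classic O(nm) dynamic-programming DTW may help illustrate why pairwise comparisons become expensive on large datasets. This is our illustration (names are ours), not the fast DTW variant used in DBPEC.

```python
import math

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences,
    computed with the classic O(n*m) dynamic program."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # match step
    return cost[n][m]

# Time-shifted but otherwise identical sequences align perfectly:
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # -> 0.0
```

Because every pair of series costs O(nm), an all-pairs distance matrix over N series costs O(N²nm), which is exactly the bottleneck that lower bounds and distributed computation aim to relieve.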
In this paper, we propose an evidential clustering algorithm named distributed belief-peaks evidential clustering (DBPEC) to group time series data, with the help of the notion of belief peaks presented in Su and Denoeux (2018). Different from clustering algorithms that derive hard partitions (e.g., k-means, hierarchical methods and Self-Organizing Maps) or fuzzy partitions (e.g., fuzzy c-means and the Gaussian mixture model), DBPEC creates a credal partition (Denoeux and Masson, 2004, Masson and Denoeux, 2008) that better describes the ambiguous and uncertain information implied in clustering memberships, within the scope of evidence theory (Dempster, 2008, Shafer, 1976). As a non-iterative algorithm, DBPEC separates the raw time series dataset into several partitions and parallelizes the calculation of belief peaks under Apache Spark (Karau et al., 2015, Zaharia et al., 2012). By drawing a visual decision graph rather than presetting a fixed number, DBPEC semi-automatically detects the cluster centers. By integrating a simple and fast DTW distance into DBPEC, the final clustering result is output in parallel. The main contributions of this work are summarized as follows:
- a scalable evidential clustering algorithm, DBPEC, is proposed, which directly performs clustering analysis on millions of time series based on Apache Spark, while avoiding the destruction of the raw data structure;
- by incorporating a fast version of DTW, DBPEC generates a more practical and interpretable clustering result for time series datasets of medium or large volume, relaxing the sensitivity of time series clustering to time shifts;
- the ambiguity and uncertainty in the membership of each time series to the clusters are, for the first time, described in the form of a credal partition, helping practitioners obtain a more reasonable interpretation of clustering results when handling real-world problems.
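To make the credal-partition contribution above concrete, the following Python sketch (our illustration, not the authors' code) shows what a credal partition stores for a single object: a mass function over *subsets* of the set of clusters, rather than a single label or a fuzzy membership vector. The cluster names and mass values are invented for the example; the empty set represents an outlier and the full set represents total ignorance.

```python
# Illustrative mass function of one time series over subsets of
# two hypothetical clusters "w1" and "w2".
mass = {
    frozenset(): 0.05,                # outlier hypothesis
    frozenset({"w1"}): 0.60,          # evidence for cluster w1
    frozenset({"w2"}): 0.15,          # evidence for cluster w2
    frozenset({"w1", "w2"}): 0.20,    # ignorance: "w1 or w2"
}

def belief(A):
    """Bel(A): total mass of non-empty subsets contained in A."""
    return sum(m for B, m in mass.items() if B and B <= A)

def plausibility(A):
    """Pl(A): total mass of subsets intersecting A."""
    return sum(m for B, m in mass.items() if B & A)

A = frozenset({"w1"})
print(belief(A), plausibility(A))  # Bel ~ 0.60, Pl ~ 0.80 (up to float rounding)
```

The gap between belief (0.60) and plausibility (0.80) quantifies exactly the kind of ambiguity that hard and fuzzy partitions cannot express.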
The rest of this paper is organized as follows. The related work on clustering big time series datasets, the motivation of DBPEC, and some basic notions of evidence theory and DTW are recalled in Section 2. In Section 3, we successively introduce the basic idea, workflow and Spark-based design of DBPEC. Several numerical datasets are used to evaluate the performance of DBPEC, and a real-world dataset illustrates its effectiveness in Section 4. Section 5 concludes this paper and outlines future work.
Section snippets
Preliminaries
In this section, we first review the related work on clustering big time series datasets and the motivation of DBPEC in Section 2.1. Then, some basic notions of evidence theory and DTW are introduced in Sections 2.2 and 2.3, respectively.
The method: DBPEC
In this section, we first detail the basic idea and workflow of DBPEC in Section 3.1. The specific design of DBPEC under the Spark framework is then introduced in Section 3.2.
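The partition-then-parallelize idea mentioned above can be sketched in plain Python. In the paper this is realized with Apache Spark RDDs and belief-peak computations; in the hedged stand-in below, a thread pool fakes Spark's parallelism and a simple neighbor-count density replaces the belief score, purely to show the map-over-partitions-then-merge shape. All names and data are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def local_density(partition, radius=1.0):
    """Neighbor count of each point within its own partition
    (a toy stand-in for the per-partition belief computation)."""
    return [(x, sum(1 for y in partition if abs(x - y) <= radius))
            for x in partition]

data = [0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.9]   # toy 1-D stand-ins for series
partitions = [data[0:3], data[3:6], data[6:]]  # mimic Spark partitions

# Map the local computation over partitions in parallel, then merge,
# much like an RDD mapPartitions followed by a collect.
with ThreadPoolExecutor() as pool:
    merged = [pair for part in pool.map(local_density, partitions)
              for pair in part]

# Globally rank the merged scores to shortlist candidate "peaks":
peaks = sorted(merged, key=lambda p: -p[1])[:2]
print(peaks)
```

The key property this shape preserves is that each partition is processed independently, so only the small per-object scores, not the raw series, need to be merged at the driver.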
Experimental results
In this section, we first consider 30 small time series datasets from the UCR archive (Chen et al., 2015) to evaluate the performance of DBPEC in Section 4.1. These 30 selected datasets not only contain two-class (e.g., BeetleFly) and multi-class (e.g., Shake) problems but also cover low-dimensional (e.g., Chinatown) and high-dimensional (e.g., Rock) ones. To assess scalability, DBPEC is run on 4 other big datasets containing up to approximately 3 million objects from UCI
Conclusion
In this paper, we introduce a distributed evidential clustering algorithm for time series data under Apache Spark, named DBPEC. Compared with the 9 popular clustering algorithms referred to in this paper, DBPEC shows statistically better performance on 30 small datasets, owing to the derivation of credal partitions. DBPEC also outperforms 4 other state-of-the-art clustering algorithms in analyzing big time series datasets. The experimental results demonstrate that DBPEC can tackle big datasets which
CRediT authorship contribution statement
Chaoyu Gong: Proposal of algorithm, Programming. Zhi-gang Su: Interpretation of data. Pei-hong Wang: Interpretation of data. Yang You: Modification of language writing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank the EiC, AE and anonymous referees for their invaluable comments and suggestions. This work is supported by the National Natural Science Foundation of China under Grant 51876035 and Grant 51976032.
References (56)
- Time-series clustering – A decade review. Information Systems (2015)
- Stock market co-movement assessment using a three-phase clustering method. Expert Systems with Applications (2014)
- Study on density peaks clustering based on K-nearest neighbors and principal component analysis. Knowledge-Based Systems (2016)
- Cumulative belief peaks evidential K-nearest neighbor clustering. Knowledge-Based Systems (2020)
- An evidential clustering algorithm by finding belief-peaks and disjoint neighborhoods. Pattern Recognition (2021)
- Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recognition (2009)
- ECM: An evidential version of the fuzzy c-means algorithm. Pattern Recognition (2008)
- MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing (2015)
- Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Information Sciences (2016)
- Adaptive density peak clustering based on K-nearest neighbors with aggregating strategy. Knowledge-Based Systems (2017)
- A hybrid algorithm for clustering of time series data based on affinity search technique. The Scientific World Journal
- Multi-layered clustering for power consumption profiling in smart grids. IEEE Access
- UCI machine learning repository
- The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery
- Efficient large scale clustering based on data partitioning
- Pattern recognition with fuzzy objective function algorithms
- Fuzzy based scalable clustering algorithms for handling big data using Apache Spark. IEEE Transactions on Big Data
- A big data clustering algorithm for mitigating the risk of customer churn. IEEE Transactions on Industrial Informatics
- The UCR time series classification archive
- Comparisons among clustering techniques for electricity customer classification. IEEE Transactions on Power Systems
- GAN-based anomaly detection and localization of multivariate time series data for power plant
- A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence
- MapReduce: Simplified data processing on large clusters. Communications of the ACM
- MapReduce: A flexible data processing tool. Communications of the ACM
- Upper and lower probabilities induced by a multivalued mapping
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research
- EVCLUS: Evidential clustering of proximity data. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics)
- How many smart meters are installed in the United States, and who has them?