Distributed evidential clustering toward time series with big data issue

https://doi.org/10.1016/j.eswa.2021.116279

Highlights

  • A distributed evidential clustering algorithm parallelized by Spark is proposed.

  • DBPEC analyzes millions of time series without destroying the raw data structure.

  • By incorporating a fast version of DTW, DBPEC generates practical results for big data.

  • Ambiguity and uncertainty in memberships are better described in DBPEC.

  • Credal partitions help users obtain reasonable explanations for real-world problems.

Abstract

To analyze large volumes of time series data, most existing clustering algorithms rely on data reduction techniques or multi-level strategies. However, these approaches inevitably destroy the raw data structure, leading to information loss and even abnormal, unexplainable clustering results. To tackle these issues, we propose a distributed evidential clustering algorithm that can be applied directly to raw big data, and we parallelize it under Apache Spark, a processing engine built for sophisticated data analysis. Concretely, the possibility of becoming a cluster center is first calculated in parallel for each data object under the framework of evidence theory. After a decision graph is drawn, the cluster centers are determined and a credal partition of the time series data is derived. Without simplifying the data structure, the proposed algorithm can not only detect the number of clusters but also describe the ambiguity and uncertainty in the memberships of every data object. Experiments on several benchmark datasets and one real-world problem show the strong scalability and good clustering performance of the proposed algorithm.

Introduction

A time series is a sequence of continuous elements arranged in chronological order (Guijo-Rubio, Durán-Rosal, Gutiérrez, Troncoso, & Hervás-Martínez, 2020). Time series analysis uses statistical techniques to model data mathematically and discover its patterns (Bagnall, Lines, Bostrom, Large, & Keogh, 2017). Its applications span a variety of scientific fields, such as anomaly detection (Choi, Lim, Choi, & Kim, 2020), signal analysis (Manolakis, Bosowski, & Ingle, 2019), environment information forecasting (Wen, Yang, Jiang, Song, & Wang, 2020), and electricity load profiling (Liang & Ma, 2020). As one of the most important unsupervised tasks, clustering also plays a key role in time series analysis, aiming to segment data objects into patterns (called clusters) with similar characteristics (Gong et al., 2020, Gong et al., 2021).

The recent decade has witnessed considerable development in time series clustering, driven by emerging concepts such as big data (Aghabozorgi, Shirkhorshidi, & Wah, 2015). For example, the number of smart meters installed in the U.S. reached 86.8 million in 2018 (Eia, 2020), and every ten thousand meters store about 0.7 GB of hourly data over one year. Such high-resolution, large-volume data bring the curse of big data, which involves two kinds of problems. On the one hand, the sharp growth of available data hampers conventional clustering algorithms in detecting typical patterns from large time series datasets (Bendechache et al., 2016, Bi et al., 2016), owing to a lack of computing power. On the other hand, dissimilarity measures between time series, such as Dynamic Time Warping (DTW) (Itakura, 1975), are computationally expensive (Lemire, 2009, Sarda-Espinosa, 2020) and further hinder the adoption of clustering algorithms, even for medium-sized datasets. Thus, there is growing interest in how to decode the hidden patterns in a large time series dataset into useful information for real-world applications.
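To make the cost of DTW concrete, the following is a minimal sketch of the classic dynamic-programming formulation, which fills an (n+1) x (m+1) table for two series of lengths n and m. The function name `dtw_distance`, the absolute-difference local cost, and the optional `window` parameter (a Sakoe-Chiba band, one common way to cut the cost) are illustrative assumptions; this is not necessarily the fast DTW variant used by the authors.

```python
import math


def dtw_distance(a, b, window=None):
    """Textbook DTW between two numeric sequences (illustrative sketch).

    `window` is an assumed optional Sakoe-Chiba band half-width; None means
    no constraint, so the full n x m table is filled.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    if window is None:
        window = max(n, m)
    # The band must be at least |n - m| wide, or the end cell is unreachable.
    window = max(window, abs(n - m))

    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched again (stretch b)
                cost[i][j - 1],      # b[j-1] matched again (stretch a)
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]
```

Because every pair of series needs its own table, a full pairwise distance matrix over N series costs on the order of N² such computations, which is why the measure becomes a bottleneck even for medium-sized datasets. The warping also explains the tolerance to time shifts: a series and a delayed copy of it can align at zero cost.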

In this paper, we propose an evidential clustering algorithm named distributed belief-peaks evidential clustering (DBPEC) to group time series data, building on the notion of belief peaks presented in Su and Denoeux (2018). Unlike clustering algorithms that derive hard partitions (e.g., k-means, hierarchical methods and Self-Organizing Maps) or fuzzy partitions (e.g., fuzzy c-means and Gaussian mixture models), DBPEC creates a credal partition (Denoeux and Masson, 2004, Masson and Denoeux, 2008) that, within the scope of evidence theory (Dempster, 2008, Shafer, 1976), can better describe the ambiguous and uncertain information implied in clustering memberships. As a non-iterative algorithm, DBPEC separates the raw time series dataset into several partitions and parallelizes the calculation of belief peaks under Apache Spark (Karau et al., 2015, Zaharia et al., 2012). By drawing a visual decision graph rather than presetting a fixed number of clusters, DBPEC semi-automatically detects the cluster centers. With a simple and fast DTW distance integrated into DBPEC, the final clustering result is output in parallel. The main contributions of this work are summarized as follows:

  • a scalable evidential clustering algorithm, DBPEC, is heuristically proposed; built on Apache Spark, it directly clusters millions of time series and avoids destroying the raw data structure;

  • by incorporating a fast version of DTW, DBPEC generates a more practical and explicable clustering result for time series datasets of medium/big volume, reducing the sensitivity of time series clustering to time shifts;

  • the ambiguity and uncertainty in the memberships of every time series to clusters are better described in the form of a credal partition for the first time, helping managers obtain more reasonable explanations of clustering results when handling real-world problems.
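To illustrate what a credal partition adds over hard or fuzzy memberships, here is a small sketch using standard evidence-theory definitions. Each object carries a mass function over subsets of the cluster set, so mass on a set of several clusters expresses ambiguity and mass on the empty set suggests an outlier. The two-cluster frame, the object names and all numeric masses are made up for illustration and do not come from the paper.

```python
from math import isclose

# Hypothetical frame of two clusters; the labels are illustrative only.
W1, W2 = "w1", "w2"
FRAME = frozenset({W1, W2})
EMPTY = frozenset()


def check_mass(mass):
    """A mass function assigns non-negative weights summing to 1 to subsets."""
    assert all(v >= 0 for v in mass.values())
    assert isclose(sum(mass.values()), 1.0)


def belief(mass, subset):
    """Total mass of non-empty focal sets fully contained in `subset`."""
    return sum(v for focal, v in mass.items() if focal and focal <= subset)


def plausibility(mass, subset):
    """Total mass of focal sets that intersect `subset`."""
    return sum(v for focal, v in mass.items() if focal & subset)


def pignistic(mass, frame):
    """Spread each focal set's mass evenly over its elements (needs m(empty) < 1)."""
    conflict = mass.get(EMPTY, 0.0)
    if conflict >= 1.0:
        raise ValueError("all mass is on the empty set")
    return {
        w: sum(v / len(focal) for focal, v in mass.items() if w in focal)
        / (1.0 - conflict)
        for w in frame
    }


# A toy credal partition: one mass function per object (made-up numbers).
credal_partition = {
    "certain": {frozenset({W1}): 1.0},                          # hard member of w1
    "ambiguous": {frozenset({W1}): 0.2, FRAME: 0.8},            # lies between clusters
    "outlier": {EMPTY: 0.9, frozenset({W1}): 0.1},              # fits no cluster well
}
for m in credal_partition.values():
    check_mass(m)
```

For the "ambiguous" object, the belief in w2 is 0 while its plausibility is 0.8, a gap that a single fuzzy membership degree cannot express. Collapsing the mass with `pignistic` recovers a probability-like membership when a single decision is needed.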

The rest of this paper is organized as follows. Section 2 recalls related work on clustering big time series datasets, the motivation of DBPEC, and some basic notions of evidence theory and DTW. Section 3 introduces, in turn, the basic idea, workflow and Spark-based design of DBPEC. In Section 4, several numerical datasets are used to evaluate the performance of DBPEC, and a real-world dataset is considered to illustrate its effectiveness. Finally, Section 5 concludes this paper and outlines future work.

Section snippets

Preliminaries

In this section, we first review related work on clustering big time series datasets and the motivation of DBPEC in Section 2.1. Then, some basic notions of evidence theory and DTW are introduced in Sections 2.2 (Evidence theory) and 2.3 (DTW: dynamic time warping).

The method: DBPEC

In this section, we first detail the basic idea and workflow of DBPEC in Section 3.1. The specific design of DBPEC under the Spark framework is then introduced in Section 3.2.
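The decision-graph step can be sketched in the style of density-peaks clustering, where each object gets a "peak score" and a distance to the nearest object with a higher score, and objects with both values large stand out as candidate centers. The sketch below is an assumption-laden stand-in: it uses a Gaussian kernel density as the score in place of the belief-based score of DBPEC (whose formula is not given in this excerpt), runs on a single machine rather than Spark, and replaces visual inspection of the graph with a hypothetical top-k rule. The names `decision_graph` and `pick_centers` are illustrative.

```python
import math


def decision_graph(points, dist, bandwidth):
    """Return (rho, delta) for a density-peaks style decision graph.

    rho[i]   : stand-in peak score (Gaussian kernel density, NOT the
               evidential belief used by DBPEC).
    delta[i] : distance to the nearest object with a strictly higher score;
               for the top-scoring object, its largest distance to any object.
    """
    n = len(points)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    rho = [
        sum(math.exp(-((d[i][j] / bandwidth) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(d[i]))
    return rho, delta


def pick_centers(rho, delta, k):
    """Assumed shortcut for reading the graph: take the k largest rho * delta."""
    gamma = [r * dl for r, dl in zip(rho, delta)]
    ranked = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 1-D data with two well-separated groups (made-up values).
    data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 13.0, 14.0]
    rho, delta = decision_graph(data, lambda x, y: abs(x - y), bandwidth=1.0)
    print(pick_centers(rho, delta, k=2))
```

For time series, `dist` could be any pairwise dissimilarity such as a DTW distance. In practice the number of centers would be read off the plotted graph rather than fixed in advance, which is what makes the detection semi-automatic.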

Experimental results

In this section, we first consider 30 small time series datasets from the UCR archive (Chen et al., 2015) to evaluate the performance of DBPEC in Section 4.1. These 30 selected datasets include not only two-class (e.g., BeetleFly) and multi-class (e.g., Shake) ones but also low-dimensional (e.g., Chinatown) and high-dimensional (e.g., Rock) ones. To assess scalability, DBPEC is run on 4 other big datasets of up to approximately 3 million objects from UCI

Conclusion

In this paper, we introduce a distributed evidential clustering algorithm for time series data under Apache Spark, named DBPEC. Compared with 9 popular clustering algorithms referred to in this paper, DBPEC shows statistically better performance on 30 small datasets due to the derivation of credal partitions. DBPEC also outperforms 4 other state-of-the-art clustering algorithms in analyzing big time series datasets. The experimental results demonstrate that DBPEC can tackle the big datasets which

CRediT authorship contribution statement

Chaoyu Gong: Proposal of algorithm, Programming. Zhi-gang Su: Interpretation of data. Pei-hong Wang: Interpretation of data. Yang You: Modification of language writing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank the EiC, AE and anonymous referees for their invaluable comments and suggestions. This work is supported by the National Natural Science Foundation of China under Grant 51876035 and Grant 51976032.

References (56)

  • Aghabozorgi, S., et al. (2014). A hybrid algorithm for clustering of time series data based on affinity search technique. The Scientific World Journal.
  • Al-Jarrah, O. Y., et al. (2017). Multi-layered clustering for power consumption profiling in smart grids. IEEE Access.
  • Bache, K., et al. (2013). UCI machine learning repository.
  • Bagnall, A., et al. (2017). The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery.
  • Bendechache, M., et al. Efficient large scale clustering based on data partitioning.
  • Bezdek, J. C. (2013). Pattern recognition with fuzzy objective function algorithms.
  • Bharill, N., et al. (2016). Fuzzy based scalable clustering algorithms for handling big data using Apache Spark. IEEE Transactions on Big Data.
  • Bi, W., et al. (2016). A big data clustering algorithm for mitigating the risk of customer churn. IEEE Transactions on Industrial Informatics.
  • Chen, Y., et al. (2015). The UCR time series classification archive.
  • Chicco, G., et al. (2006). Comparisons among clustering techniques for electricity customer classification. IEEE Transactions on Power Systems.
  • Choi, Y., et al. GAN-based anomaly detection and localization of multivariate time series data for power plant.
  • Davies, D. L., et al. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Dean, J., et al. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM.
  • Dean, J., et al. (2010). MapReduce: A flexible data processing tool. Communications of the ACM.
  • Dempster, A. P. Upper and lower probabilities induced by a multivalued mapping.
  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
  • Denoeux, T., et al. (2004). EVCLUS: Evidential clustering of proximity data. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics).
  • Eia, M. (2020). How many smart meters are installed in the United States, and who has them?