Big data time series forecasting based on pattern sequence similarity and its application to the electricity demand
Introduction
The volume of available time series data has increased considerably in recent decades, driven by the growing interest in storing and analyzing huge amounts of data [6]. Examples include datasets collected from smart meters over decades, from hundreds of buildings, or at a very high measurement frequency [24], [25].
Time series forecasting algorithms can create models based on historical data and make predictions for given target variables of interest [44], [43]. The computation time of these algorithms may increase notably when big data time series are evaluated. Therefore, single-core machine environments are insufficient and must be complemented with additional computational resources. For this reason, it becomes necessary to distribute the data and its computation across multiple nodes using a cluster of machines.
The Pattern Sequence based Forecasting algorithm (PSF) [21] is an effective general-purpose multi-output approach particularly designed to deal with time series for prediction horizons of an arbitrary length. This multi-output feature has turned PSF into a flexible tool that has been used in different fields of research. PSF is mainly based on the identification of certain patterns that are searched for throughout the whole dataset. Such patterns are calculated by means of any clustering technique (k-means in the original work) and, later, sequences of clusters are formed to characterize the target time series.
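The pattern search described above can be illustrated with a minimal sketch. It assumes the series has already been split into equal-length daily profiles, uses a deliberately tiny k-means with a deterministic start (the first k profiles), and averages the days that followed each past occurrence of the most recent label sequence. The function names (`kmeans`, `psf_forecast`), the parameters `k` and `window`, and the fallback of shrinking the window when no match is found are choices of this sketch, not details taken from the original PSF publication.

```python
def kmeans(points, k, iters=20):
    """Tiny k-means over lists of floats (sketch only).

    Centers start at the first k points, which assumes those points
    are distinct; a real implementation would seed more carefully.
    """
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def psf_forecast(days, k=2, window=2):
    """Forecast the next daily profile from a list of past daily profiles."""
    labels = kmeans(days, k)
    w = window
    while w > 0:
        pattern = labels[-w:]  # most recent sequence of cluster labels
        # Days that followed every earlier occurrence of the same sequence.
        followers = [
            days[i + w]
            for i in range(len(days) - w)
            if labels[i:i + w] == pattern
        ]
        if followers:
            # Prediction: element-wise mean of the following days.
            return [sum(col) / len(followers) for col in zip(*followers)]
        w -= 1  # no match: retry with a shorter sequence (sketch's choice)
    return list(days[-1])  # last resort: repeat the most recent day
```

Because a whole daily profile is predicted at once, the output has as many values as one day of the input, which reflects the multi-output character mentioned above.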
In this work, a new algorithm, hereinafter called bigPSF, is proposed. It is inspired by the pattern search strategy introduced in the original PSF algorithm, but makes two major contributions to the literature. First, it is scalable, thanks to its distributed computation under the Apache Spark framework. That is, bigPSF is suitable for handling big data time series and mining millions of records with reduced execution times compared with PSF. Second, several modifications with respect to the original PSF have been introduced in bigPSF, improving the predictions and achieving higher accuracy. Hence, the algorithm mainly covers the volume and velocity dimensions of the well-established 4-Vs big data paradigm [6].
Although bigPSF is also a general-purpose algorithm, data related to electricity demand have been used to assess its performance. A case study of the original methodology, along with its latest methodological improvement for predicting electricity demand data, can be found in Ref. [19]. Compared to PSF and five other well-known prediction algorithms, bigPSF achieved higher accuracy than each of them and was able to deal with big data, exhibiting linear behavior in terms of computation time.
The rest of the paper is structured as follows. Section 2 reviews papers related to PSF and to big data time series forecasting. Section 3 describes the proposed methodology. Section 4 reports all results and discusses the performance in terms of both errors and scalability. Finally, Section 5 summarizes the most significant achievements of the manuscript.
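To give an intuition of why a pattern search of this kind can be distributed, the following sketch splits the search into partitions, computes partial sums and counts per partition, and combines them with a reduce step. It is only a single-machine stand-in written with plain Python `map` and `functools.reduce`; it is not the Spark implementation of bigPSF, and all names (`make_partitions`, `map_partition`, `combine`, `distributed_forecast`) are hypothetical. It assumes cluster labels are already available, one per daily profile.

```python
from functools import reduce


def make_partitions(labels, days, window, n_parts):
    """Split the candidate match positions into roughly n_parts chunks."""
    n_starts = len(days) - window  # positions that still have a following day
    if n_starts <= 0:
        return []
    size = -(-n_starts // n_parts)  # ceiling division
    parts = []
    for lo in range(0, n_starts, size):
        hi = min(lo + size, n_starts)
        # Carry `window` extra items so matches straddling a boundary are seen.
        parts.append((labels[lo:hi + window], days[lo:hi + window]))
    return parts


def map_partition(part, pattern):
    """Map step: partial sum and count of the days that follow a match."""
    labels, days = part
    w = len(pattern)
    sums, count = None, 0
    for i in range(len(labels) - w):
        if labels[i:i + w] == pattern:
            nxt = days[i + w]
            sums = list(nxt) if sums is None else [s + x for s, x in zip(sums, nxt)]
            count += 1
    return sums, count


def combine(a, b):
    """Reduce step: merge two (sums, count) partial results."""
    (sa, ca), (sb, cb) = a, b
    if sa is None:
        return b
    if sb is None:
        return a
    return [x + y for x, y in zip(sa, sb)], ca + cb


def distributed_forecast(labels, days, window, n_parts=4):
    """Average of the days following each occurrence of the latest pattern."""
    pattern = labels[-window:]
    parts = make_partitions(labels, days, window, n_parts)
    # Built-in map stands in for a distributed map over partitions.
    partials = map(lambda p: map_partition(p, pattern), parts)
    sums, count = reduce(combine, partials, (None, 0))
    return None if count == 0 else [s / count for s in sums]
```

Since each partition only returns a sum vector and a count, the amount of data exchanged in the reduce step does not depend on the partition size, which is the property that makes this style of aggregation attractive on a cluster.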
Section snippets
Related works
This section reviews the most relevant related works and is structured in two parts. First, works related to the PSF algorithm are reviewed and summarized, highlighting their main contributions to the literature. Second, works related to big data time series forecasting are reported and discussed.
The PSF algorithm was first published in 2011 [21]. It was developed to deal with time series and proposed, for the first time, to use clustering methods to
Methodology
This section describes the methodology proposed to predict big data time series of electricity consumption. The novel and distributed bigPSF algorithm, based on the existing PSF algorithm, is introduced. bigPSF is a forecasting algorithm able to handle big data time series in a scalable way. In addition to the scalability property, some modifications have been proposed in order to enhance the prediction results of the original PSF algorithm.
Section 3.1 describes PSF, detailing how its main
Results
This section presents and discusses the experiments carried out to assess the performance of bigPSF for a 24-h prediction horizon using a big data time series of electrical consumption from Uruguay. Furthermore, a comparative analysis is performed between bigPSF and other approaches published in the literature.
Algorithm 1. The bigPSF algorithm
This section is divided as follows. Section 4.1 describes the dataset on which bigPSF has been tested. The quality measures used to evaluate its
Conclusions
This work proposes the bigPSF algorithm, based on distributed computing, to process and forecast big data time series. This highly scalable algorithm is capable of processing and extracting results from datasets containing millions of records with reduced execution times.
In a big data environment, where the consumption habits of the population can change over time, it is very important to calibrate the way the prediction is computed. In this sense, for instance, in a dataset containing samples
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors would like to thank the Spanish Ministry of Science, Innovation and Universities for the support under project TIN2017-88209-C2-1-R.
References (46)
- et al. A novel imputation methodology for time series based on pattern sequence forecasting. Pattern Recognition Letters (2018)
- et al. A novel Spark-based multi-step forecasting algorithm for big data time series. Information Sciences (2018)
- et al. Model-based clustering of multivariate functional data. Computational Statistics and Data Analysis (2014)
- et al. Discovery of motifs to forecast outlier occurrence in time series. Pattern Recognition Letters (2011)
- et al. A novel deep learning ensemble model with data denoising for short-term wind speed forecasting. Energy Conversion and Management (2020)
- et al. MRF: MapReduce based forecasting algorithm for time series data. Procedia Computer Science (2018)
- et al. Big data time series forecasting based on nearest neighbors distributed computing with Spark. Knowledge-Based Systems (2018)
- et al. MV-kWNN: a novel multivariate and multi-output weighted nearest neighbors algorithm for big data time series forecasting. Neurocomputing (2019)
- D. Arthur, S. Vassilvitskii, K-Means++: The advantages of careful seeding, in: Proceedings of the ACM-SIAM Symposium on...
- et al. Scalable k-means++
- PSF: Introduction to R Package for Pattern Sequence Based Forecasting Algorithm. The R Journal
- Big data: a survey. Mobile Networks and Applications
- Multi-step forecasting for big data time series forecasting based on ensemble learning. Knowledge-Based Systems
- Hybrid leakage management for water network using PSF algorithm and soft computing techniques. Water Resources Management
- Improved pattern sequence-based forecasting method for electricity load. IEEJ Transactions on Electrical and Electronic Engineering
- Time series analysis with Apache Spark and its applications to energy informatics. Energy Informatics
- Midterm power load forecasting model based on kernel principal component analysis and back propagation neural network with particle swarm optimization. Big Data
- An approach to validity indices for clustering techniques in big data. Progress in Artificial Intelligence
- External clustering validity index based on chi-squared statistical test. Information Sciences