Big data time series forecasting based on pattern sequence similarity and its application to the electricity demand

doi:10.1016/j.ins.2020.06.014

Information Sciences

Volume 540, November 2020, Pages 160-174

Abstract

This work proposes a novel algorithm to forecast big data time series. Based on the well-established Pattern Sequence-based Forecasting algorithm, the new approach makes two major contributions to the literature: first, it improves the accuracy of the original algorithm's predictions; second, it adapts the algorithm to the big data context, achieving meaningful results in terms of scalability. The algorithm uses the Apache Spark distributed computation framework and is a ready-to-use application with few parameters to adjust. Physical and cloud clusters were used to carry out the experiments, which consisted of applying the algorithm to real-world electricity demand data from Uruguay.

Introduction

The amount of available time series data has increased considerably in recent decades, driven by the growing interest in storing and analyzing huge amounts of data [6]. Examples include datasets collected from smart meters over decades, from hundreds of buildings, or at a very high measurement frequency [24], [25].

Time series forecasting algorithms can create models based on historical data and make predictions for given target variables of interest [44], [43]. The computation time of these algorithms may increase notably when big data time series are evaluated. Single-core machine environments are therefore insufficient and need additional computational resources. For this reason, it becomes necessary to distribute the data and their computation across multiple nodes using a cluster of machines.

The Pattern Sequence based Forecasting algorithm (PSF) [21] is an effective general-purpose multi-output approach particularly designed to deal with time series for prediction horizons of an arbitrary length. This multi-output feature has turned PSF into a flexible tool that has been used in different fields of research. PSF is mainly based on the identification of certain patterns that are searched for throughout the whole dataset. Such patterns are calculated by means of any clustering technique (k-means in the original work) and, later, sequences of clusters are formed to characterize the target time series.
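
The pattern search described above can be sketched as follows. This is a minimal, single-machine illustration in Python, not the authors' implementation: it assumes the series has already been split into equal-length daily profiles, uses a bare-bones k-means, and follows the commonly described PSF procedure of matching the last w cluster labels against the history and averaging the days that followed each match. The names `kmeans`, `psf_forecast`, `k` and `w` are chosen for this sketch, and the fallback to a shorter window when no match is found is likewise an assumption here.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def psf_forecast(days, k=2, w=3):
    """Forecast the next day from a list of equal-length daily profiles."""
    labels = kmeans(days, k)  # one cluster label per day
    while w > 0:
        window = labels[-w:]  # the most recent pattern sequence
        # Collect the days that directly follow an earlier occurrence
        # of the same sequence of labels.
        followers = [
            days[i + w]
            for i in range(len(labels) - w)
            if labels[i:i + w] == window
        ]
        if followers:
            # The forecast is the element-wise mean of those days.
            return [sum(col) / len(followers) for col in zip(*followers)]
        w -= 1  # assumption of this sketch: retry with a shorter window
    raise ValueError("no matching pattern sequence found")


# Toy data: low-demand and high-demand days alternate, ending on a low day.
toy_days = [[1, 2, 1], [8, 9, 8]] * 3 + [[1, 2, 1]]
print(psf_forecast(toy_days, k=2, w=3))  # [8.0, 9.0, 8.0]
```

On the toy series the last three labels read low, high, low; every earlier occurrence of that sequence was followed by a high-demand day, so the forecast is the high-demand profile. Real data would normally be normalized before clustering, and k and w would be tuned rather than fixed.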

In this work, a new algorithm, hereinafter called bigPSF, is proposed. It is inspired by the pattern search strategy introduced in the original PSF algorithm, but it makes two major contributions to the literature. First, it is scalable, thanks to its distributed computation under the Apache Spark framework. That is, bigPSF is suitable for handling big data time series and mining millions of records with reduced execution times compared with PSF. Second, bigPSF modifies the original PSF in ways that improve the prediction results and achieve higher accuracy. Hence, the algorithm mainly covers the volume and velocity dimensions of the well-established 4-Vs big data paradigm [6].

Although bigPSF is also a general-purpose algorithm, data related to electricity demand have been used to assess its performance. A case study of the original methodology, along with its latest methodological improvement for predicting electricity demand data, can be found in Ref. [19]. Compared with PSF and five other well-known prediction algorithms, bigPSF achieved higher accuracy than each of them and was able to deal with big data, exhibiting linear behavior in terms of computation time.

The rest of the paper is structured as follows. Section 2 reviews the papers most relevant to PSF and to big data time series forecasting. Section 3 describes the proposed methodology. Section 4 reports all results and discusses the performance in terms of both errors and scalability. Finally, Section 5 summarizes the most significant achievements of the manuscript.

Section snippets

Related works

This section gives an overview of the most relevant related works and is structured in two parts. First, works related to the PSF algorithm are reviewed and summarized, highlighting their main contributions to the literature. Second, works related to big data time series forecasting are reported and discussed.

The PSF algorithm was first published in 2011 [21]. It was developed to deal with time series and proposed, for the first time, to use clustering methods to

Methodology

This section describes the methodology proposed to predict big data time series of electricity consumption. The novel and distributed bigPSF algorithm, based on the existing PSF algorithm, is introduced. bigPSF is a forecasting algorithm able to handle big data time series in a scalable way. In addition to the scalability property, some modifications have been proposed in order to enhance the prediction results of the original PSF algorithm.

Section 3.1 describes PSF, detailing how its main

Results

This section presents and discusses the experiments carried out to assess the performance of bigPSF for a 24-h prediction horizon using a big data time series of electrical consumption from Uruguay. Furthermore, bigPSF is compared with other approaches published in the literature.

Algorithm 1. The bigPSF algorithm

This section is divided as follows. Section 4.1 describes the dataset on which bigPSF has been tested. The quality measures used to evaluate its

Conclusions

This work proposes the bigPSF algorithm, based on distributed computing, to process and forecast big data time series. This highly scalable algorithm is capable of processing and extracting results from datasets containing millions of records in short execution times.

In a big data environment, where the consumption habits of the population can change over time, it is very important to calibrate the way the prediction is computed. In this sense, for instance, in a dataset containing samples

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank the Spanish Ministry of Science, Innovation and Universities for the support under project TIN2017-88209-C2-1-R.

References (46)

N. Bokde et al., A novel imputation methodology for time series based on pattern sequence forecasting, Pattern Recognition Letters (2018)
A. Galicia et al., A novel Spark-based multi-step forecasting algorithm for big data time series, Information Sciences (2018)
J. Jacques et al., Model-based clustering of multivariate functional data, Computational Statistics and Data Analysis (2014)
F. Martínez-Álvarez et al., Discovery of motifs to forecast outlier occurrence in time series, Pattern Recognition Letters (2011)
Z. Peng et al., A novel deep learning ensemble model with data denoising for short-term wind speed forecasting, Energy Conversion and Management (2020)
A. Sinha et al., MRF: MapReduce based forecasting algorithm for time series data, Procedia Computer Science (2018)
R. Talavera-Llames et al., Big data time series forecasting based on nearest neighbors distributed computing with Spark, Knowledge-Based Systems (2018)
R. Talavera-Llames et al., MV-kWNN: a novel multivariate and multi-output weighted nearest neighbors algorithm for big data time series forecasting, Neurocomputing (2019)
D. Arthur, S. Vassilvitskii, K-Means++: The advantages of careful seeding, in: Proceedings of the ACM-SIAM Symposium on...
B. Bahmani et al., Scalable k-means++
N. Bokde et al., PSF: Introduction to R Package for Pattern Sequence Based Forecasting Algorithm, The R Journal (2017)
N. Bokde, A. Troncoso, G. Asencio-Cortés, K. Kulat, F. Martínez-Álvarez, Pattern sequence similarity based techniques...
W. Chen et al., Big data: a survey, Mobile Networks and Applications (2014)
Y. Fujimoto, Y. Hayashi, Pattern sequence-based energy demand forecast using photovoltaic energy records, in:...
A. Galicia et al., Multi-step forecasting for big data time series forecasting based on ensemble learning, Knowledge-Based Systems (2018)
B. Greenwell, B. Boehmke, J. Cunningham, GBM Developers, GBM: generalized boosted regression models, 2019. R package...
A. Gupta et al., Hybrid leakage management for water network using PSF algorithm and soft computing techniques, Water Resources Management (2018)
C.H. Jin et al., Improved pattern sequence-based forecasting method for electricity load, IEEJ Transactions on Electrical and Electronic Engineering (2014)
I. Koprinska, M. Rana, A. Troncoso, F. Martínez-Álvarez, Combining pattern sequence similarity with neural networks for...
C. Krome et al., Time series analysis with Apache Spark and its applications to energy informatics, Energy Informatics (2018)
Z. Liu et al., Midterm power load forecasting model based on kernel principal component analysis and back propagation neural network with particle swarm optimization, Big Data (2019)
J.M. Luna-Romera et al., An approach to validity indices for clustering techniques in big data, Progress in Artificial Intelligence (2018)
J.M. Luna-Romera et al., External clustering validity index based on chi-squared statistical test, Information Sciences (2018)
