Elsevier

Information Sciences

Volume 540, November 2020, Pages 160-174

Big data time series forecasting based on pattern sequence similarity and its application to the electricity demand

https://doi.org/10.1016/j.ins.2020.06.014

Abstract

This work proposes a novel algorithm to forecast big data time series. Based on the well-established Pattern Sequence-based Forecasting algorithm, the new approach makes two major contributions to the literature. First, it improves the accuracy of the original algorithm's predictions. Second, it adapts the algorithm to the big data context, achieving meaningful results in terms of scalability. The algorithm runs on the Apache Spark distributed computation framework and is a ready-to-use application with few parameters to adjust. Physical and cloud clusters were used for the experiments, which consisted of applying the algorithm to real-world electricity demand data from Uruguay.

Introduction

The amount of available time series data has increased considerably in recent decades, driven by the growing interest in storing and analyzing huge amounts of data [6]. Examples include datasets collected from smart meters over decades, from hundreds of buildings, or at a very high measurement frequency [24], [25].

Time series forecasting algorithms build models from historical data and make predictions for given target variables of interest [44], [43]. The computation time of these algorithms may increase notably when big data time series are processed. Single-core machine environments are therefore insufficient and must be supplemented with additional computational resources. For this reason, it becomes necessary to distribute the data and its computation across multiple nodes in a cluster of machines.

The Pattern Sequence-based Forecasting algorithm (PSF) [21] is an effective general-purpose multi-output approach designed specifically for time series with prediction horizons of arbitrary length. This multi-output feature has made PSF a flexible tool that has been used in different fields of research. PSF is mainly based on identifying certain patterns that are searched for throughout the whole dataset. Such patterns are obtained by means of any clustering technique (k-means in the original work) and, later, sequences of clusters are formed to characterize the target time series.
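The pattern search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmeans_labels` and `psf_forecast`, the deterministic centroid initialization, the averaging of matched days, and the toy data are all assumptions made for illustration. The sketch assumes the series has already been split into equal-length days, clusters those days, and then predicts the next day by averaging the days that followed earlier occurrences of the most recent label sequence, shortening the window when no match exists.

```python
from statistics import fmean


def kmeans_labels(days, k, iterations=20):
    """Cluster equal-length daily profiles and return one label per day.

    Toy k-means (assumption): centroids start at evenly spaced days.
    """
    step = max(1, len(days) // k)
    centroids = [list(days[min(i * step, len(days) - 1)]) for i in range(k)]
    labels = [0] * len(days)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(day, centroids[c])))
            for day in days
        ]
        # Update step: each centroid becomes the mean of its member days.
        for c in range(k):
            members = [day for day, lab in zip(days, labels) if lab == c]
            if members:
                centroids[c] = [fmean(col) for col in zip(*members)]
    return labels


def psf_forecast(days, labels, window):
    """Predict the next day from days that followed the same label pattern."""
    for w in range(window, 0, -1):
        pattern = labels[-w:]
        # Days located right after an earlier occurrence of the pattern.
        followers = [
            i + w for i in range(len(labels) - w)
            if labels[i:i + w] == pattern
        ]
        if followers:
            return [fmean(col) for col in zip(*(days[j] for j in followers))]
    raise ValueError("no matching pattern found")


# Toy data (hypothetical): six "days" of two readings each,
# alternating between low-demand and high-demand profiles.
days = [[1.0, 2.0], [10.0, 12.0], [1.2, 2.2],
        [10.4, 12.4], [0.8, 1.8], [10.0, 12.0]]
labels = kmeans_labels(days, k=2)
forecast = psf_forecast(days, labels, window=2)
print(labels, forecast)
```

With this toy input the last two labels form the pattern low, high, which also occurs at the start of days 0 and 2, so the forecast is the average of the days at indices 2 and 4. Real daily profiles would have one value per measurement interval rather than two readings.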

In this work, a new algorithm, hereinafter called bigPSF, is proposed. It is inspired by the pattern search strategy introduced in the original PSF algorithm, but makes two major contributions to the literature. First, it is scalable, thanks to its distributed computation under the Apache Spark framework. That is, bigPSF is suitable for handling big data time series and mining millions of records, with reduced execution times compared with PSF. Second, bigPSF modifies the original PSF in ways that improve the predictions and achieve higher accuracy. Hence, the algorithm mainly covers the volume and velocity dimensions of the well-established 4-Vs big data paradigm [6].
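To give an intuition for why a pattern search of this kind lends itself to distributed computation, the following sketch splits the matching step into independent partitions whose partial results are then combined, in the style of a map and reduce. It is only an illustration under stated assumptions: it uses plain Python rather than the Apache Spark API, it does not reproduce the bigPSF design, and the names `match_partition`, `combine`, and `partitioned_forecast` are hypothetical.

```python
from functools import reduce


def match_partition(days, labels, pattern, start, stop):
    """Map step: scan pattern start positions in [start, stop).

    Returns the element-wise sum of the days that follow each match,
    together with the number of matches found in this partition.
    Every partition sees the full lists here; a truly distributed
    version would need overlapping chunks at partition boundaries.
    """
    w = len(pattern)
    total, count = [0.0] * len(days[0]), 0
    for i in range(start, min(stop, len(labels) - w)):
        if labels[i:i + w] == pattern:
            total = [t + x for t, x in zip(total, days[i + w])]
            count += 1
    return total, count


def combine(a, b):
    """Reduce step: add partial sums and match counts."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


def partitioned_forecast(days, labels, window, n_parts):
    """Average the days that follow the latest label pattern."""
    pattern = labels[-window:]
    size = -(-len(labels) // n_parts)  # ceiling division
    partials = [
        match_partition(days, labels, pattern, s, s + size)
        for s in range(0, len(labels), size)
    ]
    total, count = reduce(combine, partials)
    if count == 0:
        raise ValueError("no matching pattern found")
    return [t / count for t in total]


# Toy data (hypothetical): two readings per day, labels given directly.
days = [[1.0, 2.0], [10.0, 12.0], [1.2, 2.2],
        [10.4, 12.4], [0.8, 1.8], [10.0, 12.0]]
labels = [0, 1, 0, 1, 0, 1]
print(partitioned_forecast(days, labels, window=2, n_parts=3))
```

Because sums and counts are associative, the partial results can be merged in any order, so the forecast does not depend on the number of partitions beyond floating-point rounding.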

Although bigPSF is also a general-purpose algorithm, electricity demand data have been used to assess its performance. A case study of the original methodology, along with its latest methodological improvement for predicting electricity demand data, can be found in Ref. [19]. Compared with PSF and five other well-known prediction algorithms, bigPSF achieved higher accuracy than each of them and was able to deal with big data, exhibiting linear behavior in terms of computation time.

The rest of the paper is structured as follows. Section 2 reviews the papers most relevant to PSF and big data time series forecasting. Section 3 describes the proposed methodology. Section 4 reports all results and discusses the performance in terms of both errors and scalability. Finally, Section 5 summarizes the most significant achievements of the manuscript.

Related works

This section reviews the most relevant related works and is structured in two parts. First, works related to the PSF algorithm are reviewed and summarized, highlighting their main contributions to the literature. Second, works related to big data time series forecasting are reported and discussed.

The PSF algorithm was first published in 2011 [21]. It was developed to deal with time series and proposed, for the first time, to use clustering methods to

Methodology

This section describes the methodology proposed to predict big data time series of electricity consumption. The novel and distributed bigPSF algorithm, based on the existing PSF algorithm, is introduced. bigPSF is a forecasting algorithm able to handle big data time series in a scalable way. In addition to the scalability property, some modifications have been proposed in order to enhance the prediction results of the original PSF algorithm.

Section 3.1 describes PSF, detailing how its main

Results

This section presents and discusses the experiments carried out to assess the performance of bigPSF for a 24-h prediction horizon using a big data time series of electricity consumption from Uruguay. Furthermore, a comparative analysis sets bigPSF against other approaches published in the literature.

Algorithm 1. The bigPSF algorithm

This section is divided as follows. Section 4.1 describes the dataset on which bigPSF has been tested. The quality measures used to evaluate its

Conclusions

This work proposes the bigPSF algorithm, based on distributed computing, to process and forecast big data time series. This highly scalable algorithm is capable of processing and extracting results from datasets containing millions of records with reduced execution times.

In a big data environment, where the consumption habits of the population can change over time, it is very important to calibrate the way the prediction is computed. In this sense, for instance, in a dataset containing samples

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank the Spanish Ministry of Science, Innovation and Universities for the support under project TIN2017-88209-C2-1-R.

References (46)

  • N. Bokde et al., PSF: Introduction to R Package for Pattern Sequence Based Forecasting Algorithm, The R Journal (2017)
  • N. Bokde, A. Troncoso, G. Asencio-Cortés, K. Kulat, F. Martínez-Álvarez, Pattern sequence similarity based techniques...
  • W. Chen et al., Big data: a survey, Mobile Networks and Applications (2014)
  • Y. Fujimoto, Y. Hayashi, Pattern sequence-based energy demand forecast using photovoltaic energy records, in:...
  • A. Galicia et al., Multi-step forecasting for big data time series forecasting based on ensemble learning, Knowledge-Based Systems (2018)
  • B. Greenwell, B. Boehmke, J. Cunningham, GBM Developers, GBM: generalized boosted regression models, 2019. R package...
  • A. Gupta et al., Hybrid leakage management for water network using PSF algorithm and soft computing techniques, Water Resources Management (2018)
  • C.H. Jin et al., Improved pattern sequence-based forecasting method for electricity load, IEEJ Transactions on Electrical and Electronic Engineering (2014)
  • I. Koprinska, M. Rana, A. Troncoso, F. Martínez-Álvarez, Combining pattern sequence similarity with neural networks for...
  • C. Krome et al., Time series analysis with Apache Spark and its applications to energy informatics, Energy Informatics (2018)
  • Z. Liu et al., Midterm power load forecasting model based on kernel principal component analysis and back propagation neural network with particle swarm optimization, Big Data (2019)
  • J.M. Luna-Romera et al., An approach to validity indices for clustering techniques in big data, Progress in Artificial Intelligence (2018)
  • J.M. Luna-Romera et al., External clustering validity index based on chi-squared statistical test, Information Sciences (2018)