Information Sciences

Volumes 460–461, September 2018, Pages 497-518

Knowledge discovery in data streams with the orthogonal series-based generalized regression neural networks

https://doi.org/10.1016/j.ins.2017.07.013

Abstract

In this paper, a method for nonparametric regression estimation in a time-varying environment is presented. Orthogonal series-based kernels are used to design learning procedures tracking changes of non-stationary systems under non-stationary noise. The presented procedures, constructed in the spirit of generalized regression neural networks, are a very effective tool for dealing with stream data. Convergence in probability and with probability one is proved, and experimental results are given and discussed.

Introduction

One of the most challenging problems in data mining is learning in non-stationary environments. For a recent excellent survey of these problems the reader is referred to [1]. Various methods have been developed to cope with the so-called “concept drift” in the context of designing intelligent systems, stream data mining or incremental machine learning [2], [3], [4], [5], [6], [7], [8], [9], [10]. The vast majority of them are devoted to pattern classification, whereas only a few deal with non-stationary regression. Most of them rely on Gaussian or Markov models, extend the Support Vector Machine or the Extreme Learning Machine to regression problems, or implement regression trees or polynomial regression in a non-stationary environment. We will briefly describe these approaches.

A lot of work has been devoted to methods that treat regression as a Gaussian process. To address the problem of large-scale and non-stationary data sets, the authors in [11] proposed a K-Nearest-Neighbor-based Kalman filter for Gaussian process regression (KNN-KFGP). The developed method works in a few steps. First, a test-input-driven KNN mechanism is performed to group the training set into a number of small collections. Second, the latent function values of these collections are used as the unknown states and a novel state space model with a GP prior is constructed. Third, the Kalman filter on this state space model is employed to efficiently filter out the latent function values for prediction. As a result, the KNN mechanism helps each test point to find its strongly correlated local training subset, and thus the KNN-KFGP algorithm can model non-stationarity in a flexible manner. Another study of Gaussian process regression is presented in [12]. The author proposed two approaches for on-line Gaussian process regression with low computational and memory demands. The first approach assumes known hyperparameters and performs regression on a set of basis vectors that store mean and covariance estimates of the latent function. The second approach additionally learns the hyperparameters on-line. For this purpose, techniques from nonlinear Gaussian state estimation are exploited. More about Gaussian process regression can be found in [13], [14], [15].

A comparison of Markov switching regression, proposed in [16], with time-varying parameter methods is presented in [17]. The novelty of that paper was to select the coefficients of the detection methods by optimizing the profit objective functions of the trading activity, using statistical estimates as initial values. The paper also developed a sequential approach, based on sliding windows, to cope with the time-variability of Markov switching coefficients.

In [18], a cost-efficient online adaptive learning approach is proposed for Support Vector Regression (SVR) by combining Feature Vector Selection with Incremental and Decremental Learning. In this approach, the model is adaptively modified only when pattern drifts are detected according to the proposed criteria. Two tolerance parameters are introduced to control the computational complexity, reduce the influence of the intrinsic noise in the data and avoid the overfitting problem of SVR. The same authors proposed an SVR-based ensemble model in [19]. Other approaches applying SVR can be found in [20], [21], [22].

Since the On-Line Sequential Extreme Learning Machine (OS-ELM) was proposed in [23], many researchers have tried to apply this algorithm in non-stationary environments. In [24], the authors developed an algorithm combining the OS-ELM with an adaptive forgetting factor to improve performance in time-varying environments. A special batch variant of the ELM, the extreme learning machine with kernels (ELMK), was proposed in [25]. It uses unknown kernel mappings instead of known hidden layer mappings; as a consequence, there is no need to select the number of hidden nodes. Another combination of the ELM and kernel methods was proposed in [26]. In [10], a batch-learning, time-varying version of the ELM, called ELM-TV, is presented. The proposed version can deal with applications where training data arrive sequentially or in large numbers. In [27], a new sequential learning algorithm is constructed by combining the OS-ELM with Kalman filter regression.

Considerable effort has been devoted to the development of regression trees in non-stationary environments, see [28], [29]. The problem of functional polynomial regression in a non-stationary environment was considered in [30], [31].

In [32], the authors proposed a varying-coefficient fractionally exponential (VC-FEXP) model which allows one to detect dynamic changes in both short-memory and long-memory structures. This approach is built on a semi-parametric class of models whose specification extends the stationary fractionally exponential (FEXP) model by allowing the parameters in the spectra to vary smoothly over time. The authors applied a time-varying version of log-periodogram regression. Under this regression framework, they suggested a generalized goodness-of-fit test to detect various aspects of non-stationarity. Another test procedure is presented in [33]. The author proposed a Gini-based statistical test for a unit root. This test is based on the well-known Dickey–Fuller test [33], where the ordinary least squares regression is replaced by semi-parametric Gini regression in modeling the autoregressive process. The critical values are determined by a residual-based bootstrap method. The proposed methodology takes into account the variability of both values and ranks. Therefore, it provides robust, rank-based estimators while avoiding loss of information. The Gini methodology can be used for a wide range of distributions.

It should be emphasized that the methods and techniques listed above rely heavily on various heuristic approaches. Motivated by this fact, namely the lack of mathematically justified methods in the literature reviewed above, in this paper we develop non-parametric algorithms that track a wide spectrum of concept drifts and possess solid mathematical foundations.

Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables in $R^p$ with a common density function $f$. In this paper, we will consider two non-stationary models:

  • i)

    $Y_n = \rho(X_n) + Z_n, \quad n = 1, 2, \ldots,$

  • ii)

    $Y_n = \phi_n(X_n) + Z_n, \quad n = 1, 2, \ldots,$

where $\rho(\cdot)$ and $\phi_n(\cdot)$, for $n = 1, 2, \ldots$, are unknown functions and $Z_n$ are independent random variables with time-varying distributions such that $E Z_n = 0$, $E Z_n^2 = d_n$, $n = 1, 2, \ldots$. Various examples of systems working in the presence of noise with time-varying variances can be found in [34], [35], [36].
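
To fix ideas, the following minimal Python sketch generates a stream from model (ii); the particular choices of $\phi_n$, the input density and $d_n$ are hypothetical and serve illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_model_ii(n_samples, alpha=0.2, beta=0.1):
    """Generate a stream (X_n, Y_n) from model (ii): Y_n = phi_n(X_n) + Z_n.

    Illustrative choices (not taken from the paper's experiments):
    phi_n(x) = n**beta * sin(x) and d_n = E Z_n^2 = n**alpha, so both the
    regression function and the noise variance drift with n.
    """
    for n in range(1, n_samples + 1):
        x = rng.uniform(-3.0, 3.0)              # X_n i.i.d. with density f
        z = rng.normal(scale=n ** (alpha / 2))  # E Z_n = 0, E Z_n^2 = n**alpha
        yield x, n ** beta * np.sin(x) + z

stream = list(simulate_model_ii(1000))
```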

Our problem is to design a nonparametric procedure tracking changes of the unknown functions $\rho(x)$ in model (1) and $\phi_n(x)$ in model (2), for $n = 1, 2, \ldots$, based on the observations $(X_1, Y_1), (X_2, Y_2), \ldots$. To solve the problem we propose a nonparametric technique based on orthogonal series expansions of the unknown functions. Our approach can be treated as a variant of the generalized regression neural networks (GRNN) suggested by Specht [37] and studied by many authors in the non-stream scenario, see e.g. [38], [39]. Our paper differs from previous approaches in two aspects. First, contrary to the classical GRNN, where all the samples must be stored, we will use recursive formulas to cope with incoming stream data. Moreover, we will replace the Parzen-based kernel used in previous papers, given by $$K_n(u, x) = h_n^{-p} K\left(\frac{u - x}{h_n}\right),$$ where $K$ is an appropriately selected function and $h_n$ is a sequence of numbers called the bandwidth parameters, by orthogonal series-based kernels of type (6) and (7). These kernels, as emphasized in Section 5, are very well suited to the stream data scenario, mainly because the dependence upon the variables $u$ and $x$ splits in the kernel formulas (6) and (7). We will prove the tracking properties of our method even if both $\phi_n(x) \to \infty$ and $d_n \to \infty$ as $n \to \infty$, e.g. in the model $$Y_n = n^{\beta} \phi(X_n) + Z_n, \quad E Z_n^2 = n^{\alpha},$$ with both $\alpha$ and $\beta$ greater than 0 (see Example 2 in Section 3 for the upper limits imposed on $\alpha$ and $\beta$ to ensure convergence of the tracking algorithm).
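
Schematically, the recursive estimators we construct maintain running averages of the kernel-weighted outputs and of the kernel itself, and take their ratio. The sketch below conveys this structure only; the exact updates are given later as formulas (14) and (31), and all names here are ours.

```python
import numpy as np

class RecursiveOSGRNN:
    """Schematic recursive GRNN on a fixed mesh of points x.

    Maintains running averages num(x) ~ rho(x) f(x) and den(x) ~ f(x);
    the regression estimate is their ratio.  A sketch in the spirit of
    formula (14), not its exact reproduction.
    """

    def __init__(self, basis, x_grid, N_of_n, M_of_n):
        self.basis = basis            # basis(x, J) -> [g_0(x), ..., g_J(x)]
        self.x = np.asarray(x_grid, dtype=float)
        self.N_of_n, self.M_of_n = N_of_n, M_of_n
        self.num = np.zeros_like(self.x)
        self.den = np.zeros_like(self.x)
        self.n = 0

    def update(self, x_n, y_n):
        self.n += 1
        M, N = self.M_of_n(self.n), self.N_of_n(self.n)
        # Series kernels of type (6)-(7): K(u, x) = sum_j g_j(u) g_j(x).
        k_num = np.array([self.basis(x_n, M) @ self.basis(x, M) for x in self.x])
        k_den = np.array([self.basis(x_n, N) @ self.basis(x, N) for x in self.x])
        self.num += (y_n * k_num - self.num) / self.n  # mean of Y_i K(X_i, x)
        self.den += (k_den - self.den) / self.n        # mean of K(X_i, x)

    def estimate(self):
        return np.where(np.abs(self.den) > 1e-8, self.num / self.den, 0.0)
```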

The main contributions of this paper are listed as follows:

  • 1.

We present the incremental GRNNs, see formula (14), for the estimation of the function $\rho(x)$ in model (1), and we prove their convergence (see Theorems 1 and 2) even if the variance of the noise diverges to infinity, i.e. $d_n \to \infty$ (see Example 1).

  • 2.

We study the incremental GRNNs, see formula (31), for tracking the changing functions $\phi_n(x)$, $n = 1, 2, \ldots$, in model (2), and we prove their convergence (see Theorems 3 and 4) even if both $\phi_n(x) \to \infty$ and $d_n \to \infty$. Obviously, our GRNNs have tracking properties only if the function $\phi_n(x)$ does not tend to infinity too fast.

  • 3.

Our approach allows us to deal with different types of concept drift satisfying assumptions (34) and (45), see Section 3.

  • 4.

The proposed algorithms can be easily extended to the multidimensional case, as shown in Section 4, and moreover can work in two different scenarios, which is convenient when processing data streams (see Section 5).

  • 5.

Through computer experiments we illustrate the convergence of the GRNNs designed for modeling systems (1) and (2) under non-stationary noise (5).

  • 6.

We compare the performance of our approach with commonly used strategies for dealing with concept drift, i.e. sliding windows or methods equipped with a forgetting mechanism. Their performance is worse and, moreover, contrary to our approach, these methods do not ensure convergence as $n \to \infty$.

The rest of the paper is organized as follows. In Section 2, the proposed algorithms are introduced. The theorems showing their convergence for model (1) are presented in Section 2.1, and the corresponding theorems for model (2) in Section 2.2. In Section 3, we clarify how to choose the parameters of our incremental GRNNs for various concept drift cases. An extension to the multivariate case is presented in Section 4, whereas Section 5 presents an alternative version, convenient in a stream data scenario. Section 6 contains experimental evaluations of the proposed methods. Finally, conclusions are drawn in Section 7.

Section snippets

Learning algorithms and their convergence

For the sake of presentation clarity, in this section we consider the unidimensional case, $p = 1$; in Section 4 the method will be easily extended to the multidimensional case.

To estimate the regression functions $\rho(x)$ and $\phi_n(x)$ in models (1) and (2) we propose the orthogonal series-based kernels of the form $$K_n(u, x) = \sum_{j=0}^{N(n)} g_j(u) g_j(x)$$ and $$\bar{K}_n(u, x) = \sum_{j=0}^{M(n)} g_j(u) g_j(x),$$ where $\{g_j(\cdot)\}$, $j = 0, 1, 2, \ldots$, is a complete orthonormal system defined on $A \subset R$ such that $$\max_x |g_j(x)| < G_j,$$ and where $M(n)$ and $N(n)$ …
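
For illustration, a kernel of type (6) can be evaluated as below; the trigonometric system on $[0, 1]$ is used merely as an easily verified complete orthonormal system with bounded $G_j$ (the experiments in Section 6 use the Hermite and Laguerre systems), and the function names are ours.

```python
import numpy as np

def cosine_basis(x, J):
    """Complete orthonormal system on A = [0, 1]:
    g_0(x) = 1, g_j(x) = sqrt(2) cos(j pi x), so max_x |g_j(x)| < G_j
    holds with any G_j > sqrt(2)."""
    g = np.sqrt(2.0) * np.cos(np.arange(J + 1) * np.pi * x)
    g[0] = 1.0
    return g

def K_n(u, x, N):
    """Series kernel of type (6): K_n(u, x) = sum_{j=0}^{N(n)} g_j(u) g_j(x)."""
    return float(cosine_basis(u, N) @ cosine_basis(x, N))
```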

Discussion and examples: how to choose parameters in algorithms (14) and (31)

The critical point in implementing algorithms (14) and (31) is the choice of the following functions and parameters:

  • (i)

    algorithm (14) – the orthonormal system $\{g_j(\cdot)\}$, $j = 0, 1, 2, \ldots$, and the sequences $M(n)$ and $N(n)$,

  • (ii)

    algorithm (31) – the orthonormal system $\{g_j(\cdot)\}$, $j = 0, 1, 2, \ldots$, and the sequences $M(n)$, $N(n)$ and $\gamma_n$.

In the sequel, it will be shown that it is possible to choose $M(n)$, $N(n)$ and $\gamma_n$ for various concept drifts with relatively little prior knowledge about the underlying problems. Among others, it will be …

Extension to the multivariate case

As we indicated in Section 2, our method can be easily extended to the multivariate case. Let $g_j(\cdot)$, $j = 0, 1, 2, \ldots$, be a complete orthonormal system in $A \subset R$. Let us consider the set $\bar{A} = \prod_{i=1}^{p} A$, where $p$ is a positive integer greater than 1. Every $x \in \bar{A}$ is a $p$-dimensional vector of the form $x = [x^{(1)}, \ldots, x^{(p)}]$. Then the system composed of all possible products $$\Psi_{j_1, \ldots, j_p}(x^{(1)}, \ldots, x^{(p)}) = g_{j_1}(x^{(1)}) \cdots g_{j_p}(x^{(p)}),$$ for $j_k = 0, 1, 2, \ldots$ and $k = 1, \ldots, p$, is a complete orthonormal system in $\bar{A}$.
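
A minimal sketch of this product construction, assuming any univariate system `basis_1d(x, J)` returning $[g_0(x), \ldots, g_J(x)]$ (the function names are ours):

```python
import numpy as np
from itertools import product

def tensor_system(x, basis_1d, J):
    """Values of the product system
    Psi_{j1,...,jp}(x) = g_{j1}(x^(1)) * ... * g_{jp}(x^(p))
    for all multi-indices with 0 <= j_k <= J."""
    g = np.vstack([basis_1d(xk, J) for xk in x])  # g[k, j] = g_j(x^(k))
    return {js: np.prod([g[k, j] for k, j in enumerate(js)])
            for js in product(range(J + 1), repeat=len(x))}
```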

Let $X_1, X_2, \ldots$ be a sequence of independent $p$…

An alternative implementation of the algorithm to deal with stream data

The algorithms presented in the previous sections allow us to track various drifts at a given point $x$. In order to obtain an estimate of $\rho(x)$ in model (1) or $\phi_n(x)$ in model (2), we should perform the recursions for a dense mesh of points $x$. Fortunately, owing to the splitting of variables in kernels (6) and (7), which is not possible with the Parzen kernel (4), we can present algorithms (14) and (31) in a form allowing us to determine the values of the estimates at any point $x \in A$. It can be easily checked that formulas (72) and …
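
In other words, since the kernels factorize, the running averages over the stream can be pushed inside the sums and maintained as a fixed set of per-coefficient statistics, from which the estimate is computable at any point $x$. The sketch below illustrates this idea with a fixed truncation $J$; it is a simplification of the coefficient recursions of the kind referenced as (80) and (81), and all names are ours.

```python
import numpy as np

class CoefficientStreamEstimator:
    """Stores only J+1 coefficient pairs instead of the data stream:
        a_j ~ (1/n) sum_i Y_i g_j(X_i)   and   b_j ~ (1/n) sum_i g_j(X_i),
    so that estimate(x) = sum_j a_j g_j(x) / sum_j b_j g_j(x) can be
    evaluated at ANY x without revisiting past observations."""

    def __init__(self, basis, J):
        self.basis, self.J = basis, J
        self.a = np.zeros(J + 1)
        self.b = np.zeros(J + 1)
        self.n = 0

    def update(self, x_n, y_n):
        g = self.basis(x_n, self.J)
        self.n += 1
        self.a += (y_n * g - self.a) / self.n
        self.b += (g - self.b) / self.n

    def estimate(self, x):
        g = self.basis(x, self.J)
        den = self.b @ g
        return (self.a @ g) / den if abs(den) > 1e-8 else 0.0
```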

Experimental results

In this section we present the experimental results obtained for the two considered systems. We investigate the performance of the proposed algorithms using kernels (6) and (7) based on the Hermite and Laguerre orthogonal systems. The Hermite orthonormal system is defined on $(-\infty, \infty)$ and can be generated recursively as follows: $$g_0(x) = \pi^{-1/4} \exp\left(-\frac{x^2}{2}\right),$$ $$g_1(x) = 2^{1/2} x\, g_0(x),$$ $$g_{j+1}(x) = \sqrt{2/(j+1)}\, x\, g_j(x) - \sqrt{j/(j+1)}\, g_{j-1}(x),$$ for $j = 1, 2, \ldots$.
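
This recursion translates directly into code (a straightforward transcription; only the function name is ours):

```python
import numpy as np

def hermite_basis(x, J):
    """Orthonormal Hermite functions g_0, ..., g_J on (-inf, inf),
    generated by the recursion above."""
    g = np.empty(J + 1)
    g[0] = np.pi ** (-0.25) * np.exp(-0.5 * x * x)
    if J >= 1:
        g[1] = np.sqrt(2.0) * x * g[0]
    for j in range(1, J):
        g[j + 1] = np.sqrt(2.0 / (j + 1)) * x * g[j] - np.sqrt(j / (j + 1)) * g[j - 1]
    return g
```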

The Laguerre orthonormal system, defined on $(0, \infty)$, can be generated in a …

Conclusions

In this paper, we investigated a nonparametric method for solving the problem of learning in a non-stationary environment. We proposed algorithms based on orthogonal series kernels. The properties of the described methods were investigated, and convergence in probability and with probability one was proved. Our algorithms do not require storing the incoming data elements, which is a crucial feature of data stream methods, but only a certain number of coefficients (80) and (81) (see Section 5).

Acknowledgment

This work was supported by the Polish National Science Centre under Grant No. 2014/15/B/ST7/05264.

References (52)

  • X. Wang et al.

    Online sequential extreme learning machine with kernels for nonstationary time series prediction

    Neurocomputing

    (2014)
  • J.P. Nobrega et al.

    Kalman filter-based method for online sequential extreme learning machine for regression problems

    Eng. Appl. Artif. Intell.

    (2015)
  • E. Ikonomovska et al.

    Online tree-based ensembles and option trees for regression on evolving data streams

    Neurocomputing

    (2015)
  • T. Zhang et al.

    Model detection for functional polynomial regression

    Comput. Stat. Data Anal.

    (2014)
  • Y.-H. Chen et al.

    A frequency domain test for detecting nonstationary time series

    Comput. Stat. Data Anal.

    (2014)
  • A. Shelef

    A Gini-based unit root test

    Comput. Stat. Data Anal.

    (2016)
  • H.D. Jennings et al.

    Variance fluctuations in nonstationary time series: a comparative study of music genres

    Phys. A

    (2004)
  • K.F.K. Wong et al.

    Modelling non-stationary variance in EEG time series by state space GARCH model

    Comput. Biol. Med.

    (2006)
  • D.F. Specht

    Probabilistic neural networks

    Neural Netw.

    (1990)
  • G. Ditzler et al.

    Learning in nonstationary environments: a survey

    IEEE Comput. Intell. Mag.

    (2015)
  • R. Elwell et al.

    Incremental learning of concept drift in nonstationary environments

    IEEE Trans. Neural Netw.

    (2011)
  • M. Jaworski et al.

    New splitting criteria for decision trees in stationary data streams

    IEEE Trans. Neural Netw. Learn Syst.

    (2017)
  • L. Rutkowski et al.

    Decision trees for mining data streams based on the McDiarmid’s bound

    IEEE Trans. Knowl. Data Eng.

    (2013)
  • L. Rutkowski et al.

    Decision trees for mining data streams based on the Gaussian approximation

    IEEE Trans. Knowl. Data Eng.

    (2014)
  • L. Rutkowski et al.

    A new method for data stream mining based on the misclassification error

    IEEE Trans. Neural Netw. Learn. Syst.

    (2015)
  • L. Csató et al.

    Sparse on-line Gaussian processes

    Neural Comput.

    (2002)