Neural Networks
Volume 13, Issue 10, December 2000, Pages 1107-1133

Contributed article

Modular neural networks for non-linearity recovering by the Haar approximation

https://doi.org/10.1016/S0893-6080(00)00055-1

Abstract

The paper deals with the design of a composite neural system for recovering non-linear characteristics from random input–output measurement data. It is assumed that the non-linearity output measurements are corrupted by an additive zero-mean white random noise and that the input excitation is an i.i.d. random sequence with an arbitrary (and unknown) probability density function. A class of modular networks is developed. The class is based on the Haar approximation of functions by piecewise constant functions on a refinable grid and consists of networks composed of perceptron-like modules connected in parallel. The networks provide local mean value estimators of functions. The relationship between the complexity and the accuracy of modular networks is analysed. It is shown that under mild conditions on the non-linearities and input probability density functions the networks yield pointwise consistent estimates of non-linear characteristics, provided that the complexity of the networks grows appropriately with the number of training data. The efficiency of the networks is examined and the asymptotic rate of convergence of the network estimates is established. Specifically, the local ability of the networks to recover non-linear characteristics, depending on the local smoothness of the underlying non-linear function and the input probability density, is discussed. Optimum complexity selection rules, guaranteeing the best performance of the networks, are given. Illustrative simulation examples are provided.

Introduction

Among the many applications of neural networks, the problem of recovering non-linear characteristics of physical phenomena from measurement data points has attracted much attention in the last decade. This interest stems from the fact that the need to find an exact or approximate model of a physical system arises frequently in many engineering applications (e.g. in robotics, signal processing, automatic control) and that an on-line solution to the problem is often desired, i.e. truly fast modelling tools are required. Moreover, in many practical situations prior knowledge of a possibly non-linear characteristic is poor and no well-grounded hypothesis concerning its functional form can be formulated. Very often the true characteristic is known only at some sample points, recorded in an identification experiment, but the original function (or at least a satisfactory approximation to it) needs to be derived at all points, including an unseen set of inputs. Thus a kind of generalization is required. Owing to the widely recognized approximation and generalization capabilities of neural networks, as well as their fast operation, a rational approach to solving such problems is to apply an analogue artificial neural network.

Following the seminal results of Cybenko, 1989, Hornik et al., 1989, Park and Sandberg, 1991, Barron, 1993, among others, extensive research has been done on employing neural networks for recovering non-linear characteristics, and various kinds of architectures have been proposed and investigated. Most existing neural architectures can be placed into one of three categories:

  1. Multilayer Perceptrons (MLP) and sigmoidal networks (Cybenko, 1989, Hornik et al., 1989, Hornik et al., 1990, Ito, 1991, Hornik, 1991, Barron, 1993, Barron, 1994, for instance);

  2. Radial Basis Function (RBF) networks (Bishop, 1991, Park and Sandberg, 1991, Park and Sandberg, 1993, Leonard et al., 1992, Elanayar and Shin, 1994, Chen and Chen, 1995, Gorinevsky, 1995, Krzyzak et al., 1996, for example); or

  3. Wavelet networks (Zhang and Benveniste, 1992, Pati and Krishnaprasad, 1993, Delyon et al., 1995, Zhang et al., 1995, Zhang, 1997).

An early approach (the first group of networks) relies on step or sigmoidal activation functions and leads to rather complicated multilayer structures. The sigmoidal networks use a quite explicit parametric representation of functions, where parametric function models are built up as a linear (weighted) combination of sigmoids with adjustable activation parameters (scale and translation factors). Both the weights and the activation parameters need training, which leads to highly non-linear parametric optimization tasks. The popular method of backpropagation can take a large number of iterations to converge and can converge to local minima instead of finding the global minimum of the approximation error. Though a number of modifications of backpropagation and many new training algorithms have been proposed to overcome this problem (see, e.g. Azmi and Liou, 1993, Verma, 1997), fast and guaranteed training of sigmoidal networks is still an open question (Jones, 1997).
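To make this parametric form concrete, here is a minimal sketch (our own illustrative code, not from the paper) of a one-hidden-layer model built as a weighted combination of sigmoids; all names and shapes are hypothetical:

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoidal) activation."""
    return 1.0 / (1.0 + np.exp(-t))

def sigmoidal_net(x, w, a, b):
    """f(x) = sum_i w_i * sigmoid(a_i * x + b_i).
    The output weights w enter linearly, but the activation parameters
    (scales a and translations b) enter non-linearly, which is why
    training requires non-linear optimization such as backpropagation."""
    x = np.asarray(x, dtype=float)
    return sigmoid(np.outer(x, a) + b) @ w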

An alternative strategy, using radial basis functions (the second class of networks), results in simpler (one hidden layer) and more flexible architectures with better approximation capability, which has been demonstrated in a number of papers (see, for instance, Jackson, 1988, Sandberg, 1991, Powell, 1992 and the papers cited above). Unlike the former, the RBF networks provide linear-in-the-parameters approximations of functions (once the basis function centroids and widths have been established), and consequently linear optimization techniques (e.g. linear least squares; Chen, Cowan, & Grant, 1991) can be implemented for training, which significantly simplifies training procedures. However, the performance of the RBF networks critically depends on the selection of the centroids and widths of the RBFs, and this selection is in turn a rather delicate and non-trivial problem (see Xu et al., 1994, Krzyzak et al., 1996 and the references therein). Despite many efforts (see the cited papers), simple and efficient means for training the RBF networks are still being sought (e.g. Kaminski & Strumillo, 1997).
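As an illustration of this linear-in-the-parameters property, a sketch under our own assumptions (Gaussian basis functions, centroids on a uniform grid, a common width, synthetic data):

```python
import numpy as np

def rbf_design(x, centers, width):
    """Design matrix of Gaussian radial basis functions with fixed
    centroids and a common width."""
    x = np.asarray(x, dtype=float)
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

# Hypothetical training data (x_k, y_k).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(200)

# Once centroids and widths are fixed, the model is linear in the weights,
# so ordinary linear least squares suffices -- no iterative non-linear search.
centers = np.linspace(0.0, 1.0, 10)
Phi = rbf_design(x, centers, width=0.1)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```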

Recently, the good localization (zoom-in) properties and parsimony of wavelet representations, recognized within multiresolution and wavelet theory (see e.g. the fundamental monographs by Chui, 1992a, Chui, 1992b, Daubechies, 1992 or Walter, 1994), resulted in the development of wavelet neural networks, particularly recommended for exploring fine details in highly non-linear characteristics. Although the wavelet networks have been introduced as a new tool for the approximation of non-linear functions (Zhang & Benveniste, 1992), they inherit some of the standard disadvantages of sigmoidal networks, with the difference that a sigmoid activation function with tuneable parameters is replaced by a mother wavelet with adjustable scale and shift (translation) factors. In particular, determination of the wavelet network parameters (i.e. synaptic weights, scales and translations of the wavelet activation functions) also requires solving highly non-linear parametric optimization tasks, which are rather complicated even when the problem can be reduced to convex optimization (Pati & Krishnaprasad, 1993). In spite of the fact that specialized techniques, considerably reducing the complexity of the parameter search, have been developed (e.g. the Monte Carlo approach in Delyon et al., 1995 or the least squares algorithms in Pati and Krishnaprasad, 1993, Zhang et al., 1995, Zhang, 1997), the wavelet networks generally lack fast training routines and training of such nets is still a hard problem.
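For comparison with the sigmoidal case, a minimal sketch of the wavelet-network parameterization (our choice of the "Mexican hat" mother wavelet is purely illustrative; the cited papers use various mother wavelets):

```python
import numpy as np

def mexican_hat(t):
    """'Mexican hat' mother wavelet (negative second derivative of a Gaussian)."""
    return (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def wavelet_net(x, w, s, t):
    """f(x) = sum_i w_i * psi((x - t_i) / s_i).
    As with sigmoids, the scales s and translations t enter the model
    non-linearly, so training is again a non-linear optimization task."""
    x = np.asarray(x, dtype=float)
    return mexican_hat((x[:, None] - t[None, :]) / s[None, :]) @ w
```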

The power of wavelet neural networks is typically attributed to their ability to closely approximate more general, irregular, non-linear characteristics in localized regions (see the references above). Thus the local, pointwise efficiency of the wavelet networks, which touches the very essence of the wavelet bases, should be of particular interest. Unfortunately, so far only the global approximation properties (in the sense of the L2 and sup-norm errors) of the wavelet networks have been investigated, and the associated global approximation rates have been established (Delyon et al., 1995, Zhang et al., 1995).

In this paper, we propose and analyse a class of networks for non-linearity recovering, belonging to the intersection of the first and third of the categories distinguished above. The proposed networks are perceptron-based architectures originating from the Haar wavelet analysis. This simple formal ancestor yields networks of composite modular structure, where the problem of achieving a high resolution of training data (in order to follow fine local details in the shape of a target non-linearity) is decomposed into a number of less demanding tasks of achieving lower resolution ability by the component modules (subnetworks). These modules (of, to some extent, arbitrary complexity and ‘precision’) are connected in parallel, which results in a simple, flexible, and easily expandable structure of the whole network. Such a structure, composed of standard units, can be attractive from the viewpoint of hardware realization, the more so as the building blocks possess a simple perceptron-like set-up (with a step activation function). Training of the proposed networks is an easy, one-pass process which does not involve any parametric optimization techniques.

The problem of implementing such modular networks to discover non-linear characteristics from a set of input–output training data is considered here in a stochastic framework. We assume that output measurements are corrupted by an additive zero-mean white random noise (as in Delyon et al., 1995, Zhang, 1997) and that the non-linear characteristic is driven by a random i.i.d. input sequence possessing a probability density function. In contrast with Delyon et al., 1995 and Zhang et al., 1995, we do not require the input data to be uniformly distributed. We show that under moderate requirements concerning the unknown non-linearities and input probability density functions, our networks successfully recover non-linear characteristics, i.e. yield their pointwise consistent estimates, provided that the complexity of the networks (data resolution ability) grows in an appropriate manner with the number of training data. In the main part of the paper the considerations refer to the memoryless observation model (static system). In remarks, we briefly discuss in parallel the non-linearity recovering problem for the Hammerstein system, where the non-linearity output is transformed by a linear output dynamics before measurement. This is because such systems occur in many important applications in various areas such as biocybernetics (Hunter & Korenberg, 1986), automatic control (Vörös, 1999) or industrial engineering (Eskinat, Johnson, & Luyben, 1991), among others.
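For readers unfamiliar with this model, a minimal data-generating sketch of a Hammerstein system in its common FIR form (our notation and impulse response; the paper's own Eq. (5) is not reproduced here):

```python
import numpy as np

def hammerstein(x, R, gamma, noise_std, rng):
    """y_k = sum_j gamma_j * R(x_{k-j}) + z_k: the static non-linearity R
    acts first, its (unobserved) output passes through a linear FIR
    dynamics with impulse response gamma, and white noise is added."""
    w = R(np.asarray(x, dtype=float))      # non-linearity output (hidden)
    y = np.convolve(w, gamma)[: len(w)]    # linear output dynamics
    return y + noise_std * rng.standard_normal(len(w))
```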

The paper outline is as follows. In Section 2, the problem of non-linearity recovering is stated and the underlying assumptions are collected. Then, in Section 3, the Haar approximation of functions, providing the theoretical background for the operation of the networks, is briefly reviewed. The basic neural architecture for recovering non-linear characteristics is presented in Section 4, along with its motivation, the training algorithm, and the relation of the network outcome to the Haar approximation. In this section we also examine the accuracy of the network estimate and the relationship between the complexity of the network, the number of training data and the corresponding approximation and estimation errors (bias and variance of the network estimate). As a result of these studies, we give conditions for the weak pointwise consistency of the network estimate. The conditions are distribution-free, i.e. they do not rely on any specific probability distribution of the input sequence. A class of modular networks is introduced in Section 5. In Section 6, we consider the efficiency of modular networks and establish the asymptotic rate of convergence of the network estimate to the target non-linearity, as determined by the local smoothness of both the recovered non-linearity and the input probability density function. It is shown that the asymptotic rate of convergence can be optimized by a proper selection of the network complexity. General guidelines for selecting the size of modular networks, for large and moderate numbers of training data, are given in Section 7. Section 8 presents the results of computer simulations. Conclusions in Section 9 complete the paper.

Section snippets

Non-linearity recovering problem

We consider the problem of recovering a non-linear characteristic R(x) from the empirical input–output measurement (training) data {(xk,yk); k=1,2,…,N} in a stochastic environment. Basically, we focus on the standard task, where the scalar input–output observations (xk,yk) are generated according to the equation

yk = R(xk) + zk,    (1)

where zk is an additive zero-mean white random measurement noise. The following assumptions are imposed on the problem:

Assumption 1: The input process {xk; k=…,−1,0,1,2,…} is a sequence of independent and identically distributed (i.i.d.) random variables, possessing an (unknown) probability density function f(x).
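A minimal sketch of data generation under Eq. (1) and Assumption 1 (the target non-linearity, input density and noise level below are our own illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
R = lambda x: np.sin(2.0 * np.pi * x) + 0.5 * x  # hypothetical target R(x)
x = rng.beta(2.0, 2.0, N)   # i.i.d. inputs with a non-uniform density on [0, 1)
z = 0.1 * rng.standard_normal(N)                 # zero-mean white noise z_k
y = R(x) + z                                     # observations per Eq. (1)
```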

The Haar approximation

For completeness and ease of reference we present here the basic facts from the Haar wavelet approximation theory relevant to our considerations. Detailed treatment of this theory can be found, e.g. in Daubechies (1992), Ogden (1997) or Mallat (1998).

Let

φ(x) = I[0,1)(x) = 1(x) − 1(x−1),

where I[a,b)(x) is the indicator function of [a,b) and 1(x) is the perceptron (step) activation function, i.e. we have φ(x)=1 if x∈[0,1) and 0 otherwise. Assume that m≥0 is an integer and consider the functions

φmn(x) = 2^(m/2)φ(2^m x − n), n = 0, 1, …, 2^m − 1.
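These definitions translate directly into code; a short sketch of φ and φmn built from the step activation 1(x):

```python
import numpy as np

def step(x):
    """Perceptron (step) activation 1(x): 1 for x >= 0, else 0."""
    return (np.asarray(x, dtype=float) >= 0.0).astype(float)

def phi(x):
    """phi(x) = 1(x) - 1(x - 1), the indicator of [0, 1)."""
    return step(x) - step(x - 1.0)

def phi_mn(x, m, n):
    """Haar scaling function phi_mn(x) = 2**(m/2) * phi(2**m * x - n),
    supported on the dyadic cell [n / 2**m, (n + 1) / 2**m)."""
    return 2.0 ** (m / 2.0) * phi(2.0 ** m * np.asarray(x, dtype=float) - n)
```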

Basic network

The motivation of the network structure presented in this section follows from the fact that the theoretical mean value (expectation) in Eq. (2) (respectively, Eq. (8)) can be estimated by the empirical (sample) mean computed from the output observations yk for the xk lying in a neighbourhood of x. Such an intuitive idea is commonly used in non-parametric estimation of functions (see e.g. Prakasa Rao, 1983, Eubank, 1988 or Härdle, 1990). On the other hand, computation of a local mean value of a …
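A sketch of this local-averaging idea on the dyadic grid of scale m, i.e. a piecewise-constant, histogram-type estimate (our reading of the construction, not the paper's exact network equations):

```python
import numpy as np

def local_mean_estimate(x_train, y_train, x_query, m):
    """Estimate R(x) at each query point by the sample mean of those y_k
    whose x_k fall in the same dyadic cell [n/2**m, (n+1)/2**m) as x;
    the resulting estimate is piecewise constant on [0, 1)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    cells = np.floor(x_train * 2 ** m).astype(int)
    est = np.full(x_query.shape, np.nan)
    for i, xq in enumerate(x_query):
        hits = y_train[cells == int(np.floor(xq * 2 ** m))]
        if hits.size:               # cells with no data are left undefined
            est[i] = hits.mean()
    return est
```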

Modular networks

We shall introduce a class of modular networks for recovering non-linear characteristics. The basic building block for these networks will be the net in Fig. 2. In this class, an arbitrarily high resolution of data (scale m) will be achieved by applying a suitable number of neural modules, instead of expanding the basic network.

For the fixed scale factor m=m0 the net in Fig. 2 works with the resolution 1/2^m0 on the interval [0,1) (i.e. it does not differentiate the inputs x∈[0,1) which belong …

Efficiency analysis

Consider the general modular network Ci(m0+m1+⋯+mi), i=1,2,…. Further on, the scale m will stand for m0+m1+⋯+mi and changing m will mean changing an arbitrary mi, i=1,2,… (a degree of freedom), except m0 (the scale factor of the basic module C0(m0)), which will be treated as fixed. The artificial scale m will be identified with the complexity of the modular network, as m is the log2-cardinality of the set of all perceptron neurons in the modular net. We shall check how the …

Complexity selection

As was established in Section 6, the efficiency of the modular network and the optimum network complexity depend on the local smoothness of the underlying non-linearity R(x) and the input probability density function f(x) around the particular network input. Nevertheless, taking into account Corollaries 1–4, one can recognize the law

m(N) = (1/3) log2 N

as a satisfactorily general rule for selecting the complexity (artificial scale m=m(N)) of the modular network, with a relatively wide range of applicability. This …
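In code the rule is a one-liner (the rounding to an integer scale is our own addition; the paper's full rule in Eq. (67) additionally involves a constant C, not shown here):

```python
import numpy as np

def scale_for(N):
    """Complexity selection m(N) = (1/3) * log2(N), rounded to an integer."""
    return max(0, round(np.log2(N) / 3.0))

# e.g. N = 1000 training pairs -> m = 3, i.e. 2**3 = 8 dyadic cells on [0, 1)
print(scale_for(1000))
```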

Simulation study

Here, we present the results of computer simulations to illustrate the performance of the networks for a finite number N of training patterns and to provide some empirical indications as to the choice of the constant C in the network complexity selection rule in Eq. (67). We confine our presentation to the measurement model in Eq. (1). The situation when the training data come from the Hammerstein system (Eq. (5)), being in principle the same from the viewpoint of the experimental results, is briefly …
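A self-contained sketch of such an experiment for the model in Eq. (1) (the target function, noise level, sample size and error measure are our own choices, not the paper's examples):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000
R = lambda x: np.sin(2.0 * np.pi * x) + 0.5 * x     # hypothetical non-linearity
x = rng.uniform(0.0, 1.0, N)                        # training inputs
y = R(x) + 0.1 * rng.standard_normal(N)             # noisy outputs, Eq. (1)

m = max(0, round(np.log2(N) / 3.0))                 # scale via m(N) = (1/3) log2 N
cells = np.floor(x * 2 ** m).astype(int)
xq = np.linspace(0.0, 1.0, 200, endpoint=False)     # query grid
# With N = 2000 and 2**m = 16 cells, every cell contains data almost surely.
est = np.array([y[cells == int(np.floor(v * 2 ** m))].mean() for v in xq])
print(f"m = {m}, empirical MSE = {np.mean((est - R(xq)) ** 2):.4f}")
```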

Conclusions

We have proposed and examined a class of modular networks for recovering non-linear characteristics from random noisy measurements. The networks are based on the Haar approximation of functions. Each network in the class is a parallel connection of a number of modules (sub-networks) which work autonomously. The networks are self-similar: they possess the same general architecture, and more complex nets repeat the structure of the modules of which they are composed (Fig. 3a–c). It has been …

Acknowledgements

The author wishes to thank the reviewers for their helpful comments and suggestions. He also thanks M.Sc. P. Sliwinski for his assistance in preparing the numerical examples.

References (56)

  • C. Bishop, Improving the generalization properties of radial basis function neural networks, Neural Computation (1991).
  • T. Chen et al., Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks, IEEE Transactions on Neural Networks (1995).
  • S. Chen et al., Orthogonal least squares learning algorithm for radial basis function networks, IEEE Transactions on Neural Networks (1991).
  • C.K. Chui, An introduction to wavelets (1992).
  • C.K. Chui, Wavelets: a tutorial in theory and applications (1992).
  • G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems (1989).
  • I. Daubechies, Ten lectures on wavelets (1992).
  • B. Delyon et al., Accuracy analysis for wavelet approximations, IEEE Transactions on Neural Networks (1995).
  • D.L. Donoho et al., Ideal spatial adaptation by wavelet shrinkage, Biometrika (1994).
  • S. Elanayar et al., Radial basis function neural network for approximation and estimation of non-linear stochastic dynamic systems, IEEE Transactions on Neural Networks (1994).
  • E. Eskinat et al., Use of Hammerstein models in identification of non-linear systems, AIChE Journal (1991).
  • R.L. Eubank, Spline smoothing and nonparametric regression (1988).
  • D. Gorinevsky, On the persistency of excitation in radial basis function network identification of non-linear systems, IEEE Transactions on Neural Networks (1995).
  • W. Greblicki, Nonparametric orthogonal series identification of Hammerstein systems, International Journal of Systems Science (1989).
  • W. Greblicki, Nonparametric identification of Wiener systems by orthogonal series, IEEE Transactions on Automatic Control (1994).
  • W. Greblicki et al., Fourier and Hermite series estimates of regression functions, Annals of the Institute of Statistical Mathematics (1985).
  • W. Greblicki et al., Identification of discrete Hammerstein systems using kernel regression estimates, IEEE Transactions on Automatic Control (1986).
  • W. Greblicki et al., Hammerstein system identification by non-parametric regression estimation, International Journal of Control (1987).