An extension of the Gauss–Newton algorithm for estimation under asymmetric loss

https://doi.org/10.1016/j.csda.2004.08.007

Abstract

Estimators obtained by the use of the relevant loss function lead to forecasts with good properties when the same loss function is used to evaluate the forecasts. The provided extension of the Gauss–Newton algorithm is tailored to the associated optimization problem. Owing to an approximation of the second derivative of the loss function, it can be viewed as a succession of linear generalized least-squares regressions and is easy to implement. Smoothing loss functions that do not possess derivatives is shown to be asymptotically valid. The extension performs well compared to the Newton (with exact Hessian) and BFGS algorithms in a Monte Carlo study employing different loss functions and several autoregressive models.

Introduction

The importance of minimum mean-squared error as an optimality criterion in forecasting, which implies the use of the conditional mean as the optimal forecast, is increasingly being challenged by the use of alternative loss functions in evaluating predictions. Sets of sufficient conditions under which the conditional mean delivers optimal predictions for other loss functions are given, for instance, in Granger (1999), but these are rather the exception. In addition, the way the assumed model is estimated may play a decisive role in the performance of the forecasting procedure, as pointed out by Weiss and Andersen (1984). As a solution, Granger (1969) and Weiss (1996) suggest that the estimation of the model parameters should be done using the same loss function as in forecasting; consequently, one has to minimize an objective function based on that loss function.

Analytic solutions to this minimization problem rarely exist, so the minimization is handled by a numerical method. Naturally, each method has its advantages and disadvantages. For a survey of optimization methods used in econometrics, see Davidson (2000, Section 9.2) or Judge et al. (1985, Appendix B).

Used for fitting nonlinear least-squares regressions, the Gauss–Newton (GN) algorithm is a Newton-like method for which the Hessian matrix of the objective function is approximated in such a way that the optimization procedure can be interpreted as a succession of linear regressions; furthermore, the approximated Hessian is always positive definite, making the GN method a popular choice in empirical work. Wedderburn (1974) points out that the GN algorithm also delivers maximum-likelihood (ML) estimates if innovations are assumed Gaussian. The properties of nonlinear least-squares estimators are reviewed, among others, by Mittelhammer et al. (2000, Section 8).
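For reference, the classical GN step for the nonlinear least-squares objective can be summarized as follows; the notation ($J_n$, $u_n$) is ours and not taken from the paper:
$$Q(\theta)=\sum_{t=1}^{T}\bigl(y_t-f(x_t;\theta)\bigr)^{2},\qquad \theta_{n+1}=\theta_n+\bigl(J_n^{\top}J_n\bigr)^{-1}J_n^{\top}u_n,$$
where $J_n$ is the $T\times K$ Jacobian with entries $\partial f(x_t;\theta_n)/\partial\theta_k$ and $u_n$ collects the residuals $y_t-f(x_t;\theta_n)$. The step equals the OLS coefficient vector from regressing $u_n$ on $J_n$, and the approximated Hessian $J_n^{\top}J_n$ is positive definite whenever $J_n$ has full column rank.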

Several extensions have been proposed for more general ML-estimation problems. Bard (1974, pp. 97–99) gives a generalization that allows for dependence structures in the innovations and for a departure from normality. Green (1984) indicates that maximization of a likelihood function can also be written as a succession of weighted linear regressions; his approach, “iteratively reweighted least squares” (IRLS), directly uses the properties of the likelihood and approximates the Hessian with the information matrix. A prominent case covered by IRLS is the BHHH algorithm (Berndt et al., 1974), for which the Hessian is approximated by the average outer product of gradients, due to the information matrix equality. More recently, an iteratively reweighted scheme was proposed by Basu and Lindsay (2004) for minimum distance estimation.
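For concreteness (in our notation, not the paper's), the BHHH approximation replaces the Hessian of the negative log-likelihood by the sum of outer products of the per-observation scores,
$$-\frac{\partial^{2}}{\partial\theta\,\partial\theta^{\top}}\sum_{t=1}^{T}\log\ell_t(\theta)\;\approx\;\sum_{t=1}^{T}g_t(\theta)\,g_t(\theta)^{\top},\qquad g_t(\theta)=\frac{\partial\log\ell_t(\theta)}{\partial\theta},$$
an approximation justified asymptotically by the information matrix equality at the true parameter value.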

It can be shown that not every loss function can be derived from a likelihood function in models with additive zero-mean innovations. Moreover, in forecasting, the loss function to be used is imposed externally by the beneficiary of the forecast rather than derived from the assumed statistical model. When such a loss function is used for inference, ML arguments (like those underlying IRLS) no longer apply. To our knowledge, there is no optimization method that accounts for the special structure of a minimum aggregated loss problem. Therefore, we give an extension of the GN algorithm for a class of more general loss functions.
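Concretely, with additive innovations the estimation problem targeted here is (in our notation)
$$\hat{\theta}=\arg\min_{\theta}\sum_{t=1}^{T}L\bigl(y_t-f(x_t;\theta)\bigr),$$
which reduces to nonlinear least squares for $L(u)=u^{2}$ but, for a general asymmetric $L$, need not correspond to the likelihood of any model with additive zero-mean innovations.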

The remainder of the paper is structured as follows: in Section 2, we describe how asymmetric loss estimation works for location and location/scale processes, with particular attention to the effects of distribution misspecification. In Section 3, while preserving the succession-of-linear-regressions interpretation of GN, an optimization procedure for the class of loss functions with continuous second derivative is given; we also show the use of approximating loss functions to be asymptotically valid, thus extending the method to loss functions that do not exhibit the desired degree of smoothness. In Section 4, the proposed method is studied for linear and nonlinear models, as well as for several different loss functions, and Section 5 concludes.

Section snippets

Estimation under asymmetric loss

Let $Y_t$, $t\in\mathbb{R}$, denote the process to be forecast. The optimal forecast, or optimal predictor, under the imposed loss function minimizes the expected loss, or risk, of the prediction of $Y$ at time $t^*$, given the information set available. Typically, the information set at time $t^*$ consists of lagged values of the process.
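In symbols (our notation), the optimal predictor $\hat{Y}_{t^*}$ under loss $L$, given the information set $\mathcal{I}_{t^*}$, solves
$$\hat{Y}_{t^*}=\arg\min_{\hat{y}}\;E\bigl[L\bigl(Y_{t^*}-\hat{y}\bigr)\,\big|\,\mathcal{I}_{t^*}\bigr],$$
so computing it exactly requires the conditional distribution of $Y_{t^*}$ given $\mathcal{I}_{t^*}$.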

In order to obtain optimal forecasts in the general case, one should model the conditional density for each t* where a forecast is desired. This, however, is not always feasible.

The extended GN method

While the previous considerations did not restrict the loss function to a particular class, this generality will not be maintained in the following discussion. For the proposed extension (EGN) to remain in the Newton family, we require the loss function to possess a continuous second derivative and the optimal predictor to possess second partial derivatives w.r.t. $\theta_1$ through $\theta_K$.

The $k$th component of the gradient $\gamma$ of the loss function in the $n$th iteration is
$$\gamma_k(\theta_n)=\sum_{t=1}^{T}\frac{dL}{du}\cdot\frac{\partial u_t}{\partial\theta_k}=-\sum_{t=1}^{T}\frac{dL}{du}\bigl(y_t-f(x_t;\theta_n)\bigr)\cdot\frac{\partial f(x_t;\theta_n)}{\partial\theta_k}$$
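The snippet does not show the Hessian approximation in full, but the natural GN-type choice, consistent with the succession-of-GLS-regressions description in the abstract, is to drop the term involving second derivatives of $f$ and keep $\sum_t L''(u_t)\,\partial f/\partial\theta\,\partial f/\partial\theta^{\top}$. A minimal Python sketch of the resulting iteration under that assumption (function names and interfaces are ours, not the paper's):

```python
import numpy as np

def egn_step(theta, y, X, f, jac, dL, d2L):
    """One step of a GN-type iteration for minimizing sum_t L(y_t - f(x_t; theta)).

    Assumed interfaces (ours, not the paper's exact implementation):
      f(X, theta)   -> fitted values, shape (T,)
      jac(X, theta) -> Jacobian of f w.r.t. theta, shape (T, K)
      dL, d2L       -> first and second derivative of the loss L(u)
    """
    u = y - f(X, theta)                 # residuals u_t
    J = jac(X, theta)                   # T x K Jacobian of the predictor
    w = d2L(u)                          # GN-type weights L''(u_t)
    # Approximated Hessian: J' diag(w) J; gradient: -J' dL(u).
    JW = J * w[:, None]
    step = np.linalg.solve(J.T @ JW, J.T @ dL(u))
    return theta + step

def egn(theta0, y, X, f, jac, dL, d2L, tol=1e-8, max_iter=200):
    """Iterate egn_step until the parameter change falls below tol."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        theta_new = egn_step(theta, y, X, f, jac, dL, d2L)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```

Each step is equivalent to a weighted least-squares regression of the working responses $L'(u_t)/L''(u_t)$ on the Jacobian with weights $L''(u_t)$; for convex losses such as linex, $L''(u)>0$, so the weights and the approximated Hessian remain positive (definite for a full-rank Jacobian).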

Simulations

For the simulation experiments in this section, we follow the outline of Example 1. All simulations are carried out by means of GAUSS, kernel rev. 5.0.25, running on a Windows XP computer with an AMD XP2600+ CPU and 3 GB of RAM.

The data is generated according to different autoregressive models, all having starting values 0. All innovations are standard normal. We use the linex loss and the double linex loss, given by
$$L_{\mathrm{dle}}(u)=e^{au}+e^{-bu}-(a-b)u,$$
for positive $a,b$. For convenience, the latter is taken
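For illustration, the linex loss in the common Varian/Zellner form and the double linex loss as displayed above, together with the derivatives needed by the iteration sketched in the previous section, could be coded as follows (the paper's exact normalisation and parameter values are not recoverable from this snippet):

```python
import numpy as np

# Linex loss in the common Varian/Zellner form L(u) = exp(a*u) - a*u - 1;
# the paper's exact normalisation is not visible in this snippet.
def linex(u, a):
    return np.exp(a * u) - a * u - 1.0

def d_linex(u, a):
    return a * (np.exp(a * u) - 1.0)

def d2_linex(u, a):
    return a * a * np.exp(a * u)

# Double linex loss as displayed above: L(u) = exp(a*u) + exp(-b*u) - (a - b)*u,
# with a, b > 0 (any additive constant is irrelevant for the minimizer).
def dlinex(u, a, b):
    return np.exp(a * u) + np.exp(-b * u) - (a - b) * u

def d_dlinex(u, a, b):
    return a * np.exp(a * u) - b * np.exp(-b * u) - (a - b)

def d2_dlinex(u, a, b):
    return a * a * np.exp(a * u) + b * b * np.exp(-b * u)
```

With the hypothetical egn routine sketched earlier, estimation under double linex loss with, say, a = 1 and b = 0.5 would read egn(theta0, y, X, f, jac, lambda u: d_dlinex(u, 1.0, 0.5), lambda u: d2_dlinex(u, 1.0, 0.5)).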

Conclusions

A brief review and some refinements of estimation under asymmetric loss are given, relating to robustness against misspecification of the innovation distribution. Turning to the optimization aspect, this paper proposes an extension of the GN algorithm for the class of loss functions with continuous second derivative. Approximation results for smoothed loss functions are derived to make this extension applicable to non-smooth loss functions. The usefulness of the proposed methods is exemplified by

Acknowledgements

I am grateful to Uwe Hassler, Adina-Ioana Tarcolea and three anonymous referees for helping improve this paper.

References (26)

  • R. Fletcher, Practical Methods of Optimization (1987)

  • C.W.J. Granger, Prediction with a generalized cost of error function, Oper. Res. Quart. (1969)

  • C.W.J. Granger, Outline of forecast theory using generalized cost functions, Spanish Econom. Rev. (1999)