Short Communication
Least median of squares and regression through the origin

https://doi.org/10.1016/j.csda.2005.01.005

Abstract

An exact algorithm is provided for finding the least median of squares (LMS) line for a bivariate regression with no intercept term. It is shown that the popular Program for RObust reGRESSion (PROGRESS) routine will not, in general, find the LMS slope when the intercept is suppressed. A Microsoft Excel workbook that provides the code in Visual Basic is made available at http://www.wabash.edu/econexcel/LMSOrigin.

Introduction

Rousseeuw (1984) introduced least median of squares (LMS) as a robust regression procedure. Instead of minimizing the sum of squared residuals, coefficients are chosen so as to minimize the median of the squared residuals. Unlike conventional least squares (LS), there is no closed-form solution with which to calculate the LMS line, since the median is an order (rank) statistic. General-purpose non-linear optimization algorithms perform poorly because the surface of the median of squared residuals is so bumpy that local minima are often incorrectly reported as the global solution.

Although a closed-form solution does not exist and brute force optimization is not reliable, several algorithms are available for fitting the LMS line (or hyperplane). Perhaps the most popular approach is called Program for RObust reGRESSion (PROGRESS). The program itself is explained in Rousseeuw and Leroy (1987) and the most recent version is available at http://www.agoras.ua.ac.be/. Several software packages, such as SAS/IML (version 6.12 or greater), have an LMS routine based on PROGRESS.

This paper focuses on the special problem of finding the LMS fitted line through the origin in the bivariate case. The next section presents the model and defines the LMS line. Section 3 shows that the PROGRESS algorithm gives an incorrect solution, in general, when the intercept is restricted to zero. Section 4 presents an analytical, exact method for finding the minimum median squared residual for the bivariate, zero intercept case. Finally, a simple example is provided to illustrate the algorithm and show why PROGRESS fails in the zero-intercept case.


The model

Suppose that observed values of $y$ are generated according to the model $y_i = \mu x_i + \varepsilon_i$. Given a realization of $n$ points $(x_i, y_i)$, the problem is to find the ‘best’ choice for the slope of a straight line that passes through the origin, $\hat{y} = m x$.

One choice for the ‘best’ line is the line that minimizes the median of the individual squared deviations, or residuals. Given this objective, one chooses the value of the slope, $m$, to minimize the median value of the squared residuals $d_i^2(m) = (y_i - \hat{y}_i)^2 = (y_i - m x_i)^2$, $i = 1, \ldots, n$.
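For a fixed candidate slope, this criterion is simple to evaluate. The sketch below is in Python rather than the paper's Visual Basic workbook, and the function name median_squared_residual is ours; it computes the median of the squared residuals for a given slope m of a line through the origin.

```python
import numpy as np

def median_squared_residual(m, x, y):
    """Median of the squared residuals (y_i - m*x_i)^2 for the line y-hat = m*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.median((y - m * x) ** 2))
```

Note that np.median averages the two middle values when the number of observations is even; with an odd sample size, as in the five-point example used later in the paper, it is the single middle order statistic.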

PROGRESS and LMS with a zero intercept

The PROGRESS algorithm is based on sampling subsets of points from the data in order to generate candidate LMS estimates. The size of each subset is determined by the number of coefficients to be estimated. For the data in Table 1, there is one coefficient to be estimated and five observations, so there are five subsets, each containing one data point. For each of the $n$ subsets, the slope is computed. Using this slope, the squared deviation of each of the data points is calculated, and the median of these squared deviations is recorded; the candidate slope with the smallest median squared deviation is reported as the PROGRESS estimate.
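A minimal Python sketch of this resampling idea for the no-intercept case is given below. It reuses the median_squared_residual criterion defined earlier; the function name is ours, and the released PROGRESS program contains refinements that are not reproduced here.

```python
def progress_slope_through_origin(x, y):
    """PROGRESS-style search with one coefficient: every single observation
    proposes the candidate slope y_i / x_i, and the candidate with the
    smallest median squared residual is returned."""
    best_m, best_crit = None, float("inf")
    for xi, yi in zip(x, y):
        if xi == 0:
            continue  # a point on the y-axis defines no finite slope through the origin
        m = yi / xi
        crit = median_squared_residual(m, x, y)
        if crit < best_crit:
            best_m, best_crit = m, crit
    return best_m, best_crit
```

Because the only candidate slopes are the ratios y_i/x_i, this search can never return a slope lying strictly between those ratios, which is why it can miss the true LMS minimum in the zero-intercept case.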

An exact algorithm for LMS with a zero intercept in the bivariate case

The central idea of the algorithm that provides an exact solution to the LMS problem is the observation that only a finite number of slopes, bounded by the number of pairs of points, can yield the minimum median squared deviation. In most cases, not all of the $O(n^2)$ candidates need to be checked to determine the optimal slope.

For a given $(x, y)$ pair, the squared deviation of the fitted line from the actual data point is given by $d^2(m) = (y - \hat{y})^2 = (y - m x)^2$. As a function of the slope, $m$, this squared deviation is a parabola that opens upward and attains its minimum of zero at $m = y/x$.
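Because the pointwise median of these parabolas is itself piecewise parabolic in $m$, its minimum must lie either at the vertex of one of the parabolas ($m = y_i/x_i$) or at a slope where two parabolas cross, i.e. where $(y_i - m x_i)^2 = (y_j - m x_j)^2$. The Python sketch below enumerates every such candidate and keeps the best one; it assumes an odd number of observations (so the median is a single order statistic) and checks all $O(n^2)$ candidates rather than pruning them as the paper's algorithm does. The function names are ours.

```python
from itertools import combinations

def exact_lms_slope_through_origin(x, y):
    """Exact LMS slope through the origin (odd n) by brute-force enumeration:
    candidates are the parabola vertices y_i/x_i and the slopes at which two
    residual parabolas intersect."""
    points = list(zip(x, y))
    candidates = set()
    for xi, yi in points:
        if xi != 0:
            candidates.add(yi / xi)  # vertex of the parabola for observation i
    for (xi, yi), (xj, yj) in combinations(points, 2):
        # (y_i - m x_i)^2 = (y_j - m x_j)^2 has up to two roots in m
        if xi != xj:
            candidates.add((yi - yj) / (xi - xj))  # residuals equal with the same sign
        if xi + xj != 0:
            candidates.add((yi + yj) / (xi + xj))  # residuals equal with opposite signs
    return min(candidates, key=lambda m: median_squared_residual(m, x, y))
```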

An example

Fig. 2 shows an example of this algorithm using the five data points from Table 1. The slope varies from $m_{\min} = 1.4$, due to observation 5, to $m_{\max} = 3$, due to observation 3.

Each parabola is labeled by the data point to which it corresponds: obs 1, obs 2, obs 3, obs 4 and obs 5. The median squared residual for a given slope, $m$, is the median, or middle, of the $y$ values of the 5 parabolas. The thick line follows the median, or 3rd, deviation in this example of 5 data points. The vertical line marks the slope at which this median curve reaches its minimum, the LMS slope.
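As a usage sketch only, the routines above can be compared on any small data set; the numbers below are illustrative stand-ins, not the five points of Table 1, which are not reproduced in this excerpt.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data, for illustration only
y = [2.1, 5.8, 9.0, 6.4, 7.0]

m_prog, crit_prog = progress_slope_through_origin(x, y)
m_lms = exact_lms_slope_through_origin(x, y)

print("PROGRESS slope:", m_prog, "median squared residual:", crit_prog)
print("Exact LMS slope:", m_lms, "median squared residual:",
      median_squared_residual(m_lms, x, y))
```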

Conclusion

When applying least median of squares, coefficients are chosen so as to minimize the median of the squared residuals. Because the median is not sensitive to extreme values, it can outperform conventional least squares when data are contaminated. This paper makes two contributions to the LMS literature:

  • (1)

    PROGRESS, the standard algorithm for fitting the LMS estimator, does not find the true LMS fit when the intercept is suppressed. Any computations based on the estimated slope (such as regression diagnostics computed from the residuals) may therefore be misleading.

  • (2)

    An exact algorithm finds the LMS slope for the bivariate, zero-intercept case; Visual Basic code implementing it is provided in a Microsoft Excel workbook available at http://www.wabash.edu/econexcel/LMSOrigin.

Acknowledgements

The authors thank Michael Axtell, Frank Howland, and anonymous referees for suggestions and criticisms.

References

  • Rousseeuw, P.J., 1984. Least median of squares regression. J. Amer. Statist. Assoc. 79, 871–880.

  • Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection. Wiley, New York.
