Marginal maximum likelihood estimation of SAR models with missing data

https://doi.org/10.1016/j.csda.2017.11.004Get rights and content

Abstract

Maximum likelihood (ML) estimation of simultaneous autocorrelation models is well known. Under the presence of missing data, estimation is not straightforward, due to the implied dependence of all units. The EM algorithm is the standard approach to accomplish ML estimation in this case. An alternative approach is considered, the method of maximising the marginal likelihood. At first glance the method is computationally complex due to inversion of large matrices that are of the same size as the complete data, but these can be avoided, leading to an algorithm that is usually much faster than the EM algorithm and without typical EM convergence issues. Another approximate method is also proposed that serves as an alternative, for example when the contiguity matrix is dense. The methods are illustrated using a well known data set on house prices with 25,357 units.

Introduction

Simultaneous autoregressive models (SAR) are popular linear regression models for spatially distributed data that take the dependence of the response variable of neighbouring units into account. Maximum likelihood (ML) estimation for SAR models is well established (Ord, 1975). The dependence of neighbouring units is represented by the contiguity matrix W, which has non-zero entries Wij if i is close to j. Various choices of how to define W are available. The W of size n×n matrix refers to the n units of interest. If complete data are observed for a response variable Y of interest, then ML estimation can be accomplished relatively effectively following the methods proposed by Ord (1975). However when missing data are present, i.e. the response variable Y is only observed for a subset of ns units with ns<n, estimation becomes more difficult as the contiguity matrix refers to n units while Y refers to ns units. ML estimation for this situation has been investigated by Lesage and Pace (2004) using the EM algorithm.

Kato (2013) considered estimation and prediction of several spatial covariance models, including SAR models, using the EM algorithm and also a quasi-likelihood method for estimation. Goulard et al. (2017) used the EM algorithm to investigate the performance of various in-sample and out-of-sample predictors, but also used an ML algorithm. Computational improvements of the EM algorithm based on Lesage and Pace (2004) have been considered by Suesse and Zammit Mangion (2017).

Other estimation methods for SAR models under the presence of missing data have been extensively studied using, for example, the generalised method of moments and least squares (Wang and Lee, 2013) and approximate Bayesian methods using Integrated Laplace (INLA) approximation Bivand et al. (2014), Gomez-Rubio et al. (2015), Gomez-Rubio et al. (2017).

In this paper I follow a different approach for ML estimation, by maximising the marginal likelihood directly instead of maximising it indirectly via the EM algorithm, leading to a generally much faster algorithm by applying some computational tricks. The proposed method is a true marginal ML method and requires out-of-sample information, which is different from the ML methods used by Kato (2013) and Goulard et al. (2017) that only use in-sample information. The proposed algorithm also does not suffer from convergence problems, that are common and well known for the EM algorithm (McLachlan and Krishnan, 2008), in particular when the sample size is very small.

Both the EM algorithm and the marginal ML approach generally lead to the same ML estimates, provided convergence to a global maximum has been achieved. In analogy Laird and Ware (1982) considered ML estimation via the EM algorithm for linear mixed models, whereas Lindstrom and Bates (1988) considered maximising the marginal likelihood. Both are also common techniques for restricted ML estimation, a method to achieve unbiased variance estimates.

In Section 2, the spatial models under consideration are introduced. Then in Section 3 standard ML estimation of complete data is reviewed. In Section 4 I introduce an exact method of maximising the marginal likelihood and an alternative approximate method. Section 4 presents a simulation study to investigate the performance of the various methods and Section 5 applies the methods to a well known data set on house prices by also illustrating typical computation times and convergence issues of the EM algorithm for small sampling ratios. The article concludes with a discussion.

Section snippets

SAR models

There are two main SAR models, one where the spatial dependence is directly incorporated into the equation for the response variable, in the following referred to as spatial autoregressive model (SAM) model, and the model where the spatial dependence is incorporated into the error term, the spatial errors model (SEM), using the convention of Lesage and Pace (2004).

Let y=(y1,,yn) be the n-vector of the response variable, X be the n×p design matrix containing the explanatory variables and W be

Maximum likelihood estimation with complete data

In the following it is assumed that the complete response vector y(ys,yu) is not fully observed, because yu is unobserved and assumed to be missing at random (Little and Rubin, 2002), and only ys is available.

The marginal likelihood of ys under this situation can be obtained by integrating over the unobserved data by f(ys;θ)=f(y;θ)dyu,where f(y) is the density of the complete data and θ=(β,ρ,σ2) contains the unknown parameters. Lesage and Pace (2004) circumvented dealing with the

Marginal log-likelihood

While integration in Eq. (4) may appear difficult, the distribution of y is multivariate normal, consequently ys is also multivariate normal with mean μs and variance ωVss. For example this follows from the general well-known result: when yN(μ,Σ), then AyN(Aμ,AΣA) for any full rank matrix A. When A is a matrix of ones and zeros, such that ys=Ay, then the result ysN(μs,ωVss) follows.

This result gives immediately the marginal log-likelihood by replacing V by Vss, μ by μs, y by ys and n by ns

Simulation study

In theory the EM algorithm and the marginal ML method produce the same estimates. However often non-convergence of the EM algorithm can yield different estimates. When an underlying optimisation problem has multiple local maxima, then using local optimisers, such as Newton–Raphson, to find the global maximum can also cause issues with the marginal ML method. A simulation study is conducted in order (i) to demonstrate that the EM algorithm and the proposed marginal ML method are equivalent by

Example

The Lucas County (Ohio, USA) housing data consist of n=25,357 observations of single family homes sold in the period 1993–1998. The data set is part of the R package spdep (Bivand, 2017) and are fully described in the Spatial Econometrics toolbox for Matlab, see http://www.spatial-econometrics.com/html/jplv7.zip. The data have been used by Bivand (2010) to compare several software packages for fitting spatial regression models. Bivand (2010) used log houseprice (log(price)) as the response

Conclusion

I presented an alternative method for obtaining ML estimates of SAR models under the presence of missing data using the direct approach of maximising the marginal likelihood instead of its indirect maximisation via the EM algorithm. The method is worthwhile applying when W is sparse and when n is not too large, and as long as a Cholesky factorisation of the nu×nu matrix Muu can be calculated quickly. The method is attractive because it is widely known that the EM algorithm may converge slowly

Acknowledgements

I would like to thank NIASRA for providing and in particular Clint Shumack for guidance of the high performance cluster to run the simulation and to fit the large data set. I am also grateful to the referees for their valuable feedback that greatly improved the paper.

References (26)

  • HarvilleD.

    Matrix Algebra From a Statistician’s Perspective

    (1997)
  • KatoT.

    Usefulness of the information contained in the prediction sample for the spatial error model

    J. Real Estate Finance Econom.

    (2013)
  • LairdN.M. et al.

    Random-effects models for longitudinal data

    Biometrics

    (1982)
  • Cited by (13)

    • IPW-based robust estimation of the SAR model with missing data

      2021, Statistics and Probability Letters
      Citation Excerpt :

      In fact, it results in inconsistent estimators, such as the naive ordinary least square (OLS) estimator and the naive 2SLS estimator. The likelihood-based approach (Goulard et al., 2017; Suesse and Zammit-Mangion, 2017; Suesse, 2018) would also derive an inconsistent estimator under the misspecified distribution. In view of robustness, Wang and Lee (2013) present the generalized method of moments (GMM) estimator, the nonlinear least squares (NLS) estimator, and the 2SLS estimator with imputation (I2SLS), which all are consistent and asymptotically normally distributed.

    • Spatial Data Science: With Applications in R

      2023, Spatial Data Science: With Applications in R
    • Missing Data Estimation and Imputation Algorithm for Wireless Sensor Network Applications

      2022, 2022 International Conference on Computer Communication and Informatics, ICCCI 2022
    View all citing articles on Scopus
    View full text