Alternative computational formulae for generalized linear model diagnostics: identifying influential observations with SAS software

https://doi.org/10.1016/j.csda.2004.03.013

Abstract

In generalized linear models, regression diagnostics including leverage, DFBETA and Cook's distance are commonly used to assess the influence of observations on the fit of a model. We illustrate how familiarity with the construction of common regression diagnostics formulae can lead to useful alternative formulae when the computer software of interest provides numerical values for only some of the component statistics. In particular, SAS software version 8.2 offers these diagnostics for logistic regression through PROC LOGISTIC, but PROC GENMOD does not compute them. As a result, aside from residuals, diagnostics are not directly available from SAS for many generalized linear models. This article describes how these diagnostics may be obtained indirectly with alternative computational formulae based upon observation statistics that are produced as output by PROC GENMOD. Data from the Guidelines for Urinary Incontinence Discussion and Evaluation study, a randomized controlled trial directed at assessing the impact of urinary incontinence guideline adoption by primary care providers on patient outcomes, are used to illustrate the alternative computations.

Introduction

Any statistical analysis of data should consider the role that influential observations have on the results. For linear regression, exact formulae exist for the change in regression coefficients when an observation is deleted from the data set (Belsley et al., 1980). In contrast, iterative computational routines are required to obtain a solution for most generalized linear models (GLIM) and so exact formulae for the change in regression coefficients are not available. However, simple computational formulae for single-case deletion diagnostics, such as leverage, Cook's distance and DFBETA, provide approximations to the desired quantities (Pregibon, 1981; Williams, 1987). Although these GLIM diagnostics were introduced about two decades ago, their use in practice appears to be more limited than their corresponding linear regression counterparts. One indication of this is that these diagnostics are not computed in the current version of SAS PROC GENMOD (version 8.2, SAS Institute Inc., 1999). This article introduces the use of alternative computational formulae for calculating GLIM diagnostics. The proposed formulae are implemented with SAS PROC IML using observation statistics that are output from PROC GENMOD. The paper provides one illustration of how familiarity with the construction of common regression diagnostics formulae can lead to useful alternative formulae when the computer software of interest provides numerical values for only some of the component statistics, but not for the diagnostics directly.
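To make the contrast with linear regression concrete, the following is a minimal Python sketch (not SAS, and not code from the article) of the exact single-case deletion result for ordinary least squares, written as beta_hat - beta_hat(i) = (X'X)^{-1} x_i e_i / (1 - h_i). The data values and the function names `ols_fit`, `exact_dfbeta` and `refit_dfbeta` are assumptions made purely for illustration. The model is restricted to an intercept and one covariate so that the 2×2 inverse can be written out by hand.

```python
# Illustrative sketch: exact leave-one-out change in OLS coefficients for
# simple linear regression. All data and function names are hypothetical.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares fit of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def exact_dfbeta(x, y, i):
    """Closed form: (X'X)^{-1} x_i e_i / (1 - h_i), with no refitting."""
    n = len(x)
    b0, b1 = ols_fit(x, y)
    sx = sum(x)
    sxx_raw = sum(xi * xi for xi in x)
    det = n * sxx_raw - sx * sx
    # Inverse of X'X = [[n, sx], [sx, sxx_raw]]
    inv = [[sxx_raw / det, -sx / det], [-sx / det, n / det]]
    # a = (X'X)^{-1} x_i, where x_i = (1, x[i])
    a = [inv[0][0] + inv[0][1] * x[i], inv[1][0] + inv[1][1] * x[i]]
    h = a[0] + a[1] * x[i]              # leverage h_i = x_i' (X'X)^{-1} x_i
    e = y[i] - (b0 + b1 * x[i])         # ordinary residual e_i
    return [a[0] * e / (1.0 - h), a[1] * e / (1.0 - h)]

def refit_dfbeta(x, y, i):
    """Brute force: refit without observation i and difference the fits."""
    b0, b1 = ols_fit(x, y)
    c0, c1 = ols_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    return [b0 - c0, b1 - c1]

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [1.1, 1.9, 3.2, 3.8, 5.3, 9.0]
    for i in range(len(x)):
        print(i, exact_dfbeta(x, y, i), refit_dfbeta(x, y, i))
```

In this linear setting the closed form and the refit agree up to rounding error. For a GLIM fitted iteratively, the analogous formula is only a one-step approximation to the refitted change.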

The illustration involves data (Preisser and Qaqish, 1999) from the Guidelines for Urinary Incontinence Discussion and Evaluation (GUIDE) study, a randomized controlled trial directed at assessing the impact of urinary incontinence (UI) guideline adoption by primary care providers on patient outcomes. UI, or loss of bladder control, is a disorder that affects over 13 million men and women in the United States (Dugan et al., 2001). Although our illustration of the use of PROC GENMOD to obtain the diagnostics is based upon special formulae for a generalized linear model with canonical link function, general procedures for computing diagnostics are also described.

Section snippets

Traditional computational formulae for GLIM diagnostics

A generalized linear model has the form $g(\mu_i)=\eta_i$ for $i=1,\ldots,n$ independent observations, where $\eta_i=x_i^{T}\beta$ is the linear predictor defined by a $p\times 1$ covariate vector $x_i$ and a $\beta$ parameter vector of interest. The link function, $g(\cdot)$, equates the expected value of the response $\mu_i=E(Y_i)$ with $\eta_i$. The mean, $\mu_i=b'(\theta_i)$, expressed as the first derivative of the function $b(\theta_i)$, derives from the random part of the model given by a distribution belonging to the general exponential family (McCullagh and Nelder,

Alternative computational formulae for GLIM diagnostics

Alternative expressions for $h_i$ and $\mathrm{DFBETA}_i$ facilitate their computation in SAS. Given values of $r_{pi}$ and $r'_{pi}$ (e.g., provided by SAS PROC GENMOD), leverage, $h_i$, can be determined as follows:

$$1-\frac{(r_{pi}/r'_{pi})^{2}}{\phi} = 1-\frac{1}{\phi}\left\{\frac{e_i/[v_i]^{1/2}}{e_i/[v_i\hat{\phi}(1-h_i)]^{1/2}}\right\}^{2} = 1-\left\{[1-h_i]^{1/2}\right\}^{2} = h_i.$$

Given $h_i$ and $r_{pi}$, the standard computational formula for $D_i$ given at the end of Section 2 may be applied. For any GLIM including those with non-canonical link functions, expression (6) may be calculated after first determining $w_i$ as a
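The leverage identity above can be checked numerically. The following is a minimal Python sketch (not the article's SAS code) that fits a small logistic regression by Newton-Raphson and computes leverage in two ways: directly from the weighted hat matrix, and from the ratio of the Pearson residual to the standardized Pearson residual. The data, the function names `fit_logistic` and `leverage_two_ways`, and the choice of dispersion `phi = 1` for binary data are all assumptions made for illustration.

```python
import math

# Illustrative sketch: leverage recovered from Pearson residuals in a
# logistic regression with intercept and one covariate. Data and function
# names are hypothetical; phi = 1 is assumed for a binary response.

def fit_logistic(x, y, iters=25):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        mu = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [m * (1.0 - m) for m in mu]
        # Entries of X'WX and of the score vector X'(y - mu)
        s00 = sum(w)
        s01 = sum(wi * xi for wi, xi in zip(w, x))
        s11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        det = s00 * s11 - s01 * s01
        b0 += (s11 * g0 - s01 * g1) / det
        b1 += (s00 * g1 - s01 * g0) / det
    return b0, b1

def leverage_two_ways(x, y, phi=1.0):
    """Return a list of (h_direct, h_from_residuals) per observation."""
    b0, b1 = fit_logistic(x, y)
    mu = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    v = [m * (1.0 - m) for m in mu]   # variance function = weight (canonical link)
    s00 = sum(v)
    s01 = sum(vi * xi for vi, xi in zip(v, x))
    s11 = sum(vi * xi * xi for vi, xi in zip(v, x))
    det = s00 * s11 - s01 * s01
    out = []
    for xi, yi, mi, vi in zip(x, y, mu, v):
        # Direct leverage: h_i = w_i * x_i' (X'WX)^{-1} x_i with x_i = (1, xi)
        h = vi * (s11 - 2.0 * s01 * xi + s00 * xi * xi) / det
        e = yi - mi
        rp = e / math.sqrt(vi)                       # Pearson residual
        rps = e / math.sqrt(vi * phi * (1.0 - h))    # standardized Pearson residual
        h_alt = 1.0 - (rp / rps) ** 2 / phi          # leverage from the ratio
        out.append((h, h_alt))
    return out

if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    y = [0, 0, 1, 0, 1, 0, 1, 1]
    for h, h_alt in leverage_two_ways(x, y):
        print(round(h, 6), round(h_alt, 6))
```

In practice only the second route is needed: if the software reports both residuals but not the hat values, the ratio recovers `h_i` without any matrix arithmetic. The direct computation is included here only to confirm the agreement.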

Implementation of alternative formulae in SAS

This section describes the computation of leverage, Cook's distance and DFBETA for GLIMs with canonical link function through PROC GENMOD. Since PROC GENMOD does not directly provide these diagnostics as output, PROC IML is used to perform the matrix arithmetic of expressions (10), thereby obtaining Di as well, and (11). The code provided below models the binary data from GUIDE using logistic regression. These data have been considered by Preisser and Qaqish (1999) and are available for download at

Illustration with GUIDE study data

Regression diagnostics are produced for the logistic regression model specified by the SAS code given by (12) and (13). Table 2 shows the parameter estimates for the complete data set of 137 observations and for a data set obtained by deleting observations from three patients. A few observations appear to have a large impact on the results, particularly for TOILET.

For the data set based upon all 137 observations, PROC GENMOD and PROC LOGISTIC gave identical results to three decimal places for

Discussion

Alternative computational formulae for deletion diagnostics from generalized linear models are useful in many instances when SAS software is employed. While their use was illustrated with logistic regression for which SAS PROC LOGISTIC provides direct results, the formulae apply to other GLIMs such as Poisson regression for which direct results are currently unavailable in SAS PROC GENMOD (version 8.2). In the case of non-canonical link GLIMs, the proposed diagnostic formulae apply to a broad

References (13)

  • D.A. Belsley et al. Regression Diagnostics (1980)
  • E. Dugan et al. Why older community-dwelling adults do not discuss urinary incontinence with their primary care physicians. J. Amer. Geriatrics Soc. (2001)
  • M.P. Fay. Measuring a binary response's range of influence in logistic regression. Amer. Statist. (2002)
  • J.W. Hardin et al. Generalized Estimating Equations (2003)
  • D.W. Hosmer et al. Applied Logistic Regression (1989)
  • P. McCullagh et al. Generalized Linear Models (1989)
There are more references available in the full text version of this article.

Cited by (5)

  • Generalized estimating equations and regression diagnostics for longitudinal controlled clinical trials: A case study

    2012, Computational Statistics and Data Analysis
    Citation Excerpt :

    Therefore, a series of different algorithms has been proposed in recent years to increase computation speed (Preisser and Perin, 2007; Preisser et al., 2008; Wei and Fung, 1999). Standard regression diagnostics are available only in a few standard packages, including SAS/IML macros (Hammill and Preisser, 2006; Preisser and Garcia, 2005) and PROC GENMOD (SAS ver. 9.2).

  • Deletion diagnostics for generalized linear models using the adjusted Poisson likelihood function

    2011, Journal of Statistical Planning and Inference
    Citation Excerpt :

    McCullagh and Nelder (1989, Chapter 12) described several diagnostic tools for checking GLMs. Due to these GLM measures that are not flexibly computed in the commonly used statistical packages, Preisser and Garcia (2005) introduced alternative computational formulae for calculating GLM diagnostics, which can be easily implemented by SAS software. However, the legitimacy of these traditional GLM diagnostics relies decisively on the correct specification of the underlying distributions.

  • A SAS/IML software program for GEE and regression diagnostics

    2006, Computational Statistics and Data Analysis
    Citation Excerpt :

    Similar to the diagnostics proposed by Pregibon (1981) for logistic regression (see Preisser and Garcia (2005) for generalized linear model computations using SAS), the GEE diagnostics are one-step approximations to the fully iterated “exact” quantities.
