Alternative computational formulae for generalized linear model diagnostics: identifying influential observations with SAS software
Introduction
Any statistical analysis of data should consider the effect that influential observations have on the results. For linear regression, exact formulae exist for the change in regression coefficients when an observation is deleted from the data set (Belsley et al., 1980). In contrast, iterative computational routines are required to obtain a solution for most generalized linear models (GLIM), so exact formulae for the change in regression coefficients are not available. However, simple computational formulae for single-case deletion diagnostics, such as leverage, Cook's distance and DFBETA, provide approximations to the desired quantities (Pregibon, 1981; Williams, 1987). Although these GLIM diagnostics were introduced about two decades ago, their use in practice appears to be more limited than that of their linear regression counterparts. One indication of this is that these diagnostics are not computed in the current version of SAS PROC GENMOD (version 8.2, SAS Institute Inc., 1999). This article introduces alternative computational formulae for calculating GLIM diagnostics. The proposed formulae are implemented with SAS PROC IML using observation statistics that are output from PROC GENMOD. The paper provides one illustration of how familiarity with the construction of common regression diagnostics formulae can lead to useful alternative formulae when the computer software of interest provides numerical values for only some of the component statistics, but not for the diagnostics directly.
The illustration involves data (Preisser and Qaqish, 1999) from the Guidelines for Urinary Incontinence Discussion and Evaluation (GUIDE) study, a randomized controlled trial directed at assessing the impact of urinary incontinence (UI) guideline adoption by primary care providers on patient outcomes. UI, or loss of bladder control, is a disorder that affects over 13 million men and women in the United States (Dugan et al., 2001). Although our illustration of the use of PROC GENMOD to obtain the diagnostics is based upon special formulae for a generalized linear model with canonical link function, general procedures for computing diagnostics are also described.
Section snippets
Traditional computational formulae for GLIM diagnostics
A generalized linear model has the form g(μ_i) = η_i for n independent observations, i = 1, …, n, where η_i = x_i^T β is the linear predictor defined by a p×1 covariate vector x_i and a parameter vector β of interest. The link function, g(·), relates the expected value of the response, μ_i = E(Y_i), to η_i. The mean, μ_i = b′(θ_i), expressed as the first derivative of the function b(θ_i), derives from the random part of the model given by a distribution belonging to the general exponential family (McCullagh and Nelder,
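As a concrete instance of the notation above, the following minimal Python sketch spells out the pieces for a Bernoulli response with the canonical (logit) link, where θ_i = η_i. The function names and the numerical values of x_i and β are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def b(theta):
    """Cumulant function b(theta) for the Bernoulli distribution: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def mean_from_theta(theta):
    """mu = b'(theta); for the Bernoulli case this is the inverse logit."""
    return 1.0 / (1.0 + math.exp(-theta))

def logit(mu):
    """Canonical link g(mu) = log(mu / (1 - mu)), so that g(mu) = theta = eta."""
    return math.log(mu / (1.0 - mu))

def linear_predictor(x, beta):
    """eta_i = x_i^T beta for a single covariate vector x_i."""
    return sum(xj * bj for xj, bj in zip(x, beta))

# Hypothetical values, for illustration only.
x_i = [1.0, 2.0]      # intercept plus one covariate (p = 2)
beta = [-0.5, 0.25]   # assumed parameter vector

eta_i = linear_predictor(x_i, beta)   # linear predictor
mu_i = mean_from_theta(eta_i)         # theta_i = eta_i under the canonical link
```

Applying `logit` to `mu_i` returns `eta_i`, which is the statement g(μ_i) = η_i for this particular model.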
Alternative computational formulae for GLIM diagnostics
Alternative expressions for h_i and DFBETA_i facilitate their computation in SAS. Given values of r_pi and r′_pi (e.g., provided by SAS PROC GENMOD), leverage, h_i, can be determined as follows:
Given h_i and r′_pi, the standard computational formula for D_i given at the end of Section 2 may be applied. For any GLIM, including those with non-canonical link functions, expression (6) may be calculated after first determining w_i as a
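The displayed expressions are not reproduced in this snippet, so the sketch below rests on stated assumptions rather than on the paper's own formulae. It assumes the usual definition of the standardized Pearson residual with dispersion fixed at 1, r′ = r / sqrt(1 − h), which rearranges to h = 1 − (r / r′)², and it assumes the Cook's distance form D = (r′)² h / (p (1 − h)) commonly attributed to Williams (1987). The function names are hypothetical.

```python
import math

def leverage_from_residuals(r_p, r_p_std):
    """Recover leverage from a Pearson residual and its standardized version.

    Assumes r_p_std = r_p / sqrt(1 - h) with dispersion equal to 1,
    so that h = 1 - (r_p / r_p_std)**2.
    """
    if r_p_std == 0.0:
        # Both residuals are zero here, so the ratio carries no information on h.
        raise ValueError("leverage is not identifiable from zero residuals")
    return 1.0 - (r_p / r_p_std) ** 2

def cooks_distance(r_p_std, h, p):
    """Assumed Cook's distance form: (r_p_std**2) * h / (p * (1 - h))."""
    return (r_p_std ** 2) * h / (p * (1.0 - h))

# Hypothetical observation: Pearson residual 1.2 with true leverage 0.36.
r_p = 1.2
r_p_std = r_p / math.sqrt(1.0 - 0.36)   # standardized residual, 1.2 / 0.8
h = leverage_from_residuals(r_p, r_p_std)
d = cooks_distance(r_p_std, h, p=4)
```

The point of the rearrangement is that no hat matrix is needed: two residual columns from the fitting procedure are enough to recover h for each observation, provided the residuals are nonzero.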
Implementation of alternative formulae in SAS
This section describes the computation of leverage, Cook's distance and DFBETA for GLIMs with canonical link function using PROC GENMOD. Because PROC GENMOD does not directly provide these diagnostics as output, PROC IML is used to perform the matrix arithmetic of expression (10), which also yields D_i, and of expression (11). The code provided below models the binary data from GUIDE using logistic regression. These data have been considered by Preisser and Qaqish (1999) and are available for download at
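The SAS code referred to above is not reproduced in this snippet. As a stand-in that is explicitly not the authors' PROC GENMOD and PROC IML program, the following Python sketch fits a two-parameter logistic regression by Newton-Raphson and then computes the textbook one-step quantities for a canonical link: leverage h_i = w_i x_i^T (X^T W X)^{-1} x_i, the assumed Cook's distance form (r′_i)² h_i / (p (1 − h_i)), and DFBETA_i = (X^T W X)^{-1} x_i (y_i − μ_i) / (1 − h_i). The toy data and all function names are hypothetical, and expressions (10) and (11) of the paper may be written differently.

```python
import math

def _moments(x, y, b0, b1):
    """Fitted means, weights and the 2x2 information matrix X^T W X."""
    mu = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    w = [m * (1.0 - m) for m in mu]           # binomial variance function
    a = sum(w)
    b = sum(wi * xi for wi, xi in zip(w, x))
    d = sum(wi * xi * xi for wi, xi in zip(w, x))
    return mu, w, a, b, d

def fit_logistic(x, y, iters=25):
    """Newton-Raphson for logit(mu_i) = b0 + b1 * x_i."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        mu, w, a, b, d = _moments(x, y, b0, b1)
        # Score vector X^T (y - mu)
        s0 = sum(yi - mi for yi, mi in zip(y, mu))
        s1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        det = a * d - b * b
        # Newton step: (X^T W X)^{-1} times the score
        b0 += (d * s0 - b * s1) / det
        b1 += (a * s1 - b * s0) / det
    return b0, b1

def diagnostics(x, y, b0, b1):
    """One-step deletion diagnostics for each observation (p = 2)."""
    mu, w, a, b, d = _moments(x, y, b0, b1)
    det = a * d - b * b
    inv = ((d / det, -b / det), (-b / det, a / det))
    rows = []
    for xi, yi, mi, wi in zip(x, y, mu, w):
        # v = (X^T W X)^{-1} x_i with x_i = (1, xi)
        v0 = inv[0][0] + inv[0][1] * xi
        v1 = inv[1][0] + inv[1][1] * xi
        h = wi * (v0 + xi * v1)               # leverage
        r_p = (yi - mi) / math.sqrt(wi)       # Pearson residual
        r_std = r_p / math.sqrt(1.0 - h)      # standardized Pearson residual
        cook = r_std ** 2 * h / (2.0 * (1.0 - h))
        scale = (yi - mi) / (1.0 - h)
        rows.append({"h": h, "cook": cook, "dfbeta": (v0 * scale, v1 * scale)})
    return rows

# Hypothetical toy data (not the GUIDE data).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(x, y)
rows = diagnostics(x, y, b0, b1)
```

A quick sanity check on any such implementation is that the leverages sum to the number of parameters, here 2, since they are the diagonal of a projection-type matrix.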
Illustration with GUIDE study data
Regression diagnostics are produced for the logistic regression model specified by the SAS code given by (12) and (13). Table 2 shows the parameter estimates for the complete data set of 137 observations and for a data set obtained by deleting observations from three patients. A few observations appear to have a large impact on the results, particularly for TOILET.
For the data set based upon all 137 observations, PROC GENMOD and PROC LOGISTIC gave identical results to three decimal places for
Discussion
Alternative computational formulae for deletion diagnostics from generalized linear models are useful in many instances when SAS software is employed. While their use was illustrated with logistic regression, for which SAS PROC LOGISTIC provides direct results, the formulae apply to other GLIMs such as Poisson regression, for which direct results are currently unavailable in SAS PROC GENMOD (version 8.2). In the case of non-canonical link GLIMs, the proposed diagnostic formulae apply to a broad
References (13)
- Belsley et al. (1980). Regression Diagnostics.
- Dugan et al. (2001). Why older community-dwelling adults do not discuss urinary incontinence with their primary care physicians. J. Amer. Geriatrics Soc.
- Measuring a binary response's range of influence in logistic regression (2002). Amer. Statist.
- et al. (2003). Generalized Estimating Equations.
- et al. (1989). Applied Logistic Regression.
- McCullagh and Nelder (1989). Generalized Linear Models.
Cited by (5)
- The gradient test statistic for outlier detection in generalized estimating equations (2024, Statistics and Probability Letters)
- Generalized estimating equations and regression diagnostics for longitudinal controlled clinical trials: A case study (2012, Computational Statistics and Data Analysis)
  Citation excerpt: Therefore, a series of different algorithms has been proposed in recent years to increase computation speed (Preisser and Perin, 2007; Preisser et al., 2008; Wei and Fung, 1999). Standard regression diagnostics are available only in a few standard packages, including SAS/IML macros (Hammill and Preisser, 2006; Preisser and Garcia, 2005) and PROC GENMOD (SAS ver. 9.2).
- Deletion diagnostics for generalized linear models using the adjusted Poisson likelihood function (2011, Journal of Statistical Planning and Inference)
  Citation excerpt: McCullagh and Nelder (1989, Chapter 12) described several diagnostic tools for checking GLMs. Due to these GLM measures that are not flexibly computed in the commonly used statistical packages, Preisser and Garcia (2005) introduced alternative computational formulae for calculating GLM diagnostics, which can be easily implemented by SAS software. However, the legitimacy of these traditional GLM diagnostics relies decisively on the correct specification of the underlying distributions.
- A SAS/IML software program for GEE and regression diagnostics (2006, Computational Statistics and Data Analysis)
  Citation excerpt: Similar to the diagnostics proposed by Pregibon (1981) for logistic regression (see Preisser and Garcia (2005) for generalized linear model computations using SAS), the GEE diagnostics are one-step approximations to the fully iterated “exact” quantities.
- Logistic regression diagnostics in ridge regression (2018, Computational Statistics)