Multivariate cubic spline smoothing in multiple prediction

https://doi.org/10.1016/S0169-2607(01)00114-6

Abstract

Given longitudinal data for several variables, including a given outcome variable, it is desired to predict the outcome for a specific individual, or more generally experimental unit, in such a way that the predicted value is both accurate and resistant (i.e. it performs well in cross-validation). There are certain data-analytic difficulties associated with long-term multivariate longitudinal data that must be overcome in the prediction process. This paper provides a program written in the Statistical Analysis System (SAS) programming language, based generally on the Roche-Wainer-Thissen stature prediction model, that enables the researcher to overcome these difficulties.

Introduction

Consider a longitudinal set of data in which several (predictor) variables and an outcome variable of interest, Y, are measured with respect to a reference variable such as time. It is desired to construct a prediction model using such data so that Y can be predicted for a given individual (or, generally, experimental unit) at a specific time. In general, non-longitudinal variables may also be included in the model as predictors. An example of such a data set is the set of stature, weight, and skeletal age measurements for boys measured every 6 months from age 3 to 18 years. We might wish to predict stature at age 18 (regarded as adult stature) for a 5-year-old boy having a given stature, weight, and skeletal age. As a non-longitudinal predictor, we might include the average stature of the child's parents.

In order to build a prediction model for data of this structure, several data-analytic problems must be overcome. These problems are enumerated below.

  1. The predictor variables may be correlated with one another.
  2. The sample sizes from one time point to the next may be highly variable.
  3. Using the least squares estimates of regression coefficients may not provide resistant estimates (i.e. estimates that compare well in cross-validation).
  4. Typically it is desired that the regression coefficient for a given predictor be reasonably similar at adjacent time points.

If the predictor variables are correlated (problem 1), then the errors of fit caused by subsequent smoothing will be correlated. To avoid this, the matrix of least squares coefficients (t time points by p predictors) is orthonormalized columnwise. Some form of smoothing of the orthonormalized regression coefficients is then used to address problems 2 and 4. Finally, the smoothing method should be non- or semi-parametric and sufficiently flexible to address problem 3. If the smoothing method overfits the data, the prediction technique will not be resistant; i.e. it will give discrepant predictions in cross-validation. If it underfits the data, the prediction technique will not be sufficiently accurate.
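The columnwise orthonormalization step can be illustrated with a minimal sketch. It is written in Python rather than SAS, it assumes a modified Gram-Schmidt scheme (the text does not say which algorithm the program uses), and the function name `orthonormalize_columns` and the small coefficient matrix are hypothetical, chosen only for illustration.

```python
import math


def orthonormalize_columns(matrix):
    """Return a copy of `matrix` (a list of rows) with orthonormal columns.

    Uses modified Gram-Schmidt; assumes the columns are linearly independent.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # cols[j] holds the j-th column as a plain list.
    cols = [[matrix[i][j] for i in range(n_rows)] for j in range(n_cols)]
    for j in range(n_cols):
        norm = math.sqrt(sum(v * v for v in cols[j]))
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        cols[j] = [v / norm for v in cols[j]]
        # Remove the component along column j from every later column.
        for k in range(j + 1, n_cols):
            dot = sum(a * b for a, b in zip(cols[j], cols[k]))
            cols[k] = [b - dot * a for a, b in zip(cols[j], cols[k])]
    # Transpose back to row-major order (time points by predictors).
    return [[cols[j][i] for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical coefficient matrix: 4 time points (rows) by 2 predictors (columns).
coef = [[1.0, 0.9], [1.1, 1.0], [1.2, 1.3], [1.3, 1.2]]
q = orthonormalize_columns(coef)
```

After the call, each column of `q` has unit length and the columns are mutually orthogonal, so a smoother applied to one column does not interact with the others.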


Statistical method

Wainer and Thissen [1] provide the statistical methodology for the construction of a prediction model based on long-term longitudinal data involving several variables. Their methodology is designed to address the difficulties enumerated above. After an extensive comparative study, Khamis and Guo [2] provided a modification to this methodology which leads to a number of improvements in the prediction model. The new methodology is referred to as multivariate cubic spline smoothing, denoted by MCS2
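As a rough illustration of what smoothing a sequence of coefficients with a cubic spline involves, the sketch below fits a generic least-squares cubic spline (truncated power basis) to one coefficient series over 30 time points. It is not the MCS2 procedure, whose steps are not reproduced in this excerpt; the knot positions, the data, and all function names are assumptions made for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def spline_basis(u, knots):
    """Truncated power basis of a cubic spline at scaled position u."""
    return [1.0, u, u ** 2, u ** 3] + [max(u - k, 0.0) ** 3 for k in knots]


def smooth_with_cubic_spline(times, values, knots):
    """Least-squares cubic spline fit; returns the fitted (smoothed) values."""
    lo, hi = min(times), max(times)

    def scale(t):
        # Map time to [0, 1] to keep the normal equations well conditioned.
        return (t - lo) / (hi - lo)

    scaled_knots = [scale(k) for k in knots]
    rows = [spline_basis(scale(t), scaled_knots) for t in times]
    m = len(rows[0])
    # Normal equations (B'B) beta = B'y.
    btb = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    bty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(m)]
    beta = solve_linear_system(btb, bty)
    return [sum(b * c for b, c in zip(beta, r)) for r in rows]


# Hypothetical series: a linear trend plus alternating noise over 30 time points.
times = list(range(1, 31))
noisy = [0.5 + 0.02 * t + (0.05 if t % 2 else -0.05) for t in times]
smoothed = smooth_with_cubic_spline(times, noisy, knots=[10, 20])
```

More knots make the fit more flexible and push it toward overfitting, while fewer knots push it toward underfitting, which is the accuracy versus resistance trade-off described in the Introduction.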

Computer program

A program with documentation for carrying out the MCS2 procedure is given on the internet at the following address: http://www.math.wright.edu/Statistics/MCS2.sas

The program is written for four predictors (p=4), called a, b, c, and d, and t=30 time points (1, 2, …, 30); however, it is easily modified to accommodate arbitrary positive integer values of p and t. Similarly, the choice of knots in the program is k1=1 and k2=4, where c1=15 in step C of the procedure and e1=6, e2=12, e3=18, and e4

Example

The Fels Longitudinal Study is the largest and oldest longitudinal study of human growth and development in the world [4]. It is managed by the Division of Human Biology in the School of Medicine at Wright State University, Dayton, OH, USA. Since prediction of adult stature in children can be important for many medical and psychological reasons (see [5] for a discussion), data from the Fels Longitudinal Study were used to construct a stature prediction model. The specific data used were stature

Discussion

The program provided in this paper addresses the following data structure:

  1. Several continuous variables are measured with respect to a reference variable, such as time.
  2. One such variable is the outcome variable which is to be predicted for a given individual (or more generally, experimental unit) at a given time.
  3. The remaining variables are predictors.

It is desired that the prediction procedure be both accurate and resistant. In order to accommodate such a procedure, several data analytic

References (6)

  • H. Wainer et al. Multivariate semi-metric smoothing in multiple prediction. J. Am. Statistical Assoc. (1975)
  • H. Khamis et al. Improvement in the Roche-Wainer-Thissen stature prediction model: a comparative study. Am. J. Hum. Biology (1993)
  • Å. Björck. Solving linear least squares problems by Gram-Schmidt orthogonalization. Nord. T. Informationsbehandlung (1967)

There are more references available in the full text version of this article.

