Multivariate cubic spline smoothing in multiple prediction
Introduction
Consider a longitudinal set of data in which several (predictor) variables and an outcome variable of interest, Y, are measured with respect to a reference variable such as time. It is desired to construct a prediction model using such data so that Y can be predicted for a given individual (or, generally, experimental unit) at a specific time. In general, non-longitudinal variables may also be included in the model as predictors. An example of such a data set is the set of stature, weight, and skeletal age measurements for boys measured every 6 months from age 3 to 18 years. We might wish to predict stature at age 18 (regarded as adult stature) for a 5-year-old boy having a given stature, weight, and skeletal age. As a non-longitudinal predictor, we might include the average stature of the child's parents.
In order to build a prediction model for data of this structure, several data-analytic problems must be overcome. These problems are enumerated below.
1. The predictor variables may be correlated with one another.
2. The sample sizes from one time point to the next may be highly variable.
3. Using the least squares estimates of regression coefficients may not provide resistant estimates (i.e., estimates that compare well in cross-validation).
4. Typically it is desired that the regression coefficient for a given predictor be reasonably similar at adjacent time points.
If the predictor variables are correlated (problem 1), then the errors of fit caused by subsequent smoothing will be correlated. To avoid this, the matrix of least squares coefficients (t time points by p predictors) is orthonormalized columnwise. Some form of smoothing of the orthonormalized regression coefficients is then used to address problems 2 and 4. Finally, the smoothing method should be non- or semi-parametric and sufficiently flexible to address problem 3. If the smoothing method overfits the data, the prediction technique will not be resistant; i.e., it will give discrepant predictions in cross-validation. If it underfits the data, the prediction technique will not be sufficiently accurate.
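The columnwise orthonormalization step can be illustrated with a minimal sketch. It assumes modified Gram-Schmidt as the orthonormalization algorithm, which the text above does not specify, and it is written in Python for illustration only; it is not taken from the MCS2 program. The function name `orthonormalize_columns` and the toy 4 by 2 coefficient matrix `B` are hypothetical.

```python
import math


def orthonormalize_columns(B):
    """Modified Gram-Schmidt on the columns of B.

    B is a list of t rows, each a list of p numbers (t time points by
    p predictors). Returns (Q, R) with orthonormal columns in Q, upper
    triangular R, and B = Q R.
    """
    t, p = len(B), len(B[0])
    # Pull out each column as its own list of length t.
    cols = [[B[i][j] for i in range(t)] for j in range(p)]
    q_cols = []
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = cols[j][:]
        # Remove the component along each previously built direction,
        # always projecting the *updated* v (the "modified" variant).
        for k, q in enumerate(q_cols):
            R[k][j] = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - R[k][j] * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        R[j][j] = norm
        q_cols.append([vi / norm for vi in v])
    # Convert back to row-major t x p layout.
    Q = [[q_cols[j][i] for j in range(p)] for i in range(t)]
    return Q, R


# Hypothetical coefficient matrix: 4 time points, 2 predictors.
B = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]]
Q, R = orthonormalize_columns(B)
```

Because `R` is returned alongside `Q`, anything computed on the orthonormal columns can be mapped back to the scale of the original coefficients through `B = Q R`.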
Statistical method
Wainer and Thissen [1] provide the statistical methodology for the construction of a prediction model based on long-term longitudinal data involving several variables. Their methodology is designed to address the difficulties enumerated above. After an extensive comparative study, Khamis and Guo [2] provided a modification to this methodology which leads to a number of improvements in the prediction model. The new methodology is referred to as multivariate cubic spline smoothing, denoted by MCS2
Computer program
A program with documentation for carrying out the MCS2 procedure is given on the internet at the following address: http://www.math.wright.edu/Statistics/MCS2.sas
The program is written for four predictors (p=4), called a, b, c, and d, and t=30 time points (1, 2, …, 30); however, it is easily modified to accommodate arbitrary positive integer values of p and t. Similarly, the knots chosen in the program are k1=1 and k2=4, where c1=15 in step C of the procedure and e1=6, e2=12, e3=18, and e4
Example
The Fels Longitudinal Study is the largest and oldest longitudinal study of human growth and development in the world [4]. It is managed by the Division of Human Biology in the School of Medicine at Wright State University, Dayton, OH, USA. Since prediction of adult stature in children can be important for many medical and psychological reasons (see [5] for a discussion), data from the Fels Longitudinal Study were used to construct a stature prediction model. The specific data used were stature
Discussion
The program provided in this paper addresses the following data structure:
1. Several continuous variables are measured with respect to a reference variable, such as time.
2. One such variable is the outcome variable which is to be predicted for a given individual (or more generally, experimental unit) at a given time.
3. The remaining variables are predictors.
It is desired that the prediction procedure be both accurate and resistant. In order to accommodate such a procedure, several data analytic
References
- et al. Multivariate semi-metric smoothing in multiple prediction. J. Am. Statistical Assoc. (1975)
- et al. Improvement in the Roche-Wainer-Thissen stature prediction model: a comparative study. Am. J. Hum. Biology (1993)
- Solving linear least squares problems by Gram-Schmidt orthogonalization. Nord. T. Informationsbehandlung (1967)