Abstract
In clinical practice, surrogate variables are commonly used as indirect measures when the primary outcome variable X, on which the assessment of disease status is based, is difficult or expensive to measure. In this article, we consider the problem of constructing an optimal binary surrogate Y to substitute for such a feature variable X. To retain samples that have rare values of X, the paired sample (X, Y) is usually selected by stratified sampling, where the strata are constructed from disjoint intervals of the support of X. Under such a sampling design, the stratum proportions are usually unknown, so proportional allocation is infeasible and the pairs (X, Y) cannot be regarded as an i.i.d. sample across strata. We estimate the unknown cutoff separating higher from lower levels of X that optimally matches the variable Y, and we provide the true positive rates (TPR) adjusted for the disproportionate stratum weights. Our approach is to estimate the underlying distribution of X and then conduct an ad hoc estimation of the TPR and of the expected prediction error under the zero-one loss function. We develop a parametric estimator of the distribution of X under an exponential family assumption and a weighted kernel density estimator for the case in which the distribution of X is unspecified. We illustrate our methods with several simulation studies and with a real example in which binary surrogates were evaluated for a medical device. The simulation results indicate that our approach performs well.
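To make the role of the stratum weights concrete, the following is a minimal Python sketch of choosing a cutoff under weighted zero-one loss and computing a weight-adjusted TPR. It is not the estimator developed in the article: the article treats the stratum weights as unknown and estimates the distribution of X, whereas this sketch assumes per-observation weights are already available. The data, the stratum proportions (0.9 and 0.1), the function names, and the reading of TPR as P(Y = 1 | X >= c) are all assumptions made for illustration.

```python
# Illustrative sketch only: weights are assumed known here, whereas the
# article treats them as unknown. All names and data are hypothetical.

def weighted_error(xs, ys, ws, c):
    """Weighted zero-one loss of the rule 1{x >= c} against the surrogate y."""
    total = sum(ws)
    wrong = sum(w for x, y, w in zip(xs, ys, ws) if (x >= c) != (y == 1))
    return wrong / total

def best_cutoff(xs, ys, ws):
    """Observed value of x that minimizes the weighted zero-one loss."""
    candidates = sorted(set(xs))
    return min(candidates, key=lambda c: weighted_error(xs, ys, ws, c))

def weighted_tpr(xs, ys, ws, c):
    """Weighted estimate of P(Y = 1 | X >= c); NaN if no mass lies above c."""
    pos = sum(w for x, w in zip(xs, ws) if x >= c)
    hit = sum(w for x, y, w in zip(xs, ys, ws) if x >= c and y == 1)
    return hit / pos if pos > 0 else float("nan")

# Two strata on X: [0, 5) and [5, 10), each with 4 sampled pairs.
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

# Assumed population proportions 0.9 and 0.1, spread evenly within a stratum.
ws = [0.9 / 4] * 4 + [0.1 / 4] * 4

cutoff = best_cutoff(xs, ys, ws)
tpr_adjusted = weighted_tpr(xs, ys, ws, cutoff)
tpr_naive = weighted_tpr(xs, ys, [1.0] * len(xs), cutoff)  # ignores strata

print(cutoff, tpr_adjusted, tpr_naive)
```

In this toy sample the over-sampled upper stratum pulls the unweighted TPR away from the weighted one, which is the kind of distortion the adjustment for disproportionate stratum weights is meant to correct.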










Notes
For \(X < c\), \(\overline{Y}_1 = \frac{\sum_{i,s_1} Y_{si} + \cdots + \sum_{i,s^{*}} Y_{si}}{N_1} \approx \frac{n_1 w_1 \mu_1 + \cdots + n_{s^{*}} w_{s^{*}} \mu_1}{N_1} = \mu_1\).
Cite this article
Huang, YM. Binary surrogates with stratified samples when weights are unknown. Comput Stat 34, 653–682 (2019). https://doi.org/10.1007/s00180-018-0838-3
DOI: https://doi.org/10.1007/s00180-018-0838-3