Binary surrogates with stratified samples when weights are unknown

  • Original Paper
Computational Statistics

Abstract

In clinical practice, surrogate variables are commonly used as indirect measures when the primary outcome variable X, on which the assessment of disease status is based, is difficult or expensive to measure. In this article, we consider the problem of constructing an optimal binary surrogate Y to substitute for such a feature variable X. To retain samples that have rare values of X, the paired sample (X, Y) is usually selected by stratified sampling, where the strata are constructed from disjoint intervals within the support of X. Under such a sampling design, the stratum proportions are usually unknown, so proportional allocation is infeasible and the pairs (X, Y) cannot be regarded as an i.i.d. sample across strata. We estimate the unknown cutoff that separates higher from lower levels of X so as to optimally match the variable Y, and we provide true positive rates (TPR) adjusted for the disproportionate stratum weights. Our approach is to estimate the underlying distribution of X and then conduct an ad hoc estimation of the TPR and of the expected prediction error under the zero-one loss function. We develop a parametric estimate of the distribution of X under an exponential family assumption and a weighted kernel density estimator for when the distribution of X is unspecified. We illustrate our methods in various simulation studies and on a real example in which binary surrogates were evaluated for a medical device. The simulation results indicate that our approach performs well.
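
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the article's estimator: the article treats the stratum proportions as unknown and estimates them through the distribution of X, whereas this sketch assumes the proportions are given. All function names, the Gaussian kernel, the fixed bandwidth `h`, and the convention that 1{X > c} is the true status and Y the prediction are assumptions made for illustration only.

```python
import numpy as np

def stratum_weights(strata, proportions):
    """Per-observation weights p_s / n_s, normalised to sum to one.

    `proportions` maps each stratum label to an ASSUMED population
    proportion; every label in `strata` must appear in it.
    """
    strata = np.asarray(strata)
    w = np.zeros(len(strata), dtype=float)
    for s, p in proportions.items():
        mask = strata == s
        w[mask] = p / mask.sum()
    return w / w.sum()

def weighted_kde(x_grid, x, w, h):
    """Weighted Gaussian-kernel density estimate evaluated on x_grid."""
    z = (np.asarray(x_grid)[:, None] - np.asarray(x)[None, :]) / h
    k = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return (k * w[None, :]).sum(axis=1) / h

def weighted_tpr(x, y, w, c):
    """Weighted share of observations with X > c for which Y = 1."""
    pos = x > c
    return (w * (pos & (y == 1))).sum() / (w * pos).sum()

def weighted_error(x, y, w, c):
    """Weighted zero-one loss between 1{X > c} and Y."""
    return (w * ((x > c).astype(int) != y)).sum()

def best_cutoff(x, y, w):
    """Observed value of X that minimises the weighted zero-one loss."""
    candidates = np.unique(x)
    errors = [weighted_error(x, y, w, c) for c in candidates]
    return candidates[int(np.argmin(errors))]

# Toy data: two strata of equal sample size but unequal assumed proportions.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
y = np.array([0, 0, 0, 0, 1, 1, 1, 0])
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w = stratum_weights(strata, {0: 0.8, 1: 0.2})

c_hat = best_cutoff(x, y, w)
print(c_hat, weighted_tpr(x, y, w, c_hat), weighted_error(x, y, w, c_hat))
```

Searching only over observed values of X is enough here because the weighted zero-one loss is piecewise constant in c and changes only at data points.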

Notes

  1. For \(X < c\), \(\overline{Y}_1 = \frac{\sum_{i,s_1} Y_{si} + \cdots + \sum_{i,s^{*}} Y_{si}}{N_1} \approx \frac{n_1 w_1 \mu_1 + \cdots + n_{s^{*}} w_{s^{*}} \mu_1}{N_1} = \mu_1\)

Author information

Corresponding author

Correspondence to Yu-Min Huang.

About this article

Cite this article

Huang, YM. Binary surrogates with stratified samples when weights are unknown. Comput Stat 34, 653–682 (2019). https://doi.org/10.1007/s00180-018-0838-3

  • DOI: https://doi.org/10.1007/s00180-018-0838-3
