A Bayesian approach to sample size determination for studies designed to evaluate continuous medical tests

https://doi.org/10.1016/j.csda.2009.09.024

Abstract

We develop a Bayesian approach to sample size and power calculations for cross-sectional studies that are designed to evaluate and compare continuous medical tests. For studies that involve one test or two conditionally independent or dependent tests, we present methods that are applicable when the true disease status of sampled individuals will be available and when it will not. Within a hypothesis testing framework, we consider the goal of demonstrating that a medical test has area under the receiver operating characteristic (ROC) curve that exceeds a minimum acceptable level or another relevant threshold, and the goals of establishing the superiority or equivalence of one test relative to another. A Bayesian average power criterion is used to determine a sample size that will yield high posterior probability, on average, of a future study correctly deciding in favor of these goals. The impacts on Bayesian average power of prior distributions, the proportion of diseased subjects in the study, and correlation among tests are investigated through simulation. The computational algorithm we develop involves simulating multiple data sets that are fit with Bayesian models using Gibbs sampling, and is executed by using WinBUGS in tandem with R.

Introduction

Medical tests are used to accurately classify individuals into one of several groups. In the two-group classification problem that we consider here, one or two tests are used to distinguish between two groups of individuals, which, for ease of discussion, we will refer to as a “diseased” (D) group and a “non-diseased” (D̄) group. One phase in the development of a new medical test involves characterizing the test’s ability to accurately discern D from D̄ individuals in the target population. The accuracy of a continuous test can be quantified by first defining a cutoff threshold, c, for a positive test, and then estimating the sensitivity, η(c), and specificity, θ(c), of the test at that cutoff. The parameter η(c) denotes the probability of a diseased individual having a positive test result at cutoff c, and θ(c) is the probability of a non-diseased individual having a negative result. Without loss of generality we adopt the usual convention that test scores (y) are expected to be larger for the D group, so that η(c)=Pr(y>c|D) and θ(c)=Pr(y<c|D̄).
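As a concrete illustration of these definitions (our sketch, not code from the paper), the empirical sensitivity η(c) and specificity θ(c) at a cutoff c can be computed directly from observed scores; the function name and toy data are hypothetical:

```python
def sens_spec(scores_d, scores_nd, c):
    """Empirical sensitivity eta(c) = Pr(y > c | D) and
    specificity theta(c) = Pr(y < c | D-bar) at cutoff c."""
    eta = sum(y > c for y in scores_d) / len(scores_d)
    theta = sum(y < c for y in scores_nd) / len(scores_nd)
    return eta, theta

# toy scores: diseased subjects tend to score higher
d = [2.1, 3.4, 1.9, 2.8]
nd = [0.5, 1.2, 0.9, 1.8]
print(sens_spec(d, nd, 1.5))  # -> (1.0, 0.75)
```

Raising c trades sensitivity for specificity, which is exactly the trade-off the ROC curve traces out.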

Instead of focusing inference on a single cutoff value, an alternative approach to evaluating the accuracy of continuous tests, which avoids the loss of information that comes from dichotomization, involves estimating the receiver operating characteristic (ROC) curve. The ROC curve represents the plot of a test's true positive fraction (sensitivity) versus its false positive fraction (1−specificity) across all possible cutoff thresholds. Thus, the ROC curve is obtained by plotting the pairs (1−θ(c), η(c)) for all values of c. The area under the ROC curve (AUC) is a summary index that measures the overall accuracy of a test, reflecting with equal weight the test's ability to distinguish between subjects with and without a medical condition. The value of AUC typically ranges from 0.5 (for a useless diagnostic procedure that classifies disease status in a purely random fashion) to 1 (for tests that have perfect classification accuracy). In this paper, we treat AUC as the focal parameter for use in evaluating and comparing continuous medical tests when true disease status is known and when it is not, and we develop a simulation-based procedure for sample size estimation and power calculations in these contexts.
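For intuition (this sketch is ours, not the authors' code), the empirical AUC equals the Mann–Whitney probability that a randomly chosen diseased score exceeds a randomly chosen non-diseased score, with ties counted as one half:

```python
def empirical_auc(scores_d, scores_nd):
    """Empirical AUC: the Mann-Whitney probability that a randomly
    chosen diseased score exceeds a non-diseased one (ties count 1/2)."""
    wins = 0.0
    for yd in scores_d:
        for ynd in scores_nd:
            if yd > ynd:
                wins += 1.0
            elif yd == ynd:
                wins += 0.5
    return wins / (len(scores_d) * len(scores_nd))

# 14 of the 16 (diseased, non-diseased) pairs are correctly ordered
print(empirical_auc([2.1, 3.4, 1.9, 2.8], [0.5, 1.2, 2.2, 1.8]))  # -> 0.875
```

This pairwise interpretation is what makes AUC a natural single-number summary of a test's discriminatory ability.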

We emphasize, at the outset, that although our focus is on the use of medical tests to classify health status, and our notation and terminology are consistent with biomedical applications, the methods presented in this paper apply more broadly. For instance, the methods we develop here can aid in sample size selection to investigate any general continuous classification procedure.

The remainder of the paper is organized as follows. Common goals of test accuracy studies and some background on ROC analysis are outlined in Section 2. Section 3 details the Bayesian models that we use in our sample size determination procedure. In Section 4 we discuss the Bayesian average power criterion used in our computational algorithm. Results from simulations are presented in Section 5, and concluding remarks are given in Section 6.

Section snippets

Goals and background

In designing a study that will measure and/or compare test performance, an appropriate sample size is needed: one that ensures adequate statistical power without overextending limited resources. We consider study designs that involve either a single medical test or two conditionally independent or correlated tests. The possible goals of test accuracy studies are numerous. We focus on three common goals, and we note that many other cases can be handled with slight modifications of the ideas and …

One gold-standard test

Let TS_i^D (i = 1, …, n1) and TS_j^D̄ (j = 1, …, n2) denote scores of a new test obtained from a random sample of n (n = n1 + n2) individuals who have a disease or are disease-free, respectively. We suppose that a gold-standard (GS) test has been used to identify each individual's true disease status before the application of the new test. We further assume that TS_i^D and TS_j^D̄ are both normally distributed, or can be modeled with normal families after an appropriate transformation is applied, namely TS_i^D ∼ N(μ_D, σ_D²), i = 1, …, n1, and TS_j^D̄ ∼ N(μ_D̄, σ_D̄²), j = 1, …, n2.
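Under this binormal model the AUC has the well-known closed form Φ((μ_D − μ_D̄)/√(σ_D² + σ_D̄²)), where Φ is the standard normal CDF. A minimal sketch (our illustration, not the paper's code) using only the standard library:

```python
from math import erf, sqrt

def binormal_auc(mu_d, sd_d, mu_nd, sd_nd):
    """AUC under the binormal model:
    Phi((mu_D - mu_Dbar) / sqrt(sigma_D^2 + sigma_Dbar^2))."""
    z = (mu_d - mu_nd) / sqrt(sd_d**2 + sd_nd**2)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))  # Phi(z) via the error function

print(binormal_auc(2.0, 1.0, 0.0, 1.0))  # ≈ 0.921
```

Equal means give AUC = 0.5 (a useless test), and the AUC grows toward 1 as the standardized separation of the two score distributions increases.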

Bayesian power criterion and simulation algorithm

We apply the Bayesian power criterion proposed by Wang and Gelfand (2002) for hypothesis testing. For case (i), this criterion selects a combination of n1 and n2 (GS test), or n (no-gold-standard (NGS) test), so that the posterior probability, averaged over potential future data sets, that the AUC of the new diagnostic test exceeds some benchmark, μ_0, is sufficiently high when in fact the AUC is expected to be greater than μ_0. Specifically, the average power criterion in this setting is E{Pr(AUC_N > μ_0 | data)}, where the expectation is taken over future data sets.
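The paper implements the full Gibbs-sampling version of this computation in WinBUGS driven from R. The Python sketch below illustrates only the outer Monte Carlo loop under strong simplifications that are ours, not the authors': a common known σ for both groups and flat priors on the means, so that the posterior of each mean is normal around the sample mean. All function names and design values are hypothetical:

```python
import random
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def average_power(n1, n2, mu_d, mu_nd, sigma, mu0,
                  n_datasets=200, n_post=400, seed=1):
    """Monte Carlo sketch of Bayesian average power E{Pr(AUC > mu0 | data)}.
    Simplifications: common known sigma, flat priors on the two means."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_datasets):
        # simulate one future data set from design (sampling-prior) values
        yd = [rng.gauss(mu_d, sigma) for _ in range(n1)]
        ynd = [rng.gauss(mu_nd, sigma) for _ in range(n2)]
        md, mnd = sum(yd) / n1, sum(ynd) / n2
        # posterior draws of AUC = Phi((mu_D - mu_Dbar) / (sigma * sqrt(2)))
        hits = 0
        for _ in range(n_post):
            pd = rng.gauss(md, sigma / sqrt(n1))
            pnd = rng.gauss(mnd, sigma / sqrt(n2))
            hits += phi((pd - pnd) / (sigma * sqrt(2.0))) > mu0
        total += hits / n_post  # estimate of Pr(AUC > mu0 | data)
    return total / n_datasets

# average power is high when the design AUC comfortably exceeds mu0
print(average_power(30, 30, 1.5, 0.0, 1.0, 0.7))
```

Increasing n1 and n2, or lowering the benchmark μ_0, raises the average power; a sample size search simply repeats this calculation over a grid of (n1, n2).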

Illustrations

We consider sample size calculations in a variety of scenarios. We first investigate the impact of μ_0 and λ_0 on the required sample size. In the one-test and two-test settings, we compare Bayesian average power when disease status is available with the case where this information is not ascertained. The simulations also assess the influence of sampling priors on average power, and we study the effect of the fraction of non-diseased versus diseased subjects. Moreover, we discuss the influence …

Conclusions

We present a Bayesian approach to sample size and power calculations for ROC studies designed to measure and compare the performance of medical tests. The criterion adopted for this problem is Bayesian average power, which can be applied to several common study designs involving a single test or two tests, both with and without gold-standard information. Through simulation studies, we illustrate the impact of effect size and the ratio of the number of diseased to non-diseased subjects enrolled …

Acknowledgments

We thank two anonymous referees for their helpful suggestions, which resulted in an improved manuscript.

References (15)

  • A.J. Branscum et al., Estimation of diagnostic test sensitivity and specificity through Bayesian modeling, Preventive Veterinary Medicine (2005)
  • P.S. Albert, Random effects modeling approaches for estimating ROC curves from repeated ordinal tests without a gold standard, Biometrics (2007)
  • A.J. Branscum et al., Sample size calculations for studies designed to evaluate diagnostic test accuracy, Journal of Agricultural, Biological, and Environmental Statistics (2007)
  • A.J. Branscum et al., Bayesian semiparametric ROC curve estimation and disease diagnosis, Statistics in Medicine (2008)
  • D. Cheng et al., Bayesian approach to average power calculation for binary regression with misclassified outcomes, Statistics in Medicine (2009)
  • Y.-K. Choi et al., Bayesian inferences for receiver operating characteristic curves in the absence of a gold standard, Journal of Agricultural, Biological, and Environmental Statistics (2006)
  • A. Erkanli et al., Bayesian semi-parametric ROC analysis, Statistics in Medicine (2006)
There are more references available in the full text version of this article.
