A Bayesian approach to sample size determination for studies designed to evaluate continuous medical tests
Introduction
Medical tests are used to accurately classify individuals into one of several groups. In the two-group classification problem that we consider here, one or two tests are used to distinguish between two groups of individuals, which, for ease of discussion, we will refer to as a "diseased" group and a "non-diseased" group. One phase in the development of a new medical test involves characterizing the test's ability to accurately discern diseased from non-diseased individuals in the target population. The accuracy of a continuous test can be quantified by first defining a cutoff threshold, c, for a positive test, and then estimating the sensitivity, Se(c), and specificity, Sp(c), of the test at that cutoff. The sensitivity Se(c) denotes the probability of a diseased individual having a positive test result at cutoff c, and Sp(c) is the probability of a non-diseased individual having a negative result. Without loss of generality we adopt the usual convention that test scores are expected to be larger for the diseased group, so that Se(c) is the probability that a diseased individual's score exceeds c, and Sp(c) is the probability that a non-diseased individual's score is at most c.
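Given observed scores and the convention that larger scores indicate disease, empirical estimates of sensitivity and specificity at a given cutoff follow directly from these definitions. A minimal Python sketch (the function name and data are illustrative, not from the paper):

```python
def empirical_se_sp(diseased, nondiseased, c):
    """Empirical sensitivity and specificity at cutoff c, using the
    convention that scores above c are called positive."""
    se = sum(1 for y in diseased if y > c) / len(diseased)
    sp = sum(1 for x in nondiseased if x <= c) / len(nondiseased)
    return se, sp

# Illustrative scores: diseased subjects tend to score higher.
print(empirical_se_sp([1.4, 2.1, 0.9, 3.0], [0.2, 1.1, -0.5, 0.8], c=1.0))
# → (0.75, 0.75)
```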
Instead of focusing inference on a single cutoff value, an alternative approach to evaluating the accuracy of continuous tests, one that avoids the loss of information caused by dichotomization, involves estimating the receiver operating characteristic (ROC) curve. The ROC curve is the plot of a test's true positive fraction (sensitivity) versus its false positive fraction (1 − specificity) across all possible cutoff thresholds. Thus, the ROC curve is obtained by plotting the pairs (1 − Sp(c), Se(c)) for all values of c. The area under the ROC curve (AUC) is a summary index that measures the overall accuracy of a test, reflecting with equal weight the test's ability to distinguish between subjects with and without a medical condition. The value of AUC typically ranges from 0.5 (for a useless diagnostic procedure that classifies disease status in a purely random fashion) to 1 (for tests that have perfect classification accuracy). In this paper, we treat AUC as the focal parameter for evaluating and comparing continuous medical tests when true disease status is known and when it is not, and we develop a simulation-based procedure for sample size estimation and power calculations in these contexts.
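The AUC also equals the probability that a randomly chosen diseased subject scores higher than a randomly chosen non-diseased subject, so it can be estimated nonparametrically via the Mann-Whitney statistic. A minimal sketch (function name and simulated data are illustrative assumptions):

```python
import random

def empirical_auc(diseased, nondiseased):
    """Empirical AUC via the Mann-Whitney statistic: the proportion of
    (diseased, non-diseased) score pairs in which the diseased score is
    higher, with ties counting one half."""
    wins = 0.0
    for y in diseased:
        for x in nondiseased:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(diseased) * len(nondiseased))

# Well-separated groups give AUC near 1; identical groups give 0.5.
random.seed(1)
d = [random.gauss(2.0, 1.0) for _ in range(500)]   # diseased scores
nd = [random.gauss(0.0, 1.0) for _ in range(500)]  # non-diseased scores
print(round(empirical_auc(d, nd), 2))
```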
We emphasize, at the outset, that although our focus is on the use of medical tests to classify health status, and our notation and terminology are consistent with biomedical applications, the methods presented in this paper apply more broadly. For instance, the methods we develop here can aid in sample size selection to investigate any general continuous classification procedure.
The remainder of the paper is organized as follows. Common goals of test accuracy studies and some background on ROC analysis are outlined in Section 2. Section 3 details the Bayesian models that we use in our sample size determination procedure. In Section 4 we discuss the Bayesian average power criterion used in our computational algorithm. Results from simulations are presented in Section 5, and concluding remarks are given in Section 6.
Goals and background
Designing a study that will measure and/or compare test performance requires an appropriate sample size that ensures adequate statistical power without overextending limited resources. We consider study designs that involve either a single medical test, or two conditionally independent or correlated tests. The possible goals of test accuracy studies are numerous. We focus on three common goals and note that many other cases can be handled with slight modifications of the ideas and
One gold-standard test
Consider scores of a new test obtained from random samples of diseased and disease-free individuals. We suppose that a gold-standard (GS) test has been used to identify each individual's true disease status before the application of the new test. We further assume that the scores in both groups are normally distributed, or can be modeled with normal families after an appropriate transformation is applied.
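Under this binormal assumption the AUC has the standard closed form AUC = Φ((μ_D − μ_D̄) / sqrt(σ_D² + σ_D̄²)), where Φ is the standard normal CDF and μ and σ denote each group's mean and standard deviation; these symbols are generic and not necessarily the paper's notation. A small sketch:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def binormal_auc(mu_d, sigma_d, mu_nd, sigma_nd):
    """Closed-form AUC when diseased scores ~ N(mu_d, sigma_d^2) and
    non-diseased scores ~ N(mu_nd, sigma_nd^2):
        AUC = Phi((mu_d - mu_nd) / sqrt(sigma_d^2 + sigma_nd^2))."""
    return normal_cdf((mu_d - mu_nd) / sqrt(sigma_d**2 + sigma_nd**2))

print(round(binormal_auc(2.0, 1.0, 0.0, 1.0), 3))  # → 0.921
```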
Bayesian power criterion and simulation algorithm
We apply the Bayesian average power criterion proposed by Wang and Gelfand (2002) for hypothesis testing. For case (i), this criterion selects the diseased and non-diseased group sample sizes (GS test), or the total sample size (NGS test), so that the posterior probability that the AUC of the new diagnostic test exceeds some benchmark, averaged over potential future data sets, is sufficiently high when in fact the AUC is expected to be greater than that benchmark.
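As a concrete illustration of this criterion, the following sketch estimates Bayesian average power by Monte Carlo in a deliberately simplified setting that is not the paper's model: normal scores with known, equal variances, a conjugate normal fitting prior on each group mean, and an illustrative sampling prior centred where the test is genuinely useful. All function names, prior settings, and the benchmark value are assumptions made for this example.

```python
import random
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def posterior_mean_sd(data, prior_mean, prior_sd, sigma):
    """Conjugate normal-normal posterior for a group mean, assuming a
    known sampling standard deviation `sigma`."""
    precision = 1.0 / prior_sd**2 + len(data) / sigma**2
    mean = (prior_mean / prior_sd**2 + sum(data) / sigma**2) / precision
    return mean, sqrt(1.0 / precision)

def average_power(n_d, n_nd, auc0=0.75, reps=200, draws=500, sigma=1.0):
    """Monte Carlo estimate of Bayesian average power: the posterior
    probability that AUC > auc0, averaged over data sets generated
    under the sampling prior."""
    total = 0.0
    for _ in range(reps):
        # Sampling prior: group means drawn so the true AUC tends to
        # exceed the benchmark auc0.
        mu_d, mu_nd = random.gauss(1.2, 0.1), random.gauss(0.0, 0.1)
        y = [random.gauss(mu_d, sigma) for _ in range(n_d)]
        x = [random.gauss(mu_nd, sigma) for _ in range(n_nd)]
        # Vague fitting priors used to analyze each simulated data set.
        m_d, s_d = posterior_mean_sd(y, 0.0, 10.0, sigma)
        m_nd, s_nd = posterior_mean_sd(x, 0.0, 10.0, sigma)
        # Posterior draws of the AUC under the binormal model.
        hits = 0
        for _ in range(draws):
            delta = random.gauss(m_d, s_d) - random.gauss(m_nd, s_nd)
            if normal_cdf(delta / sqrt(2.0 * sigma**2)) > auc0:
                hits += 1
        total += hits / draws
    return total / reps

random.seed(7)
print(round(average_power(n_d=40, n_nd=40), 2))
```

In a sample size search, a function like this would be evaluated over a grid of candidate group sizes, retaining the smallest combination whose average power reaches the desired level.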
Illustrations
We consider sample size calculations in a variety of scenarios. We first investigate the impact of key design quantities on the required sample size. In the one-test and two-test settings, we compare Bayesian average power when disease status is available with the case where this information is not ascertained. The simulations also assess the influence of the sampling priors on average power, and we study the effect of the fraction of non-diseased versus diseased subjects. Moreover, we discuss the influence
Conclusions
We present a Bayesian approach to sample size and power calculations for ROC studies designed to measure and compare the performance of medical tests. The criterion adopted for this problem is Bayesian average power, which can be applied to several common study designs involving a single test or two tests, both with and without gold standard information. Through simulation studies, we illustrate the impact of effect size, the ratio of the number of diseased to non-diseased subjects enrolled
Acknowledgments
We thank two anonymous referees for their helpful suggestions, which resulted in an improved manuscript.
References (15)
- et al. (2005). Estimation of diagnostic test sensitivity and specificity through Bayesian modeling. Preventive Veterinary Medicine.
- (2007). Random effects modeling approaches for estimating ROC curves from repeated ordinal tests without a gold standard. Biometrics.
- et al. (2007). Sample size calculations for studies designed to evaluate diagnostic test accuracy. Journal of Agricultural, Biological, and Environmental Statistics.
- et al. (2008). Bayesian semiparametric ROC curve estimation and disease diagnosis. Statistics in Medicine.
- et al. (2009). Bayesian approach to average power calculation for binary regression with misclassified outcomes. Statistics in Medicine.
- et al. (2006). Bayesian inferences for receiver operating characteristic curves in the absence of a gold standard. Journal of Agricultural, Biological, and Environmental Statistics.
- et al. (2006). Bayesian semi-parametric ROC analysis. Statistics in Medicine.