NeuroImage

Volume 45, Issue 3, 15 April 2009, Pages 758-768

Measuring fMRI reliability with the intra-class correlation coefficient

https://doi.org/10.1016/j.neuroimage.2008.12.035

Abstract

The intra-class correlation coefficient (ICC) is a prominent statistic for measuring test–retest reliability of fMRI data. It can be used to address the question of whether regions of high group activation in a first scan session will show preserved subject differentiability in a second session. To this end, we present a method that extends voxel-wise ICC analysis. We show that voxels with high group activation are more likely to be reliable, if a subsequent session is performed, than typical voxels across the brain or across white matter. We also find that the existence of some voxels with high ICC but low group activation can be explained by stable signals across sessions that poorly fit the HRF model. At a region-of-interest level, we show that our voxel-wise ICC calculation is more robust than previous implementations under variations of smoothing and cluster size. The method also allows formal comparisons between the reliabilities of given brain regions, aimed at establishing which ROIs discriminate best between individuals. The method is applied to an auditory and a verbal working memory task. A reliability toolbox for SPM5 is provided at http://brainmap.co.uk.

Introduction

Test–retest studies are essential to determine the reliability of functional magnetic resonance imaging (fMRI). Together with studies of statistical power, they constitute the basis for the design of large longitudinal experiments. Previous test–retest studies have quantified fMRI reliability for a wide range of tasks, from primary sensory (motor, visual and auditory) to cognitive and emotional paradigms (Liu et al., 2004, Yoo et al., 2005, Kong et al., 2007, Rombouts et al., 1997, Kiehl and Liddle, 2003, Manoach et al., 2001, Wei et al., 2004, Aron et al., 2006, Johnstone et al., 2005). Their results vary, as do the statistical methods used to analyse the repeated measures. A prominent measure of reliability amongst these is the intra-class correlation coefficient (ICC), which reflects the ability of fMRI to assess differences in brain activity between subjects.

A number of methods specific to neuroimaging data have been developed to assess the stability of brain activation. An initial interest is to assess whether the volume of group activation in a first session is similar to that of a second session. Some studies (Yoo et al., 2005, Rombouts et al., 1997) focus on the extent of the activation, comparing the sessions by the number of activated voxels on each occasion. The main weakness of this approach is that it depends strongly on the statistical threshold used to define activation. One can easily conceive a hypothetical situation where both group maps are identical except for an additive constant across the whole brain volume. In that situation the method may report low agreement, when there is in fact a consistent signal distribution. A further limitation is that, even when the two activated volumes are the same, it does not indicate whether each subject activated consistently within the group. The same group activation could be obtained by a fortuitous rearrangement of individual activations.

A better alternative is to determine the areas of group activation in the first session and then ask if, in the same region, the rank order of subject activations will be preserved in a subsequent session. Or equivalently, we can ask whether the level of group activation of the first session can predict the consistency in subject activations. These issues can be addressed with standard statistical analysis, the ICC being the most appropriate.

The most general approach would be, however, to assess the repeatability of observations by quantifying the measurement error, given by the within-subject variance (Zandbelt et al., 2008). Repeatable activations are those whose within-subject variances are smaller than an agreed limit (Bland and Altman, 1986). In fMRI there is not yet a predetermined standard for accepting the error, and in consequence reliability is assessed more commonly than repeatability.

Reliability is understood as a relative scaling of the measurement error. Although it is often used interchangeably with reproducibility, we reserve the latter term for the more fundamental case of experimental results being independent of the experimenter or the population sample.

Two different types of statistic can be regarded as scalings of the measurement error. The first is the coefficient of variation (CV), where the error variance is scaled by the magnitude of activation. More precisely, the CV is the ratio between the standard deviation and the magnitude of signal change between two conditions. An example of its application to neuroimaging data is given by Tjandra et al. (2005), who compare the CV for BOLD and MR perfusion imaging. The main limitation of the coefficient of variation, for the purpose of this work, is that it cannot be used to assess the relative error when the observation values are low or negative, even if the rank order of the subjects is preserved.
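As a concrete illustration, the per-subject CV described above can be sketched as follows (the signal-change values are hypothetical):

```python
import numpy as np

# Hypothetical percent signal change for three subjects over two sessions
# (rows: subjects, columns: sessions).
signal = np.array([[1.2, 1.0],
                   [0.8, 0.9],
                   [1.5, 1.3]])

# The CV scales the within-subject standard deviation by the mean magnitude
# of activation; it becomes undefined or misleading when the mean signal
# change is near zero or negative, which is the limitation noted above.
within_sd = signal.std(axis=1, ddof=1)
cv = within_sd / signal.mean(axis=1)
```

Note that a subject whose rank order is perfectly preserved but whose mean activation is near zero would still produce an arbitrarily large CV, which is exactly why the ICC is preferred here.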

The second kind of scaling of the measurement error is the ICC (Shrout and Fleiss, 1979), defined as the ratio of the between-subject variance to the total variance. Given that the error variance is included in the total variance, in some cases the ICC can be rewritten in terms of the error variance divided by the between-subject variance. The coefficient conveniently assesses either the absolute or the consistent agreement of subject activations from session to session (McGraw and Wong, 1996). Intra-subject reliability ranges from zero (no reliability) to one (perfect reliability). As Bland and Altman (1996) explain, the coefficient can be understood as a measure of discrimination between subjects. In the context of neuroimaging data, it then allows the attribution of single observations (e.g. voxel t-scores) to subjects and, therefore, the tracking of individuals across sessions.
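For a two-session design, the consistency form of the coefficient, ICC(3,1) in the Shrout and Fleiss notation, can be estimated from a two-way mean-squares decomposition. A minimal numpy sketch (the input values are hypothetical per-subject summary statistics, not the paper's data):

```python
import numpy as np

def icc_consistency(y):
    """Shrout & Fleiss ICC(3,1): between-subject variance relative to total
    variance, from a two-way (subject x session) ANOVA decomposition.
    y is an n_subjects x k_sessions array."""
    n, k = y.shape
    grand = y.mean()
    ss_subj = k * np.sum((y.mean(axis=1) - grand) ** 2)    # between-subject SS
    ss_sess = n * np.sum((y.mean(axis=0) - grand) ** 2)    # between-session SS
    ss_err = np.sum((y - grand) ** 2) - ss_subj - ss_sess  # residual SS
    bms = ss_subj / (n - 1)                                # between-subject MS
    ems = ss_err / ((n - 1) * (k - 1))                     # error MS
    return (bms - ems) / (bms + (k - 1) * ems)

# A perfectly preserved rank order across sessions yields ICC = 1, even
# with a uniform session effect (here every value shifts by +1).
print(icc_consistency(np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])))  # → 1.0
```

Swapping subject ranks between sessions drives the coefficient negative, which is why negative ICC values carry information and are retained in the distributions below.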

A main feature of the ICC is that it is calculated from the variance structure of the data. Based on this characteristic, it has been used to show that the between-subject variance of BOLD activation is higher than the within-subject variance (Wei et al., 2004). A more recent study (Friedman et al., 2007) shows between-site reliability derived from a variance component analysis. Since the ICC depends exclusively on the variance, it can be computed for any level of activation. It can be shown (see Materials and methods section) that reliability brings additional, complementary information to group activation. In particular, voxels that fail a group t-test can still present high reliability, meaning that their measurements are consistent across sessions. For instance, non-linear responses that poorly fit the hemodynamic model may yet be consistent for each individual subject. A more fundamental question is whether voxels of high group activation in the first session are likely to be reliable, i.e. whether one-session group activation is a predictor of intra-subject reliability.

There are three main ICC implementations for neuroimaging data reported in the literature. Typically, a summary statistic for each subject is obtained for a region of interest (ROI). This can be the mean or median contrast value within the region, or the value of the contrast at the peak of group activation (Manoach et al., 2001, Wei et al., 2004, Kong et al., 2007, Raemaekers et al., 2007, Friedman et al., 2007). ICCs are then computed for these values. Obtaining one ICC for each activated ROI, one would like to ask if there are significant differences between regional reliabilities. From the ICC inferences in McGraw and Wong (1996), it can be shown that the low number of subjects common in neuroimaging experiments hinders the power to detect ROI differences in ICCs. A typical example can be found in Raemaekers et al. (2007), where a highly significant reliability of statistical sensitivity is reported, given by ICC = 0.80, p < 0.001, with confidence interval (0.45, 0.94), for a 12-subject experiment. The 95% confidence interval of ICCs is so large that significant differences between ROI reliabilities are difficult to obtain.
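The width of such intervals can be checked directly. A sketch of the F-distribution confidence interval for ICC(3,1) from Shrout and Fleiss (1979), evaluated at the Raemaekers et al. values (ICC = 0.80, n = 12); the function name is ours:

```python
from scipy import stats

def icc31_ci(icc, n, k=2, alpha=0.05):
    """F-based confidence interval for ICC(3,1) (Shrout & Fleiss, 1979)."""
    f0 = (1 + (k - 1) * icc) / (1 - icc)            # observed F = BMS / EMS
    df1, df2 = n - 1, (n - 1) * (k - 1)
    fl = f0 / stats.f.ppf(1 - alpha / 2, df1, df2)  # lower F bound
    fu = f0 * stats.f.ppf(1 - alpha / 2, df2, df1)  # upper F bound
    return (fl - 1) / (fl + k - 1), (fu - 1) / (fu + k - 1)

lo, hi = icc31_ci(0.80, n=12)   # roughly (0.45, 0.94), matching the reported CI
```

With only 12 subjects the interval spans half the ICC range, which illustrates why between-ROI differences are hard to establish at the level of one ICC per region.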

A second ICC implementation to compute regional reliabilities is a within-subject measurement (see Raemaekers et al. (2007) for ICC and Specht et al. (2003) for coefficients of determination). Here the reliability of the test–retest signal across ROI voxels is assessed for each subject. This is a measurement of the amount of total variance that can be explained by the intra-voxel variance, and tests the consistency of the spatial distribution of the BOLD signal in a given region, for each individual. Although within-subject ICC is evidently affected by spatial smoothing, it can be used to determine differences between subjects.

A final implementation is the computation of ICC maps (Specht et al., 2003, Aron et al., 2006, Jahng et al., 2005). Although a promising technique, it has not been fully exploited to overcome the limitations of the other methods. Aron and colleagues (2006) used voxel-wise ICCs to explore the reliability of activated regions of interest (ROI) for a classification-learning task. Importantly, they reported the distribution of positive ICC values across a region, and concluded that the relative number of such voxels in these regions is higher than in a non-activated area. However, they did not examine the whole brain volume or the white matter to account for reliability not associated with the task, nor did they assign reliability measures to particular regions.

In the present work we report the reliability of an ROI as the full distribution of ICC values (including negative values) in that region. The reliability distribution is then summarized by its median. This allows us to formally compare the reliabilities across ROIs, increasing the power to detect differences.
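Assuming a voxel-wise ICC map and a binary ROI mask are already in hand (the array names and values here are stand-ins, not the paper's data), the proposed summary is simply the median over the region, negative values included:

```python
import numpy as np

def roi_reliability(icc_map, mask):
    """ROI reliability = median of the full voxel-wise ICC distribution
    (including negative ICCs) within the masked region."""
    return np.median(icc_map[mask])

# Stand-in 4x4x4 ICC map and a small "activated cluster" mask.
rng = np.random.default_rng(42)
icc_map = rng.uniform(-1, 1, size=(4, 4, 4))
roi_mask = np.zeros((4, 4, 4), dtype=bool)
roi_mask[:2, :2, :2] = True

print(roi_reliability(icc_map, roi_mask))
```

Because the summary is a median over many voxels rather than a single ICC, two ROIs can be compared through their full distributions, which is where the gain in power comes from.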

The objective of the present study is to explore four aspects of fMRI reliability using the voxel value distributions of ICCs. First, we address the question of whether voxels of high group activation in a first scan session are likely to preserve subject differentiability in a second session. In other words, we determine to what degree ICC reliability can be derived from the activation strength of a single session. We therefore evaluate the association between the ICC map and the group t-map for the first session. The ICC distribution of voxels within the area of high activation is compared with the distribution across the brain and white matter, which are regions not specifically related to the task. Importantly, this allows us to assess the relative increment of the network reliability that can be attributed to task response and not, for instance, to non-specific contributors to reliability such as high between-subject variance due to normalization error. Second, we ask whether voxels of high ICC but low group t-value (i.e. not consistently activated) can nonetheless have a consistent behavior across sessions. We consequently select the cluster with the highest ICC and suboptimal group t-value, and compute the regression of the second-session time-series on that of the first session, for each individual subject. Third, we define the reliability of specific ROIs by the median of their ICC distributions and compare it with three previous implementations: the ICCmed for the ROI medians (Friedman et al., 2007); the ICCmax at the maximum of group activation (Manoach et al., 2001); and the within-subject (intra-voxel) ICCv (Raemaekers et al., 2007, Specht et al., 2003). The comparisons are carried out for different smoothing kernels and cluster sizes, which are assumed to mostly affect ICCmed and ICCv, respectively.
Finally, we assess the differences in reliability across activated clusters in order to establish which regions discriminate best between subjects. We applied these methods to an auditory target detection task, to examine simple sensory activations, and an n-back task, to examine more complex processing in a commonly used paradigm. We chose tasks that activate very different networks to test the robustness of the method.
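The voxel-wise map underlying these distributions can be sketched by vectorizing the scalar ICC over voxels. This sketch assumes the ICC(3,1) consistency form and synthetic contrast values (Eq. (1) of the paper is not reproduced here, and the shapes are illustrative):

```python
import numpy as np

def icc_map(y):
    """Voxel-wise ICC(3,1) over an n_subjects x k_sessions x n_voxels array
    of contrast values: the two-way ANOVA decomposition applied per voxel."""
    n, k, _ = y.shape
    grand = y.mean(axis=(0, 1))                               # per-voxel grand mean
    ss_subj = k * ((y.mean(axis=1) - grand) ** 2).sum(axis=0)  # between-subject SS
    ss_sess = n * ((y.mean(axis=0) - grand) ** 2).sum(axis=0)  # between-session SS
    ss_err = ((y - grand) ** 2).sum(axis=(0, 1)) - ss_subj - ss_sess
    bms = ss_subj / (n - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems)

# Synthetic contrasts: 10 subjects, 2 sessions, 500 voxels.
rng = np.random.default_rng(0)
contrasts = rng.normal(size=(10, 2, 500))
iccs = icc_map(contrasts)
```

The resulting map can then be masked by a thresholded group t-map or by tissue masks, yielding the ICC distributions compared in the analyses above.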

Section snippets

Subjects

Ten right-handed, healthy, male volunteers, aged 23–37 (mean 28.7, S.D. 4.6), underwent two scanning sessions separated by three months. Participants were screened for DSM-IV axis I and II disorders using the Structured Clinical Interview for DSM-IV (First et al., 1996). Other exclusion criteria were history of neurological disorders, use of prescription or non-prescription medication that may interfere with interpretation of this study and a score of 8 or more on the Beck Depression Inventory.

Brain volume and activation network

An ICC brain map was obtained by applying Eq. (1) to the subject contrast images. The same contrast images, for the first session only, were used to extract a group t-map. A threshold on t-values corresponding to p = 0.001 was used to define the activation networks evoked by the tasks. These maps — illustrated in Fig. 1 for the working memory task — show that, although there are some overlapping regions of high ICC and t-values (e.g. parietal cortex), there are other regions of high ICC but

Discussion

In this paper we have presented a robust implementation of test–retest reliability based on the intra-class correlation coefficient. The method extends the voxel-wise calculation of ICCs, based on the medians of ICC distributions of given regions. This measure allows the assessment of the reliability of the activated network relative to the whole brain volume and white matter. It also enables comparisons of the reliabilities across regions of activation, revealing which activated cluster

Acknowledgments

We thank Professor Michael Brammer for useful discussion and GlaxoSmithKline for financial support in the collection of the data.

References (28)

  • Bland, J., et al. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet (1986)
  • Bland, M. An Introduction to Medical Statistics (2000)
  • Bland, J., et al. Statistics notes: measurement error and correlation coefficients. BMJ (1996)
  • Brett, M., et al. Region of interest analysis using an SPM toolbox. NeuroImage (2002)