
1 Introduction

In many research fields, organizing challenges for international benchmarking has become increasingly common. Since the first MICCAI grand challenge was organized in 2007 [4], the impact of challenges on both the research field and on individual careers has been growing steadily. For example, the acceptance of a journal article today often depends on the performance of a new algorithm being assessed against the state of the art on publicly available challenge datasets. Yet, while the publication of papers in scientific journals and prestigious conferences, such as MICCAI, undergoes strict quality control, the design and organization of challenges do not. Given the discrepancy between challenge impact and quality control, the contributions of this paper can be summarized as follows:

  1. Based on analysis of past MICCAI challenges, we show that current practice is heavily based on trust in challenge organizers and participants.

  2. We experimentally show how “security holes” related to current challenge design and organization can be used to potentially manipulate rankings.

  3. To overcome these problems, we propose best practice recommendations to remove opportunities for cheating.

2 Methods

Analysis of Common Practice: To review common practice in MICCAI challenge design, we systematically captured the publicly available information from publications and websites. Based on the data acquired, we generated descriptive statistics on the ranking schemes and several further aspects of challenge organization, with a particular focus on segmentation challenges.

Experiments on Rank Manipulation: While our analysis demonstrates the great impact of challenges on the field of biomedical image analysis, it also revealed several weaknesses related to challenge design and organization that can potentially be exploited by challenge organizers and participants to manipulate rankings (see Table 2). To experimentally investigate the potential effect of these weaknesses, we designed experiments based on the most common challenge design choices. As detailed in Sect. 3, our comprehensive analysis revealed segmentation as the most common algorithm category, single-metric ranking with mean and metric-based aggregation as the most frequently used ranking scheme, and the Dice similarity coefficient (DSC) as the most commonly used segmentation metric. We thus consider single-metric ranking based on the DSC (aggregate with mean, then rank) as the default ranking scheme for segmentation challenges in this paper. For our analysis, the organizers of the MICCAI 2015 segmentation challenges provided the following data for all tasks (\(n_{tasks}=50\) in total) of their challenges (see footnote 1) that met our inclusion criteria (see footnote 2): for each participating algorithm (\(n_{algo}=445\) in total) and each test case, the values of those metrics \(\in \{\)DSC, HD, HD95\(\}\) (HD: Hausdorff distance; HD95: its 95th percentile variant) that had been part of the original challenge ranking. Note in this context that the DSC and the HD/HD95 were the most frequently used segmentation metrics in 2015. Based on this data, the following three scenarios were analyzed:
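To make the default ranking scheme concrete, the following minimal sketch computes it for a table of per-case DSC values; the column and function names are illustrative and not taken from any challenge evaluation code.

```python
# Minimal sketch of the default ranking scheme: single-metric ranking
# based on the DSC, aggregated with the mean over all test cases,
# then converted to ranks (rank 1 = best).
import pandas as pd

def default_ranking(results: pd.DataFrame) -> pd.Series:
    """`results` is assumed to hold one row per (algorithm, test case)
    pair with columns 'algorithm', 'case' and 'dsc' (hypothetical names)."""
    mean_dsc = results.groupby("algorithm")["dsc"].mean()
    return mean_dsc.rank(ascending=False, method="min").sort_values()

# Toy data: A2 has the highest mean DSC and is therefore ranked first.
toy = pd.DataFrame({
    "algorithm": ["A1", "A1", "A2", "A2", "A3", "A3"],
    "case":      ["c1", "c2", "c1", "c2", "c1", "c2"],
    "dsc":       [0.90, 0.80, 0.85, 0.88, 0.70, 0.95],
})
print(default_ranking(toy))  # A2 -> 1.0, A1 -> 2.0, A3 -> 3.0
```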

Scenario 1: Increasing One’s Rank by Selective Test Case Submission

According to our analysis, only 33% of all MICCAI tasks provide information on missing data handling and penalize missing submissions in some way when determining a challenge ranking (see Sect. 3). However, of the 445 algorithms that participated in the 2015 segmentation tasks we investigated, 17% did not submit results for all test cases. For these algorithms, the mean/maximum proportion of missing values was 16%/73%. In theory, challenge participants could exploit lenient missing data handling by submitting results only for the easiest cases. To investigate this problem in more depth, we used the MICCAI 2015 segmentation challenges with the default ranking scheme to perform the following analysis: For each algorithm and each task of each challenge that met our inclusion criteria (see footnote 2), we artificially removed those test set results (i.e. set the result to N/A) whose DSC was below a threshold of \(t_{DSC} = 0.5\). We assume that these cases could have been identified relatively easily by visual inspection, even without access to the reference annotations. We then compared the new ranking position of the algorithm with the position in the original (default) ranking.
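This simulation can be sketched as follows, assuming the lenient missing-data handling described above (N/A cases are simply ignored when the mean DSC is computed) and the same illustrative data layout as in the previous sketch.

```python
# Sketch of the Scenario 1 simulation: for one algorithm, withhold all
# test cases with DSC < t_DSC (set them to N/A) and recompute the ranking.
# Under lenient missing-data handling, the N/A cases are simply ignored.
import pandas as pd

T_DSC = 0.5  # threshold used in the paper

def rank_after_selective_submission(results: pd.DataFrame, algo: str) -> pd.Series:
    manipulated = results.copy()
    withheld = (manipulated["algorithm"] == algo) & (manipulated["dsc"] < T_DSC)
    manipulated.loc[withheld, "dsc"] = float("nan")   # "do not submit" these cases
    mean_dsc = manipulated.groupby("algorithm")["dsc"].mean()  # NaN values are ignored
    return mean_dsc.rank(ascending=False, method="min")

# Comparison against the unmanipulated default ranking (previous sketch):
# original_rank = default_ranking(results)[algo]
# new_rank      = rank_after_selective_submission(results, algo)[algo]
```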

Scenario 2a: Decreasing a Competitor’s Rank by Changing the Ranking Scheme

According to our analysis of common practice, the ranking scheme is not published in 20% of all challenges. Consulting challenge organizers further revealed that roughly 40% of the organizers did not publish the (complete) ranking scheme before the challenge took place. While there may be good reasons to do so (e.g. organizers want to prevent algorithms from overfitting to a certain assessment method), this practice may – in theory – be exploited by challenge organizers to their own benefit. In this scenario, we explored the hypothetical case where the challenge organizers do not want the winning team, according to the default ranking method, to become the challenge winner (e.g. because the winning team is their main competitor). Based on the MICCAI 2015 segmentation challenges, we performed the following experiment for all tasks that met our inclusion criteria (see footnote 2) and had used both the DSC and the HD/HD95 (leading to \(n=45\) tasks and \(n_{algo}=424\) for Scenarios 2a and 2b): We simulated 12 different rankings (including the default) based on the most commonly applied metrics (DSC, HD, HD95), rank aggregation methods (rank then aggregate vs. aggregate then rank) and aggregation operators (mean vs. median). We then used Kendall’s tau correlation coefficient [6] to compare the 11 non-default rankings with the original (default) ranking. Furthermore, we computed the maximal change in the ranking over all ranking variations for the winners of the default ranking and for the non-winning algorithms.
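The ranking variations and their comparison with the default ranking can be sketched as below. The per-case metric table, its column names and the higher/lower-is-better convention per metric are assumptions made for illustration; Kendall’s tau is taken from scipy.stats.

```python
# Sketch of the Scenario 2a ranking variations: every combination of
# metric (DSC, HD, HD95), aggregation method (aggregate then rank vs.
# rank then aggregate) and aggregation operator (mean vs. median).
from itertools import product

import pandas as pd
from scipy.stats import kendalltau

def variant_ranking(results: pd.DataFrame, metric: str, scheme: str, op: str) -> pd.Series:
    """`results` holds columns 'algorithm', 'case' and one column per metric.
    DSC: higher is better; HD/HD95: lower is better."""
    ascending = metric != "dsc"
    if scheme == "metric":  # metric-based: aggregate metric values, then rank
        agg = results.groupby("algorithm")[metric].agg(op)
        return agg.rank(ascending=ascending, method="min")
    # case-based: rank per test case, then aggregate the per-case ranks
    per_case = results.pivot(index="case", columns="algorithm", values=metric)
    case_ranks = per_case.rank(axis=1, ascending=ascending, method="min")
    return case_ranks.agg(op).rank(method="min")

def compare_with_default(results: pd.DataFrame, default: pd.Series) -> None:
    for metric, scheme, op in product(["dsc", "hd", "hd95"],
                                      ["metric", "case"],
                                      ["mean", "median"]):
        variant = variant_ranking(results, metric, scheme, op)
        tau, _ = kendalltau(default.loc[variant.index], variant)
        print(f"{metric}/{scheme}/{op}: tau = {tau:.3f}")
```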

Scenario 2b: Decreasing a Competitor’s Rank by Changing the Aggregation Method

As a variant of Scenario 2a, we assume that the organizers published the metric(s) they want to use before the challenge, but not the way they want to aggregate metric values. For the three metrics DSC, HD and HD95, we thus varied only the rank aggregation method and the aggregation operator while keeping the metric fixed. The analysis was then performed in analogy to that of Scenario 2a.
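Under the same assumptions, Scenario 2b simply restricts the loop to a fixed metric, reusing the hypothetical variant_ranking helper from the previous sketch.

```python
# Sketch of Scenario 2b: keep the metric fixed and vary only the
# aggregation method and the aggregation operator (three variants
# besides the default per metric). Reuses variant_ranking from above.
from itertools import product

def scenario_2b_rankings(results, metric):
    for scheme, op in product(["metric", "case"], ["mean", "median"]):
        yield (scheme, op), variant_ranking(results, metric, scheme, op)
```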

3 Results

Between 2007 and 2016, a total of 75 grand challenges comprising 275 tasks were hosted by MICCAI. 60% of these challenges published their results in journals or conference proceedings. The median number of citations (in May 2018) was 46 (max: 626). Most challenges (48; 64%) and tasks (222; 81%) dealt with segmentation as the algorithm category. The computation of the ranking in segmentation competitions was highly heterogeneous. Overall, 34 different metrics were proposed for segmentation challenges (see Table 1), 38% of which were applied in only a single task. The DSC (75%) was the most commonly used metric, and metric values were typically aggregated with the mean (59%) rather than with the median (3%) (no information: 39%). When a final ranking was provided (49%), it was based on one of the following schemes:

  • Metric-based aggregation (76%): Initially, a rank for each metric and algorithm is computed by aggregating metric values over all test cases. If multiple metrics are used (56% of all tasks), the final rank is then determined by aggregating metric ranks.

  • Case-based aggregation (2%): Initially, a rank for each test case and algorithm is computed for one or multiple metrics. The final rank is determined by aggregating test case ranks (a toy contrast of the two aggregation schemes is sketched after this list).

  • Other (2%): Highly individualized ranking scheme (e.g. [2]).

  • No information provided (20%)
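The difference between metric-based and case-based aggregation can flip a winner even on trivially small data, as the following toy example with hypothetical DSC values illustrates.

```python
# Toy illustration (hypothetical DSC values): the same results can yield
# different winners under metric-based vs. case-based aggregation.
import pandas as pd

dsc = pd.DataFrame(
    {"A1": [0.95, 0.40, 0.90], "A2": [0.80, 0.70, 0.85]},
    index=["case1", "case2", "case3"],
)

# Metric-based: mean DSC per algorithm, then rank -> A2 wins (0.78 vs. 0.75)
metric_based = dsc.mean().rank(ascending=False, method="min")

# Case-based: rank per test case, then mean rank -> A1 wins (1.33 vs. 1.67)
case_based = dsc.rank(axis=1, ascending=False, method="min").mean().rank(method="min")

print(metric_based)
print(case_based)
```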

As detailed in Table 2, our analysis further revealed several weaknesses of current challenge design and organization that could potentially be exploited for rank manipulation. Consequences of this practice have been investigated in our experiments on rank manipulation:

Scenario 1: Our re-evaluation of all MICCAI 2015 segmentation challenges revealed that 25% of all 396 non-winning algorithms would have been ranked first if they had systematically withheld their worst results. In 8% of the 50 tasks investigated, every single participating algorithm (including the one ranked last) could have been ranked first if it had selectively submitted results. Note that a threshold of \(t_{DSC} = 0.5\) corresponds to a median of 25% of test cases being set to N/A. Even when only the worst 5% of results were left out, 11% of all non-winning algorithms would still have been ranked first.

Scenario 2a: As illustrated in Fig. 1, the ranking depends crucially on the metric(s), the rank aggregation method and the aggregation operator. In 93% of the tasks, it was possible to change the winner by changing one or more of these parameters. On average, the winner according to the default ranking was ranked first in only 28% of the ranking variations. In two cases, the first place dropped to rank 11. 16% of all 379 originally non-winning algorithms became the winner in at least one ranking scheme.

Fig. 1. Effect of different ranking schemes (RS) applied to one example MICCAI 2015 segmentation task. Design choices are indicated in the gray header: RS xy defines the different ranking schemes. The following three rows indicate the metric used \(\in \{\)DSC, HD, HD95\(\}\), the aggregation method based on \(\{\)Metric, Cases\(\}\) and the aggregation operator \(\in \{\)Mean, Median\(\}\). RS 00 (single-metric ranking with DSC; aggregate with mean, then rank) is considered the default ranking scheme. For each RS, the resulting ranking is shown for algorithms A1 to A13. To illustrate the effect of different RS on single algorithms, A1, A6 and A11 are highlighted.

Scenario 2b: When assuming a fixed metric (DSC/HD/HD95) and only changing the rank aggregation method and/or the aggregation operator (three ranking variations), the winner remained stable in 67% (DSC), 24% (HD) and 31% (HD95) of the experiments. In this scenario, 7% (DSC), 13% (HD) and 7% (HD95) of all 379 originally non-winning algorithms became the winner in at least one ranking scheme. To overcome the problems related to potential cheating, we compiled several best practice recommendations, as detailed in Table 2.

Table 1. Metrics used by MICCAI segmentation tasks between 2007 and 2016.
Table 2. Weaknesses of current challenge design and organization that can potentially be exploited by challenge organizers and participants along with best practice recommendations to address existing issues.

4 Discussion

To our knowledge, we are the first to investigate common practice and weaknesses related to MICCAI challenge design and organization. According to our experiments, a number of ranking design choices (metrics, aggregation method, missing data handling) have a substantial influence on the ranking. Further, the instability of the rankings, combined with common reporting and organization practice, can – in theory – be exploited by both challenge participants and organizers to manipulate rankings. Our analysis also revealed that the design and organization of MICCAI challenges are highly heterogeneous and that a lot of relevant information is commonly not reported. While initial valuable steps towards more quality control of MICCAI challenges have since been taken, these initiatives have so far focused on the selection of challenge proposals, and no quality control process has been put in place to monitor the implementation of the proposed design.

A weakness of our experimental analysis could be seen in the fact that we simulated the removal of test case results by applying a threshold to the DSC values derived from the known reference annotations rather than by performing a visual inspection. Yet, we strongly believe that poorly performing cases with a DSC below 0.5 would also have been identified visually; our approach, in turn, ensured an objective, scalable and reproducible process. Note that an analogous investigation with the HD/HD95 as metric would not have been reasonable, as a suitable threshold would strongly depend on the task and the images. Secondly, it is worth mentioning that, instead of applying the different ranking scheme variations actually used in the individual challenges, we focused on the most commonly used ranking scheme in order to perform a statistical analysis that enables a valid comparison across challenges. Given that all rankings of the challenges investigated are based on the DSC as metric, we consider this procedure valid. Finally, it could be argued that our work is of limited practical value, as challenge organizers and participants are generally fair. While this may hold true for the majority, we expect every “security hole” to be exploited sooner or later [5]. Furthermore, our study not only investigates the effect of challenge weaknesses in the context of cheating but also demonstrates, for the first time, the instability of challenge rankings.

In conclusion, we believe that the insights of this study along with the best practice recommendations provided should be carefully considered in future MICCAI challenges. A key message from this paper is to make the challenge design, organization and results as transparent as possible.