
1 Introduction

In many research fields, organizing challenges for international benchmarking has become increasingly common. Since the first MICCAI grand challenge was organized in 2007 [4], the impact of challenges on both the research field and on individual careers has been growing steadily. For example, the acceptance of a journal article today often depends on the performance of a new algorithm being assessed against the state of the art on publicly available challenge datasets. Yet, while the publication of papers in scientific journals and prestigious conferences, such as MICCAI, undergoes strict quality control, the design and organization of challenges do not. Given the discrepancy between challenge impact and quality control, the contributions of this paper can be summarized as follows:

  1. Based on analysis of past MICCAI challenges, we show that current practice is heavily based on trust in challenge organizers and participants.

  2. We experimentally show how “security holes” related to current challenge design and organization can be used to potentially manipulate rankings.

  3. To overcome these problems, we propose best practice recommendations to remove opportunities for cheating.

2 Methods

Analysis of Common Practice: To review common practice in MICCAI challenge design, we systematically captured the publicly available information from publications and websites. Based on the data acquired, we generated descriptive statistics on the ranking schemes and several further aspects of challenge organization, with a particular focus on segmentation challenges.

Experiments on Rank Manipulation: While our analysis demonstrates the great impact of challenges on the field of biomedical image analysis, it also revealed several weaknesses related to challenge design and organization that can potentially be exploited by challenge organizers and participants to manipulate rankings (see Table 2). To experimentally investigate the potential effect of these weaknesses, we designed experiments based on the most common challenge design choices. As detailed in Sect. 3, our comprehensive analysis revealed segmentation as the most common algorithm category, single-metric ranking with mean and metric-based aggregation as the most frequently used ranking scheme, and the Dice similarity coefficient (DSC) as the most commonly used segmentation metric. We thus consider single-metric ranking based on the DSC (aggregate with mean, then rank) as the default ranking scheme for segmentation challenges in this paper. For our analysis, the organizers of the MICCAI 2015 segmentation challenges provided the following data for all tasks (\(n_{tasks}=50\) in total) of their challenges (see footnote 1) that met our inclusion criteria (see footnote 2): for each participating algorithm (\(n_{algo}=445\) in total) and each test case, the values of those metrics \(\in \{\)DSC, HD, HD95\(\}\) (HD: Hausdorff distance; HD95: its 95th percentile variant) that had been part of the original challenge ranking. Note in this context that the DSC and the HD/HD95 were the most frequently used segmentation metrics in 2015. Based on this data, the following three scenarios were analyzed:
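To make the default ranking scheme concrete, the following minimal sketch computes it for a table of per-case DSC values; the column and function names are illustrative and not taken from any challenge evaluation code.

```python
# Minimal sketch of the default ranking scheme: single-metric ranking
# based on the DSC, aggregated with the mean over all test cases,
# then converted to ranks (rank 1 = best).
import pandas as pd

def default_ranking(results: pd.DataFrame) -> pd.Series:
    """`results` is assumed to hold one row per (algorithm, test case)
    pair with columns 'algorithm', 'case' and 'dsc' (hypothetical names)."""
    mean_dsc = results.groupby("algorithm")["dsc"].mean()
    return mean_dsc.rank(ascending=False, method="min").sort_values()

# Toy data: A2 has the highest mean DSC and is therefore ranked first.
toy = pd.DataFrame({
    "algorithm": ["A1", "A1", "A2", "A2", "A3", "A3"],
    "case":      ["c1", "c2", "c1", "c2", "c1", "c2"],
    "dsc":       [0.90, 0.80, 0.85, 0.88, 0.70, 0.95],
})
print(default_ranking(toy))  # A2 -> 1.0, A1 -> 2.0, A3 -> 3.0
```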

Scenario 1: Increasing One’s Rank by Selective Test Case Submission

According to our analysis, only 33% of all MICCAI tasks provide information on missing data handling and penalize missing submissions in some way when determining a challenge ranking (see Sect. 3). However, of the 445 algorithms that participated in the 2015 segmentation tasks we investigated, 17% did not submit results for all test cases. For these algorithms, the mean/maximum proportion of missing values was 16%/73%. In theory, challenge participants could exploit lenient missing data handling by submitting results only for the easiest cases. To investigate this problem in more depth, we used the MICCAI 2015 segmentation challenges with the default ranking scheme to perform the following analysis: For each algorithm and each task of each challenge that met our inclusion criteria (see footnote 2), we artificially removed those test set results (i.e. set the result to N/A) whose DSC was below a threshold of \(t_{DSC} = 0.5\). We assume that these cases could have been identified relatively easily by visual inspection, even without access to the reference annotations. We then compared the new ranking position of the algorithm with the position in the original (default) ranking.
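This simulation can be sketched as follows, assuming the lenient missing-data handling described above (N/A cases are simply ignored when the mean DSC is computed) and the same illustrative data layout as in the previous sketch.

```python
# Sketch of the Scenario 1 simulation: for one algorithm, withhold all
# test cases with DSC < t_DSC (set them to N/A) and recompute the ranking.
# Under lenient missing-data handling, the N/A cases are simply ignored.
import pandas as pd

T_DSC = 0.5  # threshold used in the paper

def rank_after_selective_submission(results: pd.DataFrame, algo: str) -> pd.Series:
    manipulated = results.copy()
    withheld = (manipulated["algorithm"] == algo) & (manipulated["dsc"] < T_DSC)
    manipulated.loc[withheld, "dsc"] = float("nan")   # "do not submit" these cases
    mean_dsc = manipulated.groupby("algorithm")["dsc"].mean()  # NaN values are ignored
    return mean_dsc.rank(ascending=False, method="min")

# Comparison against the unmanipulated default ranking (previous sketch):
# original_rank = default_ranking(results)[algo]
# new_rank      = rank_after_selective_submission(results, algo)[algo]
```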

Scenario 2a: Decreasing a Competitor’s Rank by Changing the Ranking Scheme

According to our analysis of common practice, the ranking scheme is not published in 20% of all challenges. Consulting challenge organizers further revealed that roughly 40% of the organizers did not publish the (complete) ranking scheme before the challenge took place. While there may be good reasons to do so (e.g. organizers want to prevent algorithms from overfitting to a certain assessment method), this practice may – in theory – be exploited by challenge organizers to their own benefit. In this scenario, we explored the hypothetical case where the challenge organizers do not want the winning team, according to the default ranking method, to become the challenge winner (e.g. because the winning team is their main competitor). Based on the MICCAI 2015 segmentation challenges, we performed the following experiment for all tasks that met our inclusion criteria (see footnote 2) and had used both the DSC and the HD/HD95 (leading to \(n=45\) tasks and \(n_{algo}=424\) for Scenarios 2a and 2b): We simulated 12 different rankings (including the default) based on the most commonly applied metrics (DSC, HD, HD95), rank aggregation methods (rank then aggregate vs. aggregate then rank) and aggregation operators (mean vs. median). We then used Kendall’s tau correlation coefficient [6] to compare the 11 non-default rankings with the original (default) ranking. Furthermore, we computed the maximal change in the ranking over all ranking variations for the winners of the default ranking and for the non-winning algorithms.
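The ranking variations and their comparison with the default ranking can be sketched as below. The per-case metric table, its column names and the higher/lower-is-better convention per metric are assumptions made for illustration; Kendall’s tau is taken from scipy.stats.

```python
# Sketch of the Scenario 2a ranking variations: every combination of
# metric (DSC, HD, HD95), aggregation method (aggregate then rank vs.
# rank then aggregate) and aggregation operator (mean vs. median).
from itertools import product

import pandas as pd
from scipy.stats import kendalltau

def variant_ranking(results: pd.DataFrame, metric: str, scheme: str, op: str) -> pd.Series:
    """`results` holds columns 'algorithm', 'case' and one column per metric.
    DSC: higher is better; HD/HD95: lower is better."""
    ascending = metric != "dsc"
    if scheme == "metric":  # metric-based: aggregate metric values, then rank
        agg = results.groupby("algorithm")[metric].agg(op)
        return agg.rank(ascending=ascending, method="min")
    # case-based: rank per test case, then aggregate the per-case ranks
    per_case = results.pivot(index="case", columns="algorithm", values=metric)
    case_ranks = per_case.rank(axis=1, ascending=ascending, method="min")
    return case_ranks.agg(op).rank(method="min")

def compare_with_default(results: pd.DataFrame, default: pd.Series) -> None:
    for metric, scheme, op in product(["dsc", "hd", "hd95"],
                                      ["metric", "case"],
                                      ["mean", "median"]):
        variant = variant_ranking(results, metric, scheme, op)
        tau, _ = kendalltau(default.loc[variant.index], variant)
        print(f"{metric}/{scheme}/{op}: tau = {tau:.3f}")
```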

Scenario 2b: Decreasing a Competitor’s Rank by Changing the Aggregation Method

As a variant of Scenario 2a, we assume that the organizers published the metric(s) they want to use before the challenge, but not the way they want to aggregate metric values. For the three metrics DSC, HD and HD95, we thus varied only the rank aggregation method and the aggregation operator while keeping the metric fixed. The analysis was then performed in analogy to that of Scenario 2a.
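Under the same assumptions, Scenario 2b simply restricts the loop to a fixed metric, reusing the hypothetical variant_ranking helper from the previous sketch.

```python
# Sketch of Scenario 2b: keep the metric fixed and vary only the
# aggregation method and the aggregation operator (three variants
# besides the default per metric). Reuses variant_ranking from above.
from itertools import product

def scenario_2b_rankings(results, metric):
    for scheme, op in product(["metric", "case"], ["mean", "median"]):
        yield (scheme, op), variant_ranking(results, metric, scheme, op)
```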

3 Results

Between 2007 and 2016, a total of 75 grand challenges comprising 275 tasks were hosted by MICCAI. 60% of these challenges published their results in journals or conference proceedings. The median number of citations (in May 2018) was 46 (max: 626). Most challenges (48; 64%) and tasks (222; 81%) dealt with segmentation as the algorithm category. The computation of the ranking in segmentation competitions was highly heterogeneous. Overall, 34 different metrics were proposed for segmentation challenges (see Table 1), 38% of which were applied in only a single task. The DSC (75%) was the most commonly used metric, and metric values were typically aggregated with the mean (59%) rather than with the median (3%) (no information: 39%). When a final ranking was provided (49%), it was based on one of the following schemes:

  • Metric-based aggregation (76%): Initially, a rank for each metric and algorithm is computed by aggregating metric values over all test cases. If multiple metrics are used (56% of all tasks), the final rank is then determined by aggregating metric ranks.

  • Case-based aggregation (2%): Initially, a rank for each test case and algorithm is computed for one or multiple metrics. The final rank is determined by aggregating test case ranks (a toy contrast of the two aggregation schemes is sketched after this list).

  • Other (2%): Highly individualized ranking scheme (e.g. [2]).

  • No information provided (20%)
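The difference between metric-based and case-based aggregation can flip a winner even on trivially small data, as the following toy example with hypothetical DSC values illustrates.

```python
# Toy illustration (hypothetical DSC values): the same results can yield
# different winners under metric-based vs. case-based aggregation.
import pandas as pd

dsc = pd.DataFrame(
    {"A1": [0.95, 0.40, 0.90], "A2": [0.80, 0.70, 0.85]},
    index=["case1", "case2", "case3"],
)

# Metric-based: mean DSC per algorithm, then rank -> A2 wins (0.78 vs. 0.75)
metric_based = dsc.mean().rank(ascending=False, method="min")

# Case-based: rank per test case, then mean rank -> A1 wins (1.33 vs. 1.67)
case_based = dsc.rank(axis=1, ascending=False, method="min").mean().rank(method="min")

print(metric_based)
print(case_based)
```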

As detailed in Table 2, our analysis further revealed several weaknesses of current challenge design and organization that could potentially be exploited for rank manipulation. Consequences of this practice have been investigated in our experiments on rank manipulation:

Scenario 1: Our re-evaluation of all MICCAI 2015 segmentation challenges revealed that 25% of all 396 non-winning algorithms would have been ranked first if they had systematically withheld their worst results. In 8% of the 50 tasks investigated, every single participating algorithm (including the one ranked last) could have been ranked first if it had selectively submitted results. Note that a threshold of \(t_{DSC} = 0.5\) corresponds to a median of 25% of test cases being set to N/A. Even when only the worst 5% of results were left out, 11% of all non-winning algorithms would still have been ranked first.

Scenario 2a: As illustrated in Fig. 1, the ranking depends crucially on the metric(s), the rank aggregation method and the aggregation operator. In 93% of the tasks, it was possible to change the winner by changing one or more of these parameters. On average, the winner according to the default ranking was ranked first in only 28% of the ranking variations. In two cases, the first place dropped to rank 11. 16% of all 379 originally non-winning algorithms became the winner in at least one ranking scheme.

Fig. 1. Effect of different ranking schemes (RS) applied to one example MICCAI 2015 segmentation task. Design choices are indicated in the gray header: RS xy defines the different ranking schemes. The following three rows indicate the metric used \(\in \{\)DSC, HD, HD95\(\}\), the aggregation method based on \(\{\)Metric, Cases\(\}\) and the aggregation operator \(\in \{\)Mean, Median\(\}\). RS 00 (single-metric ranking with DSC; aggregate with mean, then rank) is considered the default ranking scheme. For each RS, the resulting ranking is shown for algorithms A1 to A13. To illustrate the effect of different RS on single algorithms, A1, A6 and A11 are highlighted.

Scenario 2b: When assuming a fixed metric (DSC/HD/HD95) and only changing the rank aggregation method and/or the aggregation operator (three ranking variations), the winner remained stable in 67% (DSC), 24% (HD) and 31% (HD95) of the experiments. In this scenario, 7% (DSC), 13% (HD) and 7% (HD95) of all 379 originally non-winning algorithms became the winner in at least one ranking scheme. To overcome the problems related to potential cheating, we compiled several best practice recommendations, as detailed in Table 2.

Table 1. Metrics used by MICCAI segmentation tasks between 2007 and 2016.
Table 2. Weaknesses of current challenge design and organization that can potentially be exploited by challenge organizers and participants along with best practice recommendations to address existing issues.

4 Discussion

To our knowledge, we are the first to investigate common practice and weaknesses related to MICCAI challenge design and organization. According to our experiments, a number of ranking design choices (metrics, aggregation method, missing data handling) have a substantial influence on the ranking. Further, the instability of the rankings, combined with common reporting and organization practice, can – in theory – be exploited by both challenge participants and organizers to manipulate rankings. Our analysis also revealed that the design and organization of MICCAI challenges are highly heterogeneous and that a lot of relevant information is commonly not reported. While initial valuable steps towards more quality control of MICCAI challenges have since been taken, these initiatives have so far focused on the selection of challenge proposals, and no quality control process has been put in place to monitor the implementation of the proposed design.

A weakness of our experimental analysis could be seen in the fact that we simulated the removal of test case results by applying a threshold to the DSC values derived from the known reference annotations rather than by performing a visual inspection. Yet, we strongly believe that poorly performing cases with a DSC below 0.5 would also have been identified visually; our approach, in turn, ensured an objective, scalable and reproducible process. Note that an analogous investigation with the HD/HD95 as metric would not have been reasonable, as a suitable threshold would strongly depend on the task and the images. Secondly, it is worth mentioning that, instead of applying the different ranking scheme variations actually used in the individual challenges, we focused on the most commonly used ranking scheme in order to perform a statistical analysis that enables a valid comparison across challenges. Given that all rankings of the challenges investigated are based on the DSC as metric, we consider this procedure valid. Finally, it could be argued that our work is of limited practical value, as challenge organizers and participants are generally fair. While this may hold true for the majority, we expect every “security hole” to be exploited sooner or later [5]. Furthermore, our study not only investigates the effect of challenge weaknesses in the context of cheating but also demonstrates, for the first time, the instability of challenge rankings.

In conclusion, we believe that the insights of this study along with the best practice recommendations provided should be carefully considered in future MICCAI challenges. A key message from this paper is to make the challenge design, organization and results as transparent as possible.