
1 Introduction: Research Context

Radiation therapy (RT) is a cornerstone of cancer treatment for a wide variety of malignancies, and head and neck cancer (HNC) is among the diseases for which it plays a particularly central role. Recent years have seen increasing interest in MRI-guided RT planning. Compared with traditional computed tomography (CT)-based RT planning, MRI-guided approaches afford superior soft tissue contrast, allow for functional imaging through multiparametric sequences (e.g., diffusion-weighted imaging), and permit daily adaptive RT through intra-therapy imaging using MRI-Linac devices [1]. Consequently, improved treatment planning through MRI-guided adaptive RT approaches would help further maximize tumor destruction while minimizing side effects in HNC [2, 3]. Given this great potential, it is anticipated that MRI-guided adaptive RT planning technologies will transform clinical practice paradigms for HNC [4].

The extensive data volume involved in MRI-guided HNC RT planning, particularly in adaptive settings, makes manual tumor segmentation (also referred to as contouring) by physicians—the current clinical standard—often impractical due to time constraints [5]. This is compounded by the fact that HNC tumors are among the most challenging structures for clinicians to segment [6]. Artificial intelligence (AI) approaches that leverage RT data to improve patient treatment have attracted intense interest from the research community in recent years. Deep learning (DL) in particular has enabled significant strides in HNC tumor auto-segmentation [7]. These innovations have largely been spurred by public data science challenges such as the HECKTOR Challenge [8] and the SegRap Challenge [9]. However, to date, no large publicly available AI-ready adaptive RT HNC dataset exists. Community-driven AI innovation could therefore be a remarkable asset in developing technologies for the clinical translation of MRI-guided RT.

In this public data science challenge—The Head and Neck Tumor Segmentation for MR-Guided Applications 2024 Challenge (HNTS-MRG 2024, pronounced “hunts”-“merge”)—we focus on the segmentation of HNC tumors for MRI-guided adaptive RT applications. The challenge is composed of two tasks focused on automated segmentation of tumor volumes on (1) pre-RT and (2) mid-RT MRI scans. An overview of HNTS-MRG 2024 is shown in Fig. 1.

Fig. 1.
Timeline: challenge launched on grand-challenge.org (April 2024); training data for 150 cases released (June 2024); Docker-based test phase with 50 cases opened (August 2024); awards and post-challenge wrap-up (October 2024).

General overview of the HNTS-MRG 2024 data science challenge. Two tasks focusing on pre-RT (Task 1) and mid-RT (Task 2) tumor segmentation using MRI scans were evaluated. Training data from 150 patients were publicly released, followed by an internal evaluation of algorithms on 50 final test patients. Subsequently, a post-challenge virtual wrap-up session was held where winners were publicly announced.

2 Dataset and Challenge Details

2.1 Mission of the Challenge

Biomedical Application

This data science challenge followed the Biomedical Image Analysis Challenges (BIAS) statement reporting guidelines by Maier-Hein et al. [10] and was accepted as a satellite event for the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). The algorithms submitted by participating teams were primarily designed for three main target applications: diagnosis, treatment planning, and medical research. These algorithms focused on two core tasks within medical imaging: segmentation and detection. Specifically, they were designed to analyze MRI images and classify/annotate individual voxels into three distinct categorical labels: primary gross tumor volume (GTVp), metastatic lymph node gross tumor volume (GTVn), or background tissue.

Cohorts

Following the BIAS guidelines, we define a target cohort (i.e., subjects from whom the data would be acquired in the final biomedical application) and a challenge cohort (i.e., subjects from whom challenge data were acquired). The target cohort would consist of patients with squamous cell HNC who are referred to RT planning clinics. For these patients, the automated segmentation algorithms could potentially be used directly in RT planning. The challenge cohort consisted of patients with a confirmed histological diagnosis of squamous cell HNC who had undergone RT at our institution. This patient cohort primarily consisted of individuals with oropharyngeal cancer (OPC) or cancer of unknown primary (CUP). We included CUP patients for two key reasons. First, these cases are often undetected OPC [11]. Second, our dataset included patients whose primary tumors had achieved complete response by mid-therapy, resulting in images where only residual mid-RT lymph nodes remained visible. This scenario closely resembles the presentation of CUP cases, making their inclusion valuable for training algorithms to detect and segment metastatic nodes across a spectrum of clinical presentations. Section 2.3 describes further details of the challenge cohort dataset.

Target Entity

Data Origin

All images used in this study were acquired from the head and neck region of HNC patients. However, the exact area captured in each scan (i.e., field of view) varied somewhat between images. While some scans extended inferiorly to include parts of the lungs or superiorly to include the top of the skull, all scans consistently captured at least the area from the clavicles up to the oropharyngeal region. To make the target regions more uniform across all images, we applied a cropping technique. The specific details of this cropping method are explained in Sect. 2.3.

Algorithm Target

The structures of interest for this study were the GTVp and GTVn, which are conventionally segmented by physicians for HNC RT planning. They represent the position and extent of gross tumor at the primary site and in metastatic lymph nodes visible on medical imaging [12, 13].

Task Definition

The challenge consisted of two tasks, pre-RT segmentation (Task 1) and mid-RT segmentation (Task 2).

Task 1 required participants to predict GTVp and GTVn tumor segmentations on unseen pre-RT scans without annotations. This task is analogous to conventional tumor segmentation challenges, such as Task 1 of the 2022 HECKTOR Challenge [14] and Task 2 of the 2023 SegRap Challenge [9]. Participants were free to use mid-RT data for training their pre-RT auto-segmentation algorithms if desired.

Task 2 simulated a real-world adaptive RT scenario, providing an unseen mid-RT image alongside a pre-RT image with its corresponding pre-RT segmentation. Registered and original versions of the pre-RT data were provided during model inference (more details in Sect. 2.3). The goal was to predict GTVp and GTVn segmentations on the new mid-RT images. This task is somewhat analogous to previous challenges that utilized multiple image inputs, such as the 2023 SegRap Challenge (non-contrast CT + contrast CT) [9] and the 2023 HaN-Seg Challenge (CT + MRI) [15]. To our knowledge, no previous challenge utilized patient-specific multi-timepoint MRI for segmentation purposes, making this aspect of our challenge particularly unique. Participants were free to use any combination of input images/masks to develop their mid-RT auto-segmentation algorithms.

To foster innovation while maintaining fairness, we allowed participants to leverage pre-trained model weights, foundation models, and additional external data to augment their training. However, we stipulated that all such resources must be publicly accessible and properly cited in the participants’ paper submissions. This approach encouraged the use of state-of-the-art techniques while ensuring transparency and reproducibility in the challenge.

2.2 Online Hosting of the Challenge

HNTS-MRG 2024 was hosted on grand-challenge.org, an open-source platform that has become a de facto standard for online biomedical image analysis competitions. This platform offers essential features for running online data challenges, including an application programming interface, user management, a discussion forum, support for multi-phase competitions with separate leaderboards, and an online image results viewer, among other functionalities. Moreover, the platform utilizes Docker frameworks [16] for containerized algorithm code submissions and automated algorithm evaluation. The online webpage for HNTS-MRG 2024 was launched in April 2024, offering participants a comprehensive environment to engage in this challenge [17].

2.3 Challenge Cohort Dataset

Institutional Review Board

Ethics approval was obtained from the University of Texas MD Anderson Cancer Center Institutional Review Board with protocol number RCR03-0800. This is a retrospective data collection protocol with a waiver of informed consent.

Data Source

All data for this study were collected from a single institution: The University of Texas MD Anderson Cancer Center. T2-weighted (T2w) anatomical MRI sequences were the focus of our challenge due to their ubiquity and importance in MRI-based HNC segmentation for RT [18]. Raw T2w images in Digital Imaging and Communications in Medicine (DICOM) format were automatically extracted from a centralized institutional imaging repository (Evercore). Notably, T2w images were a mix of fat-suppressed and non-fat-suppressed images. Images include pre-RT (0–3 weeks before the start of RT) and mid-RT (2–4 weeks intra-RT) scans. No exogenous contrast enhancement agents were used for these scans. All patients were immobilized using a thermoplastic mask to aid in consistent anatomical positioning. Pre-RT and mid-RT image pairs for a given patient were consistently either fat-suppressed or non-fat-suppressed. In total, data from 202 squamous cell HNC patients were curated. T2w images of the head and neck region were acquired using a range of imaging devices and protocols: 1.5T Siemens Aera (n = 297), 1.5T Elekta Unity (n = 78), 3T Siemens Magnetom Vida (n = 19), and 1.5T Siemens Magnetom Sola Fit (n = 10). A full list of imaging protocols is provided in Table 1. Examples of T2w images for two patients are shown in Fig. 2.

Table 1. Magnetic resonance imaging acquisition parameters for this study. Median values with ranges shown. Values are calculated across the entire dataset for all timepoints (pre- and mid-radiotherapy).
Fig. 2.

Comparison of T2-weighted (T2w) MRI scans before radiotherapy (pre-RT) and during radiotherapy (mid-RT), showing images without fat suppression (T2w non fat suppressed [NFS], top row) and with fat suppression (T2w fat suppressed [FS], bottom row). Pre-RT scans are co-registered to the corresponding mid-RT scans.

Annotation Characteristics

Each MRI scan was annotated for GTVp (maximum one per patient, potentially zero) and GTVn (variable number per patient, potentially zero). For each pre-RT and mid-RT case, a team of 3 to 4 expert physicians independently segmented these structures. This approach aligns with recent findings from our group suggesting that a minimum of 3 annotators is necessary to produce acceptable segmentations of these structures [19, 20] when their annotations are combined using the simultaneous truth and performance level estimation (STAPLE) algorithm [21].

In total, 13 unique annotators independently contributed segmentation annotations to this study. All annotators were medical doctors with at least two years of experience in head and neck cancer segmentation. All annotators had access to patient medical histories and any previous relevant imaging (e.g., diagnostic positron emission tomography (PET)/CT imaging) via the patient’s chart. Annotators were instructed to segment targets as they would normally in their clinical workflows. For mid-RT segmentations, annotators were permitted to use their registered pre-RT segmentations as a reference if desired. Segmentations were generated in Velocity AI (v.3.0.1; Varian Medical Systems; Palo Alto, CA, USA) and RayStation (v.11; RaySearch Laboratories, Stockholm, Sweden) using American Association of Physicists in Medicine Task Group 263 nomenclature [22]. A senior radiation oncology faculty member with over 15 years of experience (C.D.F.) performed final quality verification of the segmentations, instructing annotators to modify certain segmentations if needed (e.g., in the case of a missed lymph node).

The STAPLE algorithm implementation in SimpleITK [23] was used to combine individual segmentations into a final ground truth (also referred to as reference standard) segmentation for each case (Fig. 3). In exceptional cases where significant discrepancies arose among annotators—such as disagreements over multiple nodal volumes or conflicting assessments of complete versus non-complete response—we deferred to the expert judgment of the senior faculty member (C.D.F.) for generating the final segmentation. The resulting ground truth label mask uses three values: 0 for background, 1 for GTVp, and 2 for GTVn (with multiple lymph nodes consolidated into a single label). An example of an image with the aforementioned labeling scheme is shown in Fig. 4.
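For illustration, the following is a minimal sketch of this consensus step using SimpleITK's STAPLE filter. File names and the 0.5 probability threshold are assumptions for this example; in practice each label (GTVp, GTVn) was handled separately, and exceptional cases were adjudicated as described above.

```python
import SimpleITK as sitk

# Minimal sketch: combine binary annotator masks for one structure with
# SimpleITK's STAPLE filter. STAPLE outputs a per-voxel foreground
# probability map, which we threshold (0.5 is an assumed cut-off).
masks = [sitk.ReadImage(f, sitk.sitkUInt8)
         for f in ("obs1_gtvp.nii.gz", "obs2_gtvp.nii.gz", "obs3_gtvp.nii.gz")]

staple = sitk.STAPLEImageFilter()
staple.SetForegroundValue(1)
probability = staple.Execute(masks)

consensus = sitk.BinaryThreshold(probability, lowerThreshold=0.5,
                                 upperThreshold=1.0, insideValue=1,
                                 outsideValue=0)
sitk.WriteImage(consensus, "consensus_gtvp.nii.gz")
```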

Fig. 3.

Example of the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm consensus process combining multiple independent annotator segmentations (red, yellow, blue, purple outlined structures) into a single final consensus segmentation (green filled in structure) for a primary gross tumor volume. (Color figure online)

Fig. 4.

A visual example of the mask labeling scheme for this challenge. Background = 0, primary gross tumor volume (GTVp) = 1 (green overlay), metastatic lymph node (GTVn) = 2 (yellow overlay). Masks shown are consensus segmentations from multiple independent annotators. Visualization performed in 3D Slicer. (Color figure online)

Data Preprocessing Methods

Anonymized DICOM files (MRI image and structure files) were converted to Neuroimaging Informatics Technology Initiative (NIfTI) format for ease of use by participants. Conversions were performed using DICOMRTTool v. 1.0 [24]. We chose the NIfTI format for our data due to its widespread adoption, standardized structure, and its compatibility with a broad range of analysis tools commonly used in medical imaging challenges [25]. All images were cropped from the top of the clavicles to the bottom of the nasal septum (oropharynx region to shoulders) by using manually selected inferior/superior axial slices. This allowed for more consistent image fields of view and removal of identifiable facial structures (i.e., eyes, nose, ears); cropping did not impact any of the segmented volumes.
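As a concrete illustration, a minimal sketch of this cropping step is shown below using SimpleITK; the file name and the inferior/superior slice indices are hypothetical placeholders for the manually selected boundaries.

```python
import SimpleITK as sitk

# Minimal sketch of superior-inferior cropping along the axial (z) axis.
image = sitk.ReadImage("case_0001_preRT_T2.nii.gz")

inferior_slice, superior_slice = 40, 160  # manually selected per case

size = list(image.GetSize())              # (x, y, z)
size[2] = superior_slice - inferior_slice
cropped = sitk.RegionOfInterest(image, size=size, index=(0, 0, inferior_slice))

sitk.WriteImage(cropped, "case_0001_preRT_T2_cropped.nii.gz")
```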

Registered data (i.e., for Task 2) were generated using SimpleITK [23], where the mid-RT image served as the fixed image and the pre-RT image served as the moving image. Specifically, we applied (1) a centered transform initialization, (2) a rigid transformation, and (3) a deformable transformation with Elastix using a preset parameter map (parameter map 23 in the Elastix Model Zoo, which utilizes a B-spline transformation [26]). This particular deformable transformation was selected because it is open-source and was benchmarked in a previous similar application [27]. In a small minority of cases where excessive warping occurred during deformable registration, we defaulted to using only the rigid transformation. To ensure transparency and reproducibility, we provided a detailed example of our registration process on our GitHub repository [28].
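A simplified sketch of these steps is shown below. It assumes a SimpleElastix-enabled SimpleITK build for the deformable step; file names are hypothetical, and sitk.GetDefaultParameterMap("bspline") stands in for Model Zoo parameter map 23, which could instead be loaded with sitk.ReadParameterFile.

```python
import SimpleITK as sitk

# Minimal sketch of the three-step pre-RT -> mid-RT registration.
fixed = sitk.ReadImage("midRT_T2.nii.gz", sitk.sitkFloat32)
moving = sitk.ReadImage("preRT_T2.nii.gz", sitk.sitkFloat32)

# Step 1: centered transform initialization.
initial = sitk.CenteredTransformInitializer(
    fixed, moving, sitk.Euler3DTransform(),
    sitk.CenteredTransformInitializerFilter.GEOMETRY)

# Step 2: rigid registration refined from the centered initialization.
reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
reg.SetOptimizerAsRegularStepGradientDescent(
    learningRate=1.0, minStep=1e-4, numberOfIterations=200)
reg.SetInitialTransform(initial, inPlace=False)
reg.SetInterpolator(sitk.sitkLinear)
rigid = reg.Execute(fixed, moving)
moving_rigid = sitk.Resample(moving, fixed, rigid, sitk.sitkLinear, 0.0)

# Step 3: B-spline deformable registration with Elastix
# (requires a SimpleElastix-enabled build of SimpleITK).
elastix = sitk.ElastixImageFilter()
elastix.SetFixedImage(fixed)
elastix.SetMovingImage(moving_rigid)
elastix.SetParameterMap(sitk.GetDefaultParameterMap("bspline"))
registered = elastix.Execute()
sitk.WriteImage(registered, "preRT_T2_registered.nii.gz")
```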

Sources of Errors - Interobserver Variability

The largest source of error naturally emerges from differences in annotator segmentations. We mitigated this by combining segmentations via the STAPLE algorithm which we have shown can yield acceptable segmentations given a minimal number of annotator inputs [19].

We evaluated the interobserver variability (IOV) for GTVp and GTVn structures in our dataset using traditional geometric measures. For each structure (i.e., GTVp and GTVn) and each timepoint (i.e., pre-RT and mid-RT), we calculated pairwise IOV: for each patient, metrics were computed for all possible pairs of available annotator segmentations, and the median value across all pairs was then taken. Naturally, patient cases where only the senior faculty member contributed segmentations due to significant discrepancies among observers (see Annotation Characteristics) were excluded. Our analysis included four metrics: Dice Similarity coefficient (DSC), 95% Hausdorff distance (HD95), average surface distance (ASD), and surface DSC at a 2 mm tolerance (SDSC). Metrics were calculated using the Surface Distances Python package [29] and in-house Python code. Pre-RT IOV DSC values showed a median (interquartile range) of 0.747 (0.165) for GTVp and 0.845 (0.070) for GTVn. Mid-RT IOV DSC values were 0.558 (0.272) for GTVp and 0.808 (0.118) for GTVn. IOV based on all geometric measures is shown in Fig. 5.
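For concreteness, a minimal sketch of the per-patient pairwise IOV computation is given below, using the open-source surface-distance package [29]. The function name, the symmetrization of the directed average surface distances, and the exclusion of empty masks are assumptions for this example; the in-house code may differ.

```python
import itertools
import numpy as np
import surface_distance  # open-source package referenced in the text [29]

def pairwise_iov(masks, spacing_mm, label=1):
    """Median pairwise agreement across annotators for a single patient.

    masks: list of integer label arrays (one per annotator); spacing_mm:
    voxel spacing. Assumes every annotator segmented the structure
    (empty masks are excluded beforehand).
    """
    dsc, hd95, asd, sdsc = [], [], [], []
    for a, b in itertools.combinations(masks, 2):
        ga, gb = (a == label), (b == label)
        dsc.append(surface_distance.compute_dice_coefficient(ga, gb))
        sd = surface_distance.compute_surface_distances(ga, gb, spacing_mm)
        hd95.append(surface_distance.compute_robust_hausdorff(sd, 95))
        d_ab, d_ba = surface_distance.compute_average_surface_distance(sd)
        asd.append((d_ab + d_ba) / 2)  # symmetrize the two directed distances
        sdsc.append(surface_distance.compute_surface_dice_at_tolerance(sd, 2.0))
    return {name: float(np.median(vals))
            for name, vals in (("DSC", dsc), ("HD95", hd95),
                               ("ASD", asd), ("SDSC", sdsc))}
```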

To provide a more comprehensive assessment of IOV relevant to our challenge, we extended our analysis beyond traditional geometric measures to include aggregated Dice Similarity Coefficient [30] (DSCagg, see Sect. 2.4) IOV calculations. We first computed intermediate metrics in a pairwise fashion across all observers for each patient case independently. These metrics were then aggregated across all cases for each unique annotator pair, enabling the calculation of DSCagg-GTVp, DSCagg-GTVn, and subsequently DSCagg-mean (the mean of DSCagg-GTVp and DSCagg-GTVn) for each annotator pair. Finally, we derived the overall IOV DSCagg values by calculating a weighted average of these annotator pair DSCagg values. The weighting factor was based on the number of cases compared for each annotator pair, ensuring appropriate annotator representation in the final score. As before, cases where only the senior faculty member observer contributed final segmentations were excluded. Final weighted IOV DSCagg values for pre-RT segmentations were DSCagg-mean = 0.806, DSCagg-GTVp = 0.757, DSCagg-GTVn = 0.854. Final weighted IOV DSCagg values for mid-RT segmentations were DSCagg-mean = 0.714, DSCagg-GTVp = 0.600, DSCagg-GTVn = 0.828.

Fig. 5.

Interobserver variability (IOV) data for primary gross tumor volume (GTVp) and nodal gross tumor volume (GTVn) regions of interest stratified by pre-radiotherapy (pre-RT) and mid-radiotherapy (mid-RT) timepoints. (A) Dice Similarity coefficient (DSC), (B) 95% Hausdorff distance (HD95), (C) average surface distance (ASD), (D) surface DSC at a 2 mm tolerance (SDSC). Each datapoint corresponds to the median metric value across all pairs of observers for a given patient image. Each box represents the interquartile range, with the horizontal line indicating the median score. Outliers are shown as individual points outside the whiskers. Higher values indicate greater agreement for DSC and SDSC, while lower values indicate greater agreement for HD95 and ASD.

Training and Test Case Characteristics

Tasks 1 and 2 share a common training dataset consisting of 150 patient cases, ensuring consistency across both tasks. Training data were publicly released on Zenodo [31] under a CC BY 4.0 license. For each patient case, we provided a comprehensive set of data in NIfTI format. This dataset included six files per patient: the original pre-treatment T2-weighted MRI volume with its corresponding segmentation mask, the original mid-treatment T2-weighted MRI volume with its corresponding segmentation mask, and a registered version of the pre-treatment T2-weighted MRI volume with its corresponding segmentation mask (more details on registration in Data Preprocessing Methods). Each of these six files was linked to a unique anonymized case identifier, ensuring that all data for a given patient could be easily accessed and correctly associated.

The held-out private evaluation data comprised 52 additional cases, with two cases used for the challenge’s preliminary debugging phase, leaving 50 cases for the final test phase (more information in Sect. 2.5). Only the challenge organizers had access to the ground truth segmentations for the test cases until final publication of the full dataset.

Training and held-out private evaluation data were partitioned to contain similar distributions based on dataset characteristics such as image fat-suppression status, tumor response, and staging. The distributions based on various parameters are shown in Fig. 6.

Fig. 6.

Distribution of key parameters in training and held-out private evaluation sets (written as test). (A) T2-weighted MRI sequence type (Non-FS: non-fat suppressed, FS: fat suppressed). (B) Tumor response status at mid-therapy (NCR: non-complete response, Other: any combination of complete and non-response of primary and node). (C) Primary gross tumor volume (GTVp) and (D) nodal gross tumor volume (GTVn), both categorized as above or below the dataset median. (E) Human papillomavirus (HPV) status. (F) Tumor anatomic subsite (BOT: base of tongue). (G) T-stage and (H) N-stage as per the eighth edition American Joint Committee on Cancer staging system.

2.4 Assessment Method

Both tasks were evaluated in the same general manner using the aggregated Dice Similarity Coefficient (DSCagg). DSCagg was employed by Andrearczyk et al. for the segmentation task of the 2022 edition of the HECKTOR Challenge [14]. Specifically, the DSCagg metric is defined as:

$$\mathrm{DSC}_{\mathrm{agg}} = \frac{2\sum_{i}\left|A_{i}\cap B_{i}\right|}{\sum_{i}\left(\left|A_{i}\right|+\left|B_{i}\right|\right)}$$
(1)

where $A_i$ and $B_i$ are the ground truth and predicted segmentations for image $i$, with $i$ spanning the entire test set. That is, DSCagg accumulates the per-case intersection and volume terms and then aggregates them across the test set, yielding a single value. DSCagg was initially described in detail by Andrearczyk et al. [30].

Conceptually, the 2022 edition of the HECKTOR Challenge had similar segmentation outputs (i.e., GTVp and GTVn for HNC patients) as our challenge, so we deemed DSCagg an appropriate metric. Since the presence of GTVp and GTVn was not consistent across all cases, the DSCagg metric is well-suited for this task. Unlike conventional volumetric DSC, which can be overly sensitive to false positives when the ground truth mask is empty—resulting in a DSC of 0—the DSCagg metric is more robust. It effectively handles cases where certain structures may or may not be present, providing a more balanced evaluation across the diverse scenarios encountered in our data. Notably, DSCagg was shown to be a stable metric with respect to final ranking in a secondary analysis of the HECKTOR 2021 results [32], further highlighting its appropriateness for the challenge.

The metric was computed individually for GTVp (DSCagg-GTVp) and GTVn (DSCagg-GTVn), and the mean of the two (DSCagg-mean) was used for the final challenge ranking (similar to HECKTOR 2022). The metric was calculated separately for Task 1 (pre-RT segmentation) and Task 2 (mid-RT segmentation). We provided an example of how the DSCagg was calculated for this challenge on our GitHub repository [28].
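For reference, a minimal sketch consistent with Eq. 1 follows; it is illustrative and not necessarily identical to the repository example [28].

```python
import numpy as np

def dsc_agg(gt_masks, pred_masks, label):
    """Aggregated DSC (Eq. 1) for one label over the whole test set.

    gt_masks / pred_masks: sequences of integer label arrays, one per
    case. Per-case intersection and volume terms are accumulated first,
    then combined into a single score.
    """
    num = den = 0.0
    for gt, pred in zip(gt_masks, pred_masks):
        a, b = (gt == label), (pred == label)
        num += 2.0 * np.logical_and(a, b).sum()
        den += a.sum() + b.sum()
    return num / den

# Final ranking score: mean of the per-label aggregated values,
# e.g. 0.5 * (dsc_agg(gts, preds, 1) + dsc_agg(gts, preds, 2)).
```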

2.5 Docker Submission and Challenge Phases

Algorithm submissions for the challenge were managed through grand-challenge.org, with participants required to submit their solutions as Docker container images [16]. For algorithm submissions, Task 1 participants received only an unseen pre-RT image, while Task 2 participants were provided with an unseen mid-RT image, a pre-RT image with its corresponding segmentation, and a registered pre-RT image with its corresponding registered segmentation. Toy examples of algorithm inputs for both tasks are shown in Fig. 7. By utilizing a Docker framework, we maintained data integrity and challenge fairness, enabling us to use identical patient cases for both tasks without disclosing Task 1’s ground truth segmentation masks to Task 2 participants. To ensure practical implementation and efficient evaluation, we established specific technical constraints, namely that algorithms were required to complete processing within 20 min per patient case through the Grand Challenge runtime environment (using an NVIDIA T4 graphics processing unit). To assist participants, we provided detailed examples of Docker image containerization on our GitHub repository [28]. We launched two distinct phases for each task on August 15th, 2024: a preliminary development phase and a final test phase.

The preliminary development phase served as a “practice” round, allowing participants to debug their algorithms and familiarize themselves with the Docker submission framework. During this optional but highly recommended phase, teams could make up to five valid submissions. We used data from two patients not included in the training set, selecting straightforward cases with easily identifiable segmentation targets to facilitate the debugging process. The two patients selected for the preliminary phase were both human papillomavirus (HPV)-positive with large GTVp and GTVn targets, with one patient’s images featuring fat suppression and the other’s without, providing participants exposure to different MRI acquisition techniques commonly encountered in the dataset. Results from this phase were immediately displayed on the leaderboard but did not impact the final rankings.

The final test phase, which was composed of 50 cases, determined the official evaluation and ranking of participants’ algorithms. In contrast to the development phase, each team was limited to a single valid submission. This restriction ensured a fair comparison of each team’s best-performing algorithm. The test set for this phase was entirely separate from the development phase data, providing a true measure of the algorithms’ performance on unseen cases.

Fig. 7.

Toy examples of model input and outputs for Task 1 (pre-radiotherapy segmentation, top) and Task 2 (mid-radiotherapy segmentation, bottom).

2.6 Baseline Models

To establish performance benchmarks for Tasks 1 and 2, we developed baseline algorithms using nnU-Net [33], widely regarded as the current DL gold standard for medical image segmentation [34]. For Task 1, we implemented an nnU-Net v2 model with default parameters (full 3D resolution, 1000 epochs, 5-fold cross-validation), using only the pre-RT training images (n = 150) as input. We applied an identical nnU-Net approach for Task 2, but utilized mid-RT images (n = 150) for training instead. No post-processing was applied to baseline models. Model training was performed on a Lambda workstation with 4 NVIDIA RTX A6000 graphics processing units. DL training took approximately 24 h per model. Additionally, we created a simple “null” algorithm for Task 2, which uses unmodified pre-treatment segmentations as mid-RT predictions. This approach mimics a typical starting point for segmentation adjustments in routine clinical workflows.
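For illustration, the null algorithm reduces to propagating the registered pre-RT segmentation unchanged; a minimal sketch with hypothetical file names is shown below.

```python
import shutil

# Minimal sketch of the Task 2 "null" baseline: the registered pre-RT
# segmentation is propagated unchanged as the mid-RT prediction.
shutil.copyfile("case_0001_preRT_mask_registered.nii.gz",
                "case_0001_midRT_prediction.nii.gz")
```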

2.7 Post-challenge Publications and Conference

To be eligible for the final ranking and prizes, participants were required to submit a concise paper detailing their methods. Teams that participated in both tasks (pre-RT and mid-RT segmentation) had the option to submit either a single comprehensive paper or two separate papers describing their approaches. These submissions were subsequently published in our post-challenge proceedings, providing a valuable resource for the research community. Following the challenge’s conclusion, we hosted a live virtual webinar event on the Zoom video conference platform, where top-performing teams were invited to present their innovative methods. This event culminated in the official announcement of the challenge winners.

3 Challenge Algorithm Results

3.1 Participation

As of September 18, 2024 (submission deadline), the number of registered teams for the challenge (regardless of the tasks) was 107. For each task, each team could submit up to five valid submissions for the preliminary development phase and one valid submission for the final test phase. By the submission deadline, we received a total of 164 valid entries across both tasks: 95 for Task 1 (75 in the preliminary development phase, 20 in the final test phase) and 69 for Task 2 (54 in the preliminary development phase, 15 in the final test phase). After accounting for eligibility, 19 unique teams were identified. The geographical distribution of initial registrants is shown in Fig. 8A, while the distribution of final eligible participants is shown in Fig. 8B. The geographical distribution of initial registrants and final participants followed similar patterns, except for Europe and Asia, where the relative proportions reversed between the initial registrants and the final eligible participants.

Fig. 8.

Geographical distribution of initial registrants and final participants in the HNTS-MRG 2024 challenge by continent. (A) Number of initial registrants, showing the distribution of participants who signed up for the competition on the Grand Challenge website. (B) Number of final eligible participants, reflecting the participants who completed test phase submissions and submitted corresponding manuscripts. The color scale in both maps represents the count of individual Grand Challenge accounts per continent. For the final eligible participants, only the primary contact person from each team was considered.

3.2 Task 1 (Pre-RT Segmentation) Specific Results

Summary of Participants Methods

This section provides an overview of the methods proposed by each team for the automatic segmentation of the GTVp and GTVn in Task 1. The descriptions are presented in the order of the official rankings, beginning with the top-performing team. Each method is briefly outlined, focusing on the key distinguishing features of the method and corresponding submitted manuscript.

Team TUMOR [35] experimented with various nnU-Net methodologies alongside MedNeXt models [36]—a ConvNeXt-based architecture with transformer-inspired scaling—of different kernel sizes. Their models were pre-trained on mid-RT and registered pre-RT images, then fine-tuned on the original pre-RT images. They also explored various ensembling strategies but found that averaging nnU-Net and MedNeXt solutions resulted in worse performance than using either model individually. Ultimately, their best approach was a “small” MedNeXt model with a kernel size of 3, which was used for the final test phase submission of Task 1. Interestingly, they also experimented with fine-tuning using a public meningioma RT dataset [37] but found it did not enhance performance, likely due to discrepancies in features learned during pre-training.

Team HiLab [38] explored a fully supervised learning approach enhanced with pre-trained weights and data augmentation techniques. Notably, they used the SegRap2023 challenge dataset [9] for fully supervised pre-training and applied histogram matching during preprocessing to align intensity differences between CT and MRI data, along with nonlinear transformations to image intensities. To mitigate the impact of negative samples and encourage the network to learn class distributions more effectively, they employed the MixUp technique [39], which augments the dataset by creating new training examples through linear interpolation between sample pairs (see the generic sketch below). For their Task 1 submission, they ultimately used a combination of their base model and pre-training + MixUp model, incorporating ensembling from cross-validation folds.
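For illustration only, a generic MixUp sketch is shown below; this is not Team HiLab's implementation, and the interpolation of one-hot labels follows the standard formulation from [39].

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Generic MixUp [39]: form a new training example by linear
    interpolation between two samples. x*: image arrays, y*: one-hot
    label arrays. Illustrative only."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient ~ Beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```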

Team Stockholm_Trio [40] explored a variety of architectures, including SegResNet, nnU-Net, ResEnc, MedNeXt, and U-Mamba. For Task 1, all models were trained exclusively on pre-RT data. The ResEnc and MedNeXt models consistently outperformed the others, particularly when trained on the preprocessed dataset (crop + intensity standardization). The best results were achieved by ensembling the ResEnc and MedNeXt models. However, the execution time of their Docker image on the evaluation platform exceeded the time limit, causing job submission failures. To meet the resource constraints, the final submitted models were significantly simplified by omitting certain preprocessing and ensembling steps.

Team CEMRG [41] introduced a two-stage self-supervised learning approach which leverages unlabeled data to develop robust pre-trained models. In the first stage, they utilized a Self-Supervised Student-Teacher Learning Framework, specifically DINOv2 [42] adapted for 3D data, to learn effective representations from a limited unlabeled dataset. In the second stage, they fine-tuned an xLSTM-based [43] UNet model designed to capture both spatial and sequential features. For Task 1, the team fine-tuned their model on pre-RT segmentation data, which was ultimately submitted for the final test phase.

Team mic-dkfz [44] utilized nnU-Net with a residual encoder architecture [34] for their Task 1 solution. Importantly, they experimented with various training strategies, including extensive data augmentation (Aug++), pretraining, ensembling, post-processing, and test-time augmentation. The team leveraged transfer learning by pretraining their model on an unprecedentedly large set of public 3D medical imaging datasets and then fine-tuning on pre-RT data. For their submission to the final test phase, they used an ensemble of models that combined Aug++ and pretraining.

Team RUG_UMCG [45] initially experimented with a custom framework for Task 1, incorporating MONAI [46] with U-Net, 3D U-Net, and Swin UNETR architectures, as well as MedSAM—a foundation model pretrained on a large medical dataset [47]. However, these approaches proved suboptimal in terms of training speed, validation results, and overall performance compared to a vanilla nnU-Net. Ultimately, they transitioned to the nnU-Net framework, employing a 15-fold cross-validation ensemble. For Task 1, they enhanced their training data by incorporating mid-RT data as separate inputs. During inference, test-time augmentation was disabled to meet the challenge runtime limit.

Team alpinists [48] conducted a comprehensive literature review to inform their approach, ultimately proposing a resource-efficient two-stage segmentation method using nnU-Net with residual encoders. In this two-stage approach, the segmentation results from the first training round guided the sampling process for a second refinement stage. For the pre-RT task, they achieved competitive results using only the first-stage nnU-Net, which was submitted as their final model. For the final test set submission, they retrained their selected pre-RT models on the full 150-patient dataset. Uniquely, the team used Code-Carbon [49] to monitor the computational efficiency of their approach.

Team SZTU_SingularMatrix [50] investigated the use of STU-Net, a model designed to improve scalability and transferability for medical segmentation [51]. Their approach involved large-scale pretraining on datasets such as TotalSegmentator [52] followed by fine-tuning on the challenge dataset. They explored various STU-Net variants with different parameter sizes and ultimately selected the STU-Net-B model (featuring 58.26 million parameters) for their final test phase submission.

Team SJTU & Ninth People’s Hospital [53] explored the use of an nnU-Net model with a residual encoder, coupled with explicit selection of training data. Their initial experiments showed poor performance when the model encountered cases with high background ratios. To address this, for their Task 1 approach they retrained the model using a carefully selected subset of data consisting of cases with background ratios of 70–90% (i.e., the proportion of background voxels to tumor voxels). This approach aimed to improve segmentation performance, and they also incorporated registered data in the training process. While the model performed well on cases with lower background ratios, further optimization was needed for cases with higher background ratios. For the Task 1 final test phase, the authors submitted an ensemble model trained specifically on cases with high background ratios.

Team DCPT-Stine [54] integrated two promising segmentation frameworks, UMamba [55] and nnU-Net with a residual encoder, into a new approach called UMambaAdj. This method combines the feature extraction strengths of the residual encoder with the long-range dependency capabilities of Mamba blocks. The proposed approach demonstrated comparable segmentation accuracy to the original UMambaEnc, but with reduced training and inference times. Additionally, the team found that UMamba blocks significantly improved distance-based metrics, although these metrics were not considered in the final challenge rankings.

Team UW LAIR [56] implemented SegResNet [57] with deep supervision for Task 1. They trained the model using both pre-RT and mid-RT data, but only pre-RT data was used for model selection in the validation set. For each training/validation split, they used three random seeds, selecting the model with the highest DSCagg in the validation set for each of the five cross-validation folds. This process was repeated twice with different random seeds, resulting in a total of 10 models. The final Task 1 submission was an ensemble of these 10 models.

Team NeuralRad [58] developed an enhanced nnU-Net model augmented with an autoencoder architecture. During inference, they added an output channel to predict the original input images, allowing the model to generate both segmentation results and autoencoder predictions simultaneously. By introducing the original training images as additional input channels and incorporating mean squared error loss alongside dice loss, the model was able to learn additional image features, improving segmentation accuracy.

Team 1WM [59] benchmarked several state-of-the-art segmentation architectures to determine whether recent advances in deep encoder-decoder models are effective for low-data and low-contrast tasks. Interestingly, their results showed that traditional UNet-based methods outperform more modern architectures like UNETR, SwinUNETR, and SegMamba, suggesting that factors like data preparation, the underlying objective function, and preprocessing play a greater role than the network architecture itself. For Task 1, they focused on a single-channel pre-RT network, using the pre-RT volume as input. Ultimately, they submitted a ResUNet 5-fold cross-validation ensemble model for the final test phase of Task 1.

Team dlabella29 [60] explored SegResNet integrated into Auto3DSeg via MONAI. The models were pre-trained on both pre-RT and mid-RT image-mask pairs and then fine-tuned on pre-RT data without any preprocessing. Extensive exploratory analysis of the training data also played a key role in shaping their post-processing decisions which included removing smaller tumor and node predictions. For their final test phase submission, they used an ensemble of six SegResNet models, fusing predictions through weighted majority voting.

Team PocketNet [61] implemented a lightweight CNN architecture called PocketNet [62] using the medical image segmentation toolkit [63]. Unlike traditional networks that double the number of feature maps at lower resolutions, PocketNet maintains a constant number of feature maps across all resolution levels. This design results in significantly faster training while reducing memory usage and requirements. The PocketNet model was trained on pre-RT images via 5-fold cross-validation, and the corresponding ensemble was submitted for the final test phase of Task 1.

Team andrei.iantsen [64] explored different variations of standard U-Net architectures in their solutions. They tested various processing configurations, including normalization, augmentation, and weighting techniques to find an optimal approach. For Task 1, they trained their networks on all available MRI images, including pre-RT, mid-RT, and registered pre-RT images. In the final test phase of Task 1, they submitted a 5-fold cross-validation ensemble that combined patch-wise normalization, scheduled augmentation, and Gaussian weighting.

Team FinoxyAI [65] proposed a dual-stage 3D U-Net approach, called DualUnet, which uses a cascaded U-Net framework for progressive segmentation refinement. In the first stage, the models produce an initial binary segmentation, which is then refined by an ensemble of second-stage models to achieve multiclass segmentation. Both pre-RT and mid-RT MRI scans were used as training inputs. This dual-stage approach consistently outperformed single-stage methods in validation experiments. The approach was trained using 5-fold cross-validation and submitted to the final test phase as an ensemble of five coarse models and ten refinement models.

Team ECU [66] employed LinkNet [67] ensembles for their solution. They initially pre-trained a LinkNet model with weights from ImageNet [68] followed by fine tuning on the challenge dataset. From the training process, they selected eight high-performing model weights to create an ensemble. Each selected weight was used to generate a LinkNet architecture, resulting in eight networks whose predictions were averaged to produce the final segmentation. Their validation experiments demonstrated that the ensemble outperformed any individual model. Interestingly, they also found that increasing the number of networks beyond eight did not significantly enhance accuracy, suggesting a point of diminishing returns. They suggest their approach leverages the benefits of ensemble learning without the computational cost of training each network from scratch.

Challenge Ranking Results

The results for Task 1 are reported in Table 2. The DSCagg-mean results from the 18 participants ranged from 0.571 to 0.825 (overall mean = 0.783). Team TUMOR achieved the highest overall performance, with a DSCagg-mean of 0.825, including the top GTVn DSCagg score of 0.873. Team Stockholm_Trio secured the best GTVp DSCagg result, with a score of 0.795. Only the top 3 teams achieved DSCagg-mean results higher than the nnU-Net baseline (0.817). The top 9 teams (top 50%) achieved DSCagg-mean results higher than interobserver variability (0.806). Notably, two participants for Task 1 withdrew from the competition before submitting their methods manuscripts; their results are displayed for completeness in Table 2 but are not incorporated into any analysis.

Table 2. Task 1 (pre-radiotherapy segmentation) results for participating teams. Results are shown in descending order by mean aggregated DSC (DSCagg). DSCagg scores for primary gross tumor volume (GTVp) and metastatic lymph nodes (GTVn) are also shown. The highest scores for each category are bolded. The mean performance, baseline nnU-Net, interobserver variability (derived in Sect. 2.3), and anonymized withdrawn teams are also shown at the bottom of the table. Values are rounded to three decimal places.

3.3 Task 2 (Mid-RT Segmentation) Specific Results

Summary of Participants Methods

This section provides an overview of the methods proposed by each team for the automatic segmentation of the GTVp and GTVn in Task 2. The descriptions are presented in the order of the official rankings, beginning with the top-performing team. Each method is briefly outlined, focusing on the key distinguishing features of the method and/or corresponding manuscript.

Team UW LAIR [56] integrated novel mask-aware attention modules into a SegResNet framework, allowing pre-RT masks to influence features learned from paired mid-RT data. The model took mid-RT MRI images along with pre-RT masks as inputs. During training, paired pre-RT data was also included, with prior masks set to zeros. They also applied mask propagation through deformable registration, which excluded predicted segmentations on mid-RT MRI scans that had no overlap with registered pre-RT segmentations. Ultimately, the attention-based approach outperformed the baseline method which concatenated mid-RT images with pre-RT masks. As in their Task 1 approach, they generated several models using cross validation splits and random seeds then ensembled them for the final submission.

Team mic-dkfz [44] integrated registered pre-RT images and their segmentations as additional inputs into the nnU-Net framework, through a method they referred to as LongiSeg [69]. Interestingly, though they initially experimented with the residual encoder architecture, it did not yield improved results. They investigated several LongiSeg variants and ultimately submitted an ensemble of their LongiSeg Pre-Seg-C model for the Task 2 final test phase. In this model, a one-hot encoding of the registered prior scan’s segmentation mask was added to the network input (in addition to image inputs), following the order (current scan, prior scan, prior mask). The model was trained in chronological order, with the mid-RT scan as the current scan and the pre-RT scan as the prior.

Team HiLab [38] introduced an innovative training strategy with a novel network architecture, termed Dual Flow UNet, which features separate encoders for mid-RT images and registered pre-RT images along with their labels. In this setup, the mid-RT encoder progressively integrates information from pre-RT images and labels during forward propagation. Their submission to the Task 2 final test phase was an intricate ensemble of methods, combining folds from the Dual Flow UNet with base + pre-RT and pre-training + MixUp variants.

Team andrei.iantsen [64] used the same standard U-Net-based processing approaches for Task 2 as in Task 1. Notably, for Task 2 they trained models using four simultaneous input channels: the mid-RT image, the registered pre-RT image, and two binary masks for the GTVp and GTVn on the registered pre-RT image. As in Task 1, their submitted 5-fold cross-validation ensemble model for the Task 2 final test phase utilized all three modifications (patch-wise normalization, scheduled augmentation, and Gaussian weighting).

Team Stockholm_Trio [40] used the same architectures for Task 2 as in Task 1, with the addition of ablation studies to test different combinations of image data and segmentation masks. Due to difficulties in optimizing the MedNeXt model for this task, only the nnU-Net ResEnc model was ultimately used for training. Notably, they applied a dilation to the pre-RT masks, then derived signed distance maps from the dilated masks, incorporating them as prior information to guide the network’s attention (preDistance-prior). Their results showed that the preDistance-prior settings outperformed other models. As with Task 1, to meet resource constraints, the final submitted models were simplified by omitting preprocessing and ensembling steps.

Team lWM [59] implemented similar experiments for Task 2 as in Task 1, this time using a three-channel mid-RT network with the concatenated mid-RT volume, registered pre-RT volume, and the associated registered pre-RT ground truth segmentation. As with Task 1, they found simple U-Net models superior and subsequently submitted a ResUNet 5-fold cross-validation ensemble model for the final test phase of Task 2.

Team RUG_UMCG [45] applied a similar approach to their Task 1 solution, utilizing the nnU-Net framework with a 15-fold cross-validation ensemble. For Task 2, they implemented a 3-channel input, where the mid-RT MRI volume served as the first channel, the registered pre-RT MRI volume as the second channel, and the corresponding segmentation mask as the third channel. As in Task 1, test-time augmentation was disabled during inference to comply with the challenge’s runtime limit.

Team TUMOR [35] applied similar training approaches and architectures for Task 2 as in their Task 1 experiments, with the key difference being the use of concatenated multi-channel inputs to improve segmentation performance. Interestingly, they observed that using registered pre-RT images alone, without their segmentation masks, did not contribute useful information for segmenting mid-RT images. However, including both registered pre-RT images and their segmentation masks improved the DSCagg for mid-RT segmentation. Ultimately, an nnU-Net ensemble of the full-resolution and cascade models was selected as the final model for Task 2.

Team DCPT-Stine [70] employed a novel approach that computes gradient maps from pre-RT images and applies them to mid-RT images to enhance tumor boundary delineation. They applied connected component analysis to registered pre-RT tumor segmentations to initially create bounding boxes. These regions were then used to generate gradient maps on mid-RT T2w images, which served as additional input channels. Gradient maps from pre-RT images and their ground truth segmentation were also incorporated as extra training data. The method was built on nnU-Net with a residual encoder, and validation results showed that leveraging pre-RT information improved segmentation results. Experiments showed that using gradient maps led to more precise boundary localization than images alone.

Team alpinists [48] applied a similar two-stage approach as their Task 1 approach. However, to enhance segmentation performance, they incorporated prior knowledge from the registered pre-RT images and masks as an additional input for the second-stage refinement network. By leveraging the pre-RT data in the second stage, they were able to achieve more accurate mid-RT segmentations for their final submission. As with their Task 1 solution, they retrained the selected mid-RT model on the full 150-patient dataset.

Team NeuralRad [58] applied a similar approach for Task 2 as their Task 1 solution, which coupled nnU-Net with an autoencoder architecture. The main difference in their Task 2 submission was utilizing mid-RT data instead of pre-RT data.

Team dlabella29 [60] applied a similar methodology using SegResNet for Task 2 as they did in Task 1. Specific Task 2 preprocessing involved setting all voxels more than 1 cm from the registered pre-RT masks to background, followed by applying a bounding box to the image. The modified registered pre-RT and mid-RT MRI were used as input, and model training involved a single stage without any pre-training or fine tuning. Interestingly, they explored systematic radial reductions in the registered pre-RT masks and found that this simple technique performed surprisingly well. However, in keeping with the challenge’s spirit, they avoided using simple mask reductions as their submission, as this method is not suitable for adaptive RT planning where patient-specific solutions are essential. Ultimately, they submitted an ensemble of five SegResNet models for the Task 2 final test phase submission.

Team CEMRG [41] employed the same two-stage self-supervised approach as in Task 1, but for Task 2, they fine-tuned their xLSTM-based UNet model on mid-RT segmentation data. This enabled the model to incorporate temporal dependencies specific to mid-treatment tumor response.

Team SJTU & Ninth People’s Hospital [53] applied the same approach for Task 2 as they did for Task 1, using the nnU-Net residual encoder model coupled with selective training on specific data subsets. The key difference for Task 2 was the inclusion of mid-RT images instead of pre-RT images for model training. As in Task 1, they submitted an ensemble of folds for the final test phase.

Team TNL_skd [71] proposed an end-to-end coarse-to-fine cascade framework based on a 3D U-Net, inspired by future frame prediction in natural images and video [72]. The model has two interconnected components: a coarse segmentation network and a fine segmentation network, both sharing the same architecture. During coarse segmentation, a dilated pre-RT mask and mid-RT image are used to localize the region of interest and generate a preliminary prediction. During fine segmentation, resampling focuses on the region of interest, refining the prediction with the mid-RT image to produce the final mask. Notably, they also investigated training the networks separately but found the end-to-end combined model was superior.

Challenge Ranking Results

The results for Task 2 are reported in Table 3. The DSCagg-mean results from the 15 participants ranged from 0.562 to 0.733 (overall mean = 0.688). Team UW LAIR achieved the highest overall performance, with a DSCagg-mean of 0.733, including the top GTVp DSCagg score of 0.607. Team mic-dkfz secured the best GTVn DSCagg result, with a score of 0.875. All teams except one achieved DSCagg-mean results higher than the nnU-Net baseline (0.633) and the null algorithm (0.601). Only the top 4 teams (~top 25%) achieved DSCagg-mean results higher than the interobserver variability (0.714).

Table 3. Task 2 (mid-radiotherapy segmentation) results for participating teams. Results are shown in descending order by mean aggregated DSC (DSCagg). DSCagg scores for primary gross tumor volume (GTVp) and metastatic lymph nodes (GTVn) are also shown. Highest scores for each category are bolded. The performance of the mean, baseline nnU-Net, null algorithm (simple structure propagation from registered images), and interobserver variability (derived in Sect. 2.3) are also shown at the bottom of the table. Values are rounded to three decimal places.

3.4 General Results Summary

A boxplot summarizing both Task 1 and Task 2 performance is shown in Fig. 9.

A correlation analysis was conducted to assess the relationship between participant performance in Task 1 and Task 2, including only those participants who completed both tasks. Correlations were generally weak, with no significant relationships identified. Kendall’s Tau correlation coefficients and corresponding p-values were: DSCagg-mean (-0.01, p = 1.00), DSCagg-GTVp (0.01, p = 1.00), and DSCagg-GTVn (0.18, p = 0.38).
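
This analysis is straightforward to reproduce with SciPy; the score arrays below are hypothetical placeholders, not the actual per-team results.

```python
from scipy.stats import kendalltau

# Hypothetical DSCagg-mean scores for teams that completed both tasks.
task1_scores = [0.81, 0.79, 0.78, 0.76]
task2_scores = [0.73, 0.70, 0.71, 0.69]

tau, p_value = kendalltau(task1_scores, task2_scores)
print(f"Kendall's tau = {tau:.2f}, p = {p_value:.2f}")
```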

Fig. 9.

Boxplot comparison of aggregated Dice Similarity Coefficient (DSCagg) scores across Task 1 (pre-radiotherapy) and Task 2 (mid-radiotherapy) for three metrics: DSCagg mean, DSCagg primary gross tumor volume (GTVp), and DSCagg metastatic lymph nodes (GTVn). Each box represents the interquartile range, with the horizontal line indicating the median score. Outliers are shown as individual points outside the whiskers. Task 1 is represented in red, and Task 2 in blue. Scatter symbols indicate the nnU-Net baseline (triangle) and interobserver variability (IOV, inverted triangle). (Color figure online)

4 Discussion: Putting the Results into Context

4.1 Outcome and Findings

Data challenges play a crucial role in advancing research and facilitating the clinical implementation of AI technologies [73]. Our challenge represents the first crowdsourced initiative for MR-based segmentation in HNC, with a unique focus on investigating whether incorporating prior timepoint data enhances auto-segmentation performance in RT applications. This approach addresses a critical gap in the field and provides valuable insights for adaptive RT workflows.

Task 1 (pre-RT segmentation) results demonstrated the high performance of auto-segmentation algorithms, with most solutions based on nnU-Net architectures. The top 50% of submitted methods achieved DSCagg-mean scores comparable to or exceeding our measured IOV (DSCagg-mean ~ 0.80), indicating their potential for clinical application. Our results align closely with those of HECKTOR 2022 [14], which also used DSCagg as a metric and saw top-performing algorithms achieve scores around 0.80, though its more heterogeneous test set potentially posed a more complex segmentation problem. Generally, GTVp structures were harder for algorithms to segment than GTVn structures. While teams experimented with various training strategies and DL architectures, no clear optimal strategy for maximizing performance emerged, though it is worth noting that two of the top three teams used MedNeXt, a ConvNeXt-based convolutional architecture with transformer-inspired scaling [36], in their approach. Moreover, our baseline nnU-Net algorithm already achieved high-performing results, suggesting that current state-of-the-art methods provide a solid foundation for further improvements. The minimal quantitative differences observed between top-performing models echo findings from previous HNC challenges like HECKTOR [14], SegRap [9], and H&N-Seg [15]. This consistency across challenges underscores the robustness of current pre-RT segmentation algorithms, particularly given the strong baseline performance of nnU-Net. It suggests that, for this specific task, we may be approaching a performance plateau with current DL architectures and available training data.

Task 2 (mid-RT segmentation) presented a more challenging problem, as clearly evidenced by the lower overall algorithmic performance compared to Task 1. This aligns with our expectations and the higher IOV observed in mid-RT annotations (DSCagg-mean ~ 0.71). As with Task 1, algorithms found GTVp structures more challenging to segment than GTVn structures. However, in Task 2 this difficulty gap was wider, aligning with the trends observed in our interobserver variability data. Notably, tumor shrinkage during treatment is often accompanied by other radiation-induced biological effects [74], such as inflammation and necrosis. These changes can be visible on imaging and may complicate the accurate contouring of intra-treatment scans [75], particularly for GTVp structures. Consequently, our baseline nnU-Net model for this task fell short of the DSCagg-mean IOV, likely due to the challenge of accurately capturing these complex, evolving tumor characteristics. Interestingly, a simple “null” model mimicking static contour propagation performed surprisingly similarly to the baseline nnU-Net model (DSCagg-mean ~ 0.60). It is worth noting that our measured IOV may be slightly inflated due to the exclusion of particularly challenging cases (see Sect. 2.3, Sources of Errors - Interobserver Variability), potentially setting a higher benchmark than typically expected. Importantly, the vast majority of submitted algorithms substantially outperformed the baselines, with some even surpassing the IOV threshold. This achievement underscores the potential value of advanced auto-segmentation methods in adaptive RT workflows. However, the fact that only about 25% of teams surpassed IOV for Task 2, compared to 50% for Task 1, highlights the novel nature of this segmentation challenge and the need for innovative approaches. Moreover, GTVp IOV was surpassed only by the winning algorithm, further illustrating the need to focus on GTVp auto-contouring improvements. Interestingly, average GTVn segmentation performance was higher in Task 2 than in Task 1, likely because most OPC GTVn remain large and do not achieve a complete response by mid-RT [76], simplifying the segmentation task, especially when prior segmentation masks were incorporated. As expected, the most successful methods thoughtfully incorporated registered pre-RT data (i.e., images and masks), typically through novel DL architectural modifications, demonstrating the utility of leveraging prior timepoint information in adaptive RT auto-segmentation solutions. Although this challenge only utilized prior information from pre-RT to mid-RT scans, the same frameworks could potentially be extended to incorporate imaging data from additional intra-treatment timepoints.
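
For concreteness, the “null” baseline referenced above amounts to static contour propagation. A minimal sketch, assuming masks are NumPy arrays already registered to the mid-RT frame:

```python
import numpy as np


def null_prediction(registered_pre_rt_mask: np.ndarray) -> np.ndarray:
    # Static contour propagation: the registered pre-RT mask is returned
    # unchanged as the mid-RT prediction, with no adaptation to the
    # patient's mid-treatment anatomy.
    return registered_pre_rt_mask.copy()
```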

4.2 Limitations of the Challenge

While we have striven for a comprehensive data challenge with adequate documentation, curation efforts, and execution, our study is not without limitations.

First, a primary limitation of our data challenge was the relatively modest patient cohort size, drawn from a single institution. Our total cohort (~ 200 cases) is on par with some previous HNC challenges like SegRap 2023 [9], but falls considerably short of larger-scale initiatives such as HECKTOR 2022 [14] (~ 900 cases). It is worth noting that, despite this limitation, DL auto-segmentation algorithms have demonstrated remarkable performance even with limited data [77, 78], as evidenced by the high performance achieved on our test sets. Nevertheless, expanding our dataset over time, following the example set by challenges like HECKTOR, would be beneficial for future iterations. On a related note, another limitation is our focus on tumors of the oropharyngeal region, which restricts the diversity of HNC subsites represented in our study. While broadening the range of HNC subsites would be valuable, it is important to consider the potential advantages of a more focused approach. Recent recommendations in DL auto-segmentation suggest that decomposing tasks to reduce class imbalance (e.g., focusing on the oropharyngeal region) may lead to more effective data utilization and superior models [79]. This data-centric approach could potentially yield better results than a single, all-encompassing model covering diverse HNC subsites.

A second significant limitation was the high degree of IOV in our annotations, particularly evident in mid-RT GTVp structures. This variability, while expected, aligns with existing literature on human IOV in HNC tumor segmentation using MRI [80]. To address this issue in future challenges, it would be beneficial to implement strict annotation guidelines for clinician annotators. While such guidelines exist for clinical target volumes [81, 82], they are notably absent for gross tumor volumes, highlighting an area for improvement in future iterations. Furthermore, our study’s reliance solely on MRI, while valuable for MR-centric adaptive approaches (e.g., MR-Linac), may have limited the accuracy of tumor delineation. Incorporating additional systematically co-registered imaging modalities such as PET and CT could enhance both the generation of ground truth segmentations by physicians and overall model performance, as supported by previous research [83, 84]. Although we initially planned to include multiple MRI sequences, particularly diffusion-weighted sequences (e.g., apparent diffusion coefficient maps), data curation constraints prevented this inclusion without significantly reducing our sample size. Future challenges should explore the integration of additional MRI sequences (i.e., multiparametric MRI), as they could provide crucial information for more precise tumor segmentation [85].

Finally, while we aimed for a robust evaluation using DSCagg, a metric previously validated by Andrearczyk et al. [30, 32], our choice of evaluation metrics could be expanded in future iterations. Recent tumor segmentation challenges involving multiple objects, such as BraTS-METS 2024 [86], have employed more sophisticated measures like lesion-wise DSC. This approach uses ground truth label dilation to better capture lesion extent and rigorously penalizes false positive and false negative lesions with a score of 0. Furthermore, related metrics developed for multiple sclerosis lesions, such as the object-normalized DSC proposed by Raina et al. [87], might offer advantages in handling volume discrepancies. This adaptation of DSC scales precision at a fixed recall rate, addressing bias related to the occurrence rate of the positive class in the ground truth. Notably, for RT-related tasks, incorporating surface distance measurements [88] or spatially accounting for healthy tissue proximity [89] could also provide a more comprehensive evaluation. Additionally, treating the nodal component of these tasks as an object detection problem in conjunction with segmentation could offer a more nuanced assessment of algorithm performance. In future iterations, adopting a broader range of evaluation metrics would likely provide a more holistic understanding of algorithm performance and better align with the specific intricacies of HNC segmentation for MRI-guided RT.
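
To make the lesion-wise idea concrete, the sketch below is an illustrative simplification, not the official BraTS-METS implementation: ground truth lesions are identified by connected components, matched to overlapping predicted components after a small dilation, and any unmatched predicted component is penalized with a score of 0.

```python
import numpy as np
from scipy import ndimage


def lesion_wise_dsc(pred, gt, dilate_iters=1):
    """Mean per-lesion Dice; unmatched predicted components score 0."""
    gt_lbl, n_gt = ndimage.label(gt)
    pred_lbl, n_pred = ndimage.label(pred)
    scores, matched_preds = [], set()

    for i in range(1, n_gt + 1):
        lesion = gt_lbl == i
        # Dilate slightly to capture predictions at the lesion boundary.
        roi = ndimage.binary_dilation(lesion, iterations=dilate_iters)
        hits = set(np.unique(pred_lbl[roi])) - {0}
        matched_preds |= hits
        pred_here = np.isin(pred_lbl, list(hits))
        denom = lesion.sum() + pred_here.sum()
        scores.append(2.0 * np.logical_and(lesion, pred_here).sum() / denom
                      if denom else 0.0)

    # Each false-positive component unmatched to any lesion scores 0.
    scores.extend([0.0] * (n_pred - len(matched_preds)))
    return float(np.mean(scores)) if scores else 1.0
```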

4.3 Future of the Challenge

While we initially released training data (i.e., MRI images and STAPLE consensus segmentations in NIfTI format) through Zenodo [31], we have plans for a more comprehensive data release. This expanded dataset will include raw DICOM data, individual observer segmentations, and relevant clinical metadata for both training and held-out evaluation sets. The inclusion of individual observer segmentations may be particularly valuable in ambiguity modeling experiments for deep learning uncertainty quantification [90, 91]. This extensive data release will be accompanied by a detailed data descriptor to facilitate its use by the research community. Furthermore, we intend to publish a post-challenge summary paper in a high-impact, field-specific journal. This paper will delve into meta-analytic approaches to comprehensively characterize algorithm results, including combined participant algorithms, inter-algorithm variability, additional subanalyses, and ranking stability, in a similar vein to previous post-challenge analyses [8, 32]. Eligible participants will be invited to co-author this manuscript, fostering collaborative insight into the challenge outcomes.

While there are currently no concrete plans for a second edition of HNTS-MRG, we remain open to the possibility of future iterations that could significantly enhance the challenge’s scope and impact. Such future editions could potentially incorporate a wider array of imaging sequences and timepoints (i.e., a greater number of intra-treatment images) [92], leveraging the full capabilities of MRI in adaptive RT for HNC. We also envision the inclusion of data from multiple institutions, which would not only increase the dataset size but also introduce valuable diversity in imaging protocols and patient populations, improving the generalizability of the resulting algorithms. This approach would mirror the successful strategy employed by the HECKTOR series of challenges [14, 93, 94], which has seen progressive data enlargement and diversification over the years. By broadening our dataset in these ways, future iterations of HNTS-MRG could offer even more robust insights into the performance of MRI-guided RT segmentation algorithms.

5 Conclusions

This paper presented a comprehensive overview of the HNTS-MRG 2024 challenge, focusing on the automated analysis of MRI images in HNC patients. The challenge explored two critical tasks: fully automated pre-RT segmentation (Task 1) and mid-RT segmentation (Task 2). Utilizing a robust dataset of 200 HNC cases (150 for training, 50 for final testing), this challenge garnered significant interest from leading research teams worldwide, resulting in 20 high-quality papers showcasing a diverse array of innovative methods. Task 1 algorithm performance was generally high and consistent with previous similar tumor segmentation challenges (e.g., HECKTOR 2022). Top-performing algorithms for Task 1 achieved DSCagg-mean scores comparable to or exceeding clinician IOV, with minimal differences between leading methods. Task 2 proved more challenging, as expected, with lower model performance compared to Task 1. Notably, the best-performing algorithms in Task 2 surpassed both our baseline models and clinician IOV, demonstrating the potential of advanced auto-segmentation methods in adaptive RT workflows. Across both tasks, algorithms consistently found GTVp structures more difficult to segment than GTVn structures, mirroring trends in clinician IOV. To further advance this field, future work should focus on harmonizing tumor segmentation guidelines for clinicians, investigating additional segmentation performance metrics, and expanding the patient cohort.