Abstract
Radiation therapy (RT) is essential in treating head and neck cancer (HNC), with magnetic resonance imaging (MRI)-guided RT offering superior soft tissue contrast and functional imaging. However, manual tumor segmentation is time-consuming and complex, and therefore remains a challenge. In this study, we present our solution as team TUMOR to the HNTS-MRG24 MICCAI Challenge, which focuses on the automated segmentation of primary gross tumor volume (GTVp) and metastatic lymph node gross tumor volume (GTVn) in pre-RT and mid-RT MRI images. We utilized the HNTS-MRG2024 dataset, which consists of 150 MRI scans from patients diagnosed with HNC, including original and registered pre-RT and mid-RT T2-weighted images with corresponding segmentation masks for GTVp and GTVn. We employed two state-of-the-art deep learning models, nnUNet and MedNeXt. For Task 1, we pretrained the models on registered pre-RT and mid-RT images, followed by fine-tuning on the original pre-RT images. For Task 2, we combined registered pre-RT images, registered pre-RT segmentation masks, and mid-RT data as a multi-channel input for training. Our solution for Task 1 achieved 1st place in the final test phase with an aggregated Dice Similarity Coefficient of 0.8254, and our solution for Task 2 ranked 8th with a score of 0.7005. The proposed solution is publicly available in a GitHub repository.
1 Introduction
Radiation therapy (RT) is a fundamental treatment modality for various malignancies, with head and neck cancer (HNC) being a primary beneficiary. Traditional RT planning has largely relied on computed tomography (CT) imaging. However, recent advancements have driven significant interest in magnetic resonance imaging (MRI)-guided RT. MRI offers superior soft tissue contrast compared to CT and enables functional imaging through multiparametric sequences, such as diffusion-weighted imaging. Additionally, MRI-guided RT facilitates daily adaptive treatment using MRI-Linac devices, optimizing tumor destruction while minimizing adverse effects. These advantages suggest that MRI-guided adaptive RT has the potential to revolutionize clinical practice for HNC [1, 2].
Despite these benefits, MRI-guided RT planning generates extensive data, making manual tumor segmentation by physicians (the current clinical standard) a time-consuming and impractical process. This challenge is intensified by the complex anatomy of head and neck (H&N) tumors, which are notoriously difficult to delineate accurately. As a result, there is growing interest in leveraging artificial intelligence (AI) to automate and improve the segmentation process.
Deep learning (DL), a subset of AI, has shown remarkable success in medical image segmentation, particularly in challenging domains like HNC. Various public challenges, such as the HECKTOR [3] and SegRap [4] challenges, have driven advancements in this field by providing datasets and benchmarks for AI model development. However, no large-scale, publicly available datasets for MRI-guided RT in HNC exist, highlighting the need for community-driven efforts to develop AI tools for clinical translation.
The HNTS-MRG24 challenge addresses this gap by focusing on the segmentation of H&N tumors in MRI-guided adaptive RT. The challenge is divided into two tasks:
Task 1: Segmentation of primary gross tumor volume (GTVp) and metastatic node gross tumor volume (GTVn) on pre-RT MRI images.
Task 2: Segmentation of GTVp and GTVn on mid-RT MRI images. In this task, the mid-RT image, the original pre-RT image with its segmentation, and the registered pre-RT image with its registered segmentation are all available and can be used as input.
A unique aspect of this challenge is its exploration of whether incorporating prior time point data (pre-RT and mid-RT) into segmentation algorithms can enhance performance in RT applications.
Given the potential of AI to streamline and enhance MRI-guided RT planning, the development of robust, automated segmentation algorithms could significantly impact clinical workflows, reducing the burden on clinicians and improving patient outcomes.
This paper presents our approach to the HNTS-MRG24 challenge, utilizing state-of-the-art DL models, i.e., nnUNet [5] and MedNeXt [6, 7], and ensemble techniques to achieve accurate and reliable segmentation of H&N tumors.
State of the Art
Recent advancements in AI-based segmentation of HNC have demonstrated significant potential, particularly in the context of RT planning. Various studies have utilized DL models to automate and improve the segmentation of tumors and organs at risk, addressing the challenges posed by the complex anatomy of the H&N region.
Li et al. (2020) [8] proposed a semi-supervised framework for medical image segmentation using deep convolutional neural networks. Their method combines labeled and unlabeled data to improve segmentation accuracy by generating pseudo labels and iteratively refining the model. The framework incorporates ensemble learning to reduce errors from poor-quality pseudo labels. The approach was evaluated on the ISIC 2018 dataset for skin lesion segmentation [9, 10], and it demonstrated superior performance compared to fully supervised models and earlier semi-supervised methods.
Astaraki et al. (2023) [11] focused on nasopharyngeal carcinoma, a subset of HNC, in the SegRap 2023 challenge. They developed a fully automated segmentation framework using a standard 3D U-Net model, which was effective in segmenting both GTVs and organs at risk from CT images. Their approach achieved first place in the second task of the SegRap 2023 challenge, underscoring the robustness of the U-Net architecture for this task.
Myronenko et al. (2022) [12] presented a fully automated solution for H&N tumor segmentation using positron emission tomography (PET)/CT images in the HECKTOR 2022 [12] challenge. They employed the SegResNet architecture from MONAI, a semantic segmentation network optimized for 3D medical imaging. Their approach included 5-fold cross-validation, image normalization, and model ensembling, which helped achieve first place in the challenge.
These studies collectively highlight the ongoing efforts to refine DL techniques for HNC segmentation, with a particular focus on improving accuracy, handling data imbalance, and integrating multi-modal imaging data. The continued development of AI-driven segmentation tools holds promise for enhancing the precision and efficiency of RT planning, ultimately improving patient outcomes.
The remainder of this paper is structured as follows: Sect. 2 covers the materials and methods used in our approach, Sect. 3 presents the results and evaluation of our models, Sect. 4 discusses the findings, and Sect. 5 concludes with a summary and future directions.
2 Materials and Methods
2.1 Dataset
HNTS-MRG24 Dataset: For this study, we utilized the dataset provided by the HNTS-MRG24 challenge, which focuses on the segmentation of H&N tumors for MRI-guided adaptive RT. The dataset comprises MRI images from 150 patients diagnosed with HNC, collected at The University of Texas MD Anderson Cancer Center. It includes both pre-RT and mid-RT T2-weighted (T2w) MRI scans, with corresponding segmentation masks for GTVp and GTVn.
An important aspect of the dataset is the pre-registration of pre-RT images to mid-RT images, which was performed by the challenge organizers. This registration process was designed to align the images spatially, facilitating more accurate comparison and analysis between the different time points. Detailed parameters for this registration process can be found in the challenge's official GitHub repository.
For Task 1, we utilized the mid-RT and pre-RT registered images to pretrain our models, followed by fine-tuning on the original pre-RT images. During this phase, all 150 cases were included, and no cases were discarded.
For Task 2, the input consisted of a multi-channel format combining mid-RT, pre-RT registered images, and the corresponding pre-RT registered segmentation. All 150 cases were used to train nnUNet, but we encountered issues while training MedNeXt. To resolve these issues, cases with zero ground truth for either label 1, label 2, or both were discarded, leaving 115 samples after removing 35 cases. An example of a pre-RT image with its segmentation is shown in Fig. 1. All the visualizations in this paper were created using 3D Slicer [13].
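For illustration, the case filtering can be sketched as follows (a minimal sketch using nibabel; the folder layout and file names are our assumptions rather than the challenge's actual naming):

```python
import numpy as np
import nibabel as nib
from pathlib import Path

def has_both_labels(mask_path: Path) -> bool:
    """True if the mask contains both GTVp (label 1) and GTVn (label 2)."""
    mask = np.asanyarray(nib.load(str(mask_path)).dataobj)
    return bool((mask == 1).any()) and bool((mask == 2).any())

data_root = Path("HNTSMRG24_train")  # assumed folder layout: one folder per case
kept = [case for case in sorted(data_root.iterdir())
        if has_both_labels(case / "preRT_mask_registered.nii.gz")]
print(f"Kept {len(kept)} cases")  # 115 of 150 remained in our setting
```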
BraTS24 Meningioma Radiotherapy Dataset: In addition to the challenge data, external public datasets were permitted. However, finding datasets that matched the challenge criteria (T2w MRI, at least two segmentation labels, H&N region) proved challenging. We experimented with the BraTS 2024 Meningioma RT Segmentation Challenge dataset [14], which contains 500 samples and focuses on the segmentation of the GTV for meningiomas in brain MRI scans. This dataset consists of 3D post-contrast T1w images in which extracranial structures are preserved, with defacing applied for anonymization.
We used all 500 cases of the BraTS dataset specifically for Task 1 of the challenge and applied it as pretraining data for our models.
2.2 Networks
The effectiveness of two state-of-the-art models in DL, nnUNet and MedNeXt, was tested in both tasks of the HNTS-MRG24 challenge. Their performance was compared across different architectures and training strategies. A detailed description of the architectures used for each task will follow.
nnUNet: nnUNet is an automated DL framework that self-configures based on the properties of the input data. It has proven effective across a wide range of medical image segmentation tasks due to its adaptability and robust performance [5]. Specifically, we employed the following architectures, each trained for 5 folds:
- 3D Full Resolution (FullRes) U-Net with the default planner
- 3D U-Net Cascade with the default planner
- 3D FullRes U-Net with Large Residual Encoder (ResEnc) presets (nnUNetPlannerResEncLPlans)
For further information about the different nnUNet configurations and planners, please refer to its documentation.
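As a rough orientation, the three trainings can be launched through the public nnUNet v2 command line, wrapped in Python below; the dataset ID is a placeholder, and the ResEnc plans identifier should be verified against the installed nnUNet version:

```python
import subprocess

dataset_id = "501"  # placeholder dataset ID
for fold in range(5):
    # 3D FullRes with the default plans
    subprocess.run(["nnUNetv2_train", dataset_id, "3d_fullres", str(fold)], check=True)
    # 3D cascade (the 3d_lowres stage must be trained before the cascade stage)
    subprocess.run(["nnUNetv2_train", dataset_id, "3d_lowres", str(fold)], check=True)
    subprocess.run(["nnUNetv2_train", dataset_id, "3d_cascade_fullres", str(fold)], check=True)
    # 3D FullRes with the large residual-encoder presets (plans name may vary by version)
    subprocess.run(["nnUNetv2_train", dataset_id, "3d_fullres", str(fold),
                    "-p", "nnUNetResEncUNetLPlans"], check=True)
```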
MedNeXt: MedNeXt is a DL architecture tailored for medical image analysis, particularly effective in handling varying image modalities [6, 7]. We utilized the following architectures, each for 5 folds:
- Small model with \(3 \times 3 \times 3\) kernel size
- Small model with \(5 \times 5 \times 5\) kernel size
- Large model with \(3 \times 3 \times 3\) kernel size
- Large model with \(5 \times 5 \times 5\) kernel size
When referring to small and large MedNeXt models, the terms relate to the compound scaling of the model, i.e., the simultaneous scaling of depth (number of layers), width (number of channels), and receptive field (kernel size). The small configuration (MedNeXt-S) utilizes 32 channels, an expansion ratio of 2, and a block count of 2. The largest architecture (MedNeXt-L) consists of 62 MedNeXt blocks and uses high values of both expansion ratio and block count [6].
The decision to explore these configurations (kernel sizes \(3 \times 3 \times 3\) and \(5 \times 5 \times 5\) for small and large models) is supported by findings in the MedNeXt study [6]. Smaller kernels, such as \(3 \times 3 \times 3\), provide a robust baseline with balanced computational efficiency and performance. Larger kernels, such as \(5 \times 5 \times 5\), leverage the ability of MedNeXt to learn long-range spatial dependencies, which are particularly useful for medical images with complex anatomical structures.
The MedNeXt study also demonstrates that MedNeXt-L outperforms or is competitive with smaller variants across tasks involving heterogeneous datasets (brain and kidney tumors, organs), varying modalities (CT, MRI), and diverse training set sizes. On the other hand, MedNeXt-S can be more computationally efficient and data-efficient, which are important considerations in medical image segmentation, where computational resources may be limited and datasets are often small [6].
For further information about the different configurations of MedNeXt, please visit its documentation.
2.3 Ensemble Strategy
To enhance the performance of our segmentation models, we implemented a multi-level ensemble strategy. First, we applied the default ensembling method of each framework: nnUNetv2_ensemble for the nnUNet models and mednextv1_ensemble for the MedNeXt models, each combining the predictions from the different architectures of its framework. Both commands perform an average ensemble of the predicted probability maps.
After obtaining the ensembled predictions from nnUNet and MedNeXt separately, the average ensemble method was applied to these two outputs to produce the final segmentation result. However, because the probability maps generated by MedNeXt and nnUNet differ in size, additional preprocessing was required: we padded the MedNeXt probability maps to match the size of the original compressed NIfTI images. Once the probability maps were aligned, we computed their average and converted the averaged probability map into a segmentation image.
2.4 Methodology
The training and ensemble strategies, as well as the incorporation of external data, are described below.
Task 1: For Task 1, the different configurations of nnUNet and MedNeXt (as described in Sect. 2.2) were pretrained on mid-RT and registered pre-RT images as individual inputs, and then fine-tuned on the original pre-RT images. After training all the models, ensembling strategies were applied to improve performance. Initially, every possible combination of nnUNet models (FullRes, Cascade, and ResEnc) was aggregated. Following this, the outputs of MedNeXt were combined with the best-performing combination of nnUNet models to evaluate possible improvements.
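The warm start itself can be expressed with nnUNet v2's pretrained-weights mechanism, roughly as sketched below; the dataset IDs and checkpoint path are placeholders, and MedNeXt fine-tuning followed the analogous v1-style workflow:

```python
import subprocess

# 1) Pretrain on registered pre-RT + mid-RT images (placeholder Dataset 501).
subprocess.run(["nnUNetv2_train", "501", "3d_fullres", "0"], check=True)

# 2) Fine-tune on the original pre-RT images (placeholder Dataset 502),
#    initialized from the pretraining checkpoint. The path follows the usual
#    nnUNet results layout but is illustrative; nnUNet additionally expects
#    compatible plans between the two datasets (see its fine-tuning docs).
ckpt = ("nnUNet_results/Dataset501_Pretrain/"
        "nnUNetTrainer__nnUNetPlans__3d_fullres/fold_0/checkpoint_final.pth")
subprocess.run(["nnUNetv2_train", "502", "3d_fullres", "0",
                "-pretrained_weights", ckpt], check=True)
```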
To explore the effectiveness of the BraTS dataset for the HNC segmentation task, nnUNet FullRes was first pretrained on the BraTS dataset (500 samples) and then fine-tuned on the original pre-RT images. For this experiment, to handle the transition from the BraTS dataset (a single-channel output) to the HNTS-MRG24 dataset, which requires segmentation of two channels (GTVp and GTVn), two labels were assigned instead of one during the pretraining phase. This adjustment ensured that the output layer of the pretrained model had two channels, making it compatible with fine-tuning on the HNTS-MRG24 dataset without further modification to the output layer.
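One plausible reading of this adjustment, sketched as the nnUNet v2 dataset.json used for the BraTS pretraining stage (the dataset name is a placeholder): both foreground labels are declared even though the BraTS masks only ever contain one:

```python
import json
from pathlib import Path

dataset_json = {
    "channel_names": {"0": "T1c"},  # BraTS meningioma RT uses post-contrast T1w
    # Declare both target labels so the pretrained network's output layer
    # matches the two-label HNTS-MRG24 fine-tuning stage; label 2 simply
    # never occurs in the BraTS masks.
    "labels": {"background": 0, "GTVp": 1, "GTVn": 2},
    "numTraining": 500,
    "file_ending": ".nii.gz",
}
out = Path("nnUNet_raw/Dataset600_BraTSMenRT/dataset.json")  # placeholder ID
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(dataset_json, indent=2))
```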
Additionally, nnUNet FullRes was pretrained on a combined dataset of BraTS and mid-RT and pre-RT registered images (800 samples) to assess the benefit of combining external data with challenge-specific data. Following this, fine-tuning was performed on the original pre-RT images.
Task 2: To evaluate the potential value of pre-RT images and their segmentation masks for mid-RT segmentation, nnUNet FullRes and the MedNeXt small model with kernel size 3 were trained on four different datasets (the multi-channel file layout is sketched after this list):
- Mid-RT images only, referred to as Dataset 504.
- Mid-RT and registered pre-RT images as a multi-channel input, referred to as Dataset 505.
- Mid-RT images, registered pre-RT images, and registered pre-RT segmentation masks as a multi-channel input, referred to as Dataset 506.
- Mid-RT images and registered pre-RT segmentation masks as a multi-channel input, referred to as Dataset 507.
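In nnUNet's convention, the channels of a multi-channel input are stored as separate files that share a case identifier and differ only in a _0000, _0001, ... suffix. Dataset 506 would then look roughly as follows (dataset and case names are illustrative):

```python
# Illustrative raw-data layout for Dataset 506 (three input channels):
#
# nnUNet_raw/Dataset506_MidRT/
#   imagesTr/
#     case_001_0000.nii.gz   # channel 0: mid-RT T2w image
#     case_001_0001.nii.gz   # channel 1: registered pre-RT T2w image
#     case_001_0002.nii.gz   # channel 2: registered pre-RT segmentation mask
#   labelsTr/
#     case_001.nii.gz        # mid-RT ground truth (0 = bg, 1 = GTVp, 2 = GTVn)
channel_names = {"0": "midRT_T2w", "1": "preRT_T2w_reg", "2": "preRT_mask_reg"}
```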
After identifying the best dataset, all model configurations mentioned in Sect. 2.2 were trained on it. The same ensembling strategy as in Task 1 was applied: first aggregating combinations of nnUNet models (FullRes, Cascade, ResEnc), and then combining the outputs of the best-performing nnUNet combination with the MedNeXt predictions to evaluate possible improvements.
2.5 Evaluation
The validation set was evaluated using the Aggregated Dice Similarity Coefficient (\(\text{DSC}_{\text{agg}}\)) [15], in line with the challenge evaluation standards:
\[\text{DSC}_{\text{agg}} = \frac{2 \sum_{i} |A_i \cap B_i|}{\sum_{i} \left( |A_i| + |B_i| \right)}\]
In this context, \(A_i\) and \(B_i\) represent the ground truth and predicted segmentations for image \(i\), respectively, where \(i\) ranges across the entire test set.
Additionally, the conventional DSC was calculated for each label (GTVp, GTVn) on a per-sample basis, for each model [16]:
\[\text{DSC} = \frac{2 |A_i \cap B_i|}{|A_i| + |B_i|}\]
where \(A_i\) represents the ground truth and \(B_i\) the predicted segmentation for image \(i\).
For cases with zero ground truth (\( |A_i| = 0 \)), the predictions were checked to determine whether the model produced a true empty segmentation (no tumor predicted for a sample with zero ground truth). In such cases, the DSC was assigned a value of 1. Conversely, if the model produced a non-empty segmentation for a sample with zero ground truth, the DSC was assigned a value of 0. After addressing these scenarios, the mean and standard deviation (STD) of the DSC were calculated across all samples to further assess model performance.
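A compact NumPy reference for both metrics, including the empty-ground-truth convention just described, might look like this (our own sketch, not the challenge's official scoring code):

```python
import numpy as np

def dsc_agg(gts, preds, label):
    """Aggregated DSC: pool intersections and volumes over all cases [15]."""
    inter = sum(np.logical_and(g == label, p == label).sum() for g, p in zip(gts, preds))
    total = sum((g == label).sum() + (p == label).sum() for g, p in zip(gts, preds))
    return 2.0 * inter / total if total > 0 else 1.0

def dsc_per_sample(gt, pred, label):
    """Per-sample DSC with the convention above: an empty ground truth scores
    1 if the prediction is also empty, and 0 otherwise."""
    g, p = gt == label, pred == label
    if not g.any():
        return 1.0 if not p.any() else 0.0
    return 2.0 * np.logical_and(g, p).sum() / (g.sum() + p.sum())
```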
3 Results
The experiments were conducted on a cluster node of the Institute for Artificial Intelligence in Medicine (IKIM) in Essen, Germany. The node has six NVIDIA RTX 6000 GPUs with 48 GB of VRAM, 1024 GB of RAM, and an AMD EPYC 7402 24-core processor. The software environment included Python 3.9.19, PyTorch 2.3.1+cu121, nnUNet 2.5, and MedNeXt 1.7.0.
Task 1:
The performance of the different configurations of the nnUNet and MedNeXt models was evaluated using \(\text{DSC}_{\text{agg}}\). While we intended to train all configurations of both models, we encountered issues with certain MedNeXt configurations: the small model with kernel size 3 was the only one that trained successfully, while the other configurations kept collapsing after a few epochs. nnUNet, on the other hand, was successfully trained in all three configurations: FullRes, ResEnc, and Cascade.
For nnUNet, we applied its default ensembling strategy (nnUNetv2_ensemble) to combine predictions from the FullRes, ResEnc, and Cascade configurations; every possible combination of these models was ensembled. We also attempted an additional step of averaging the best predictions from nnUNet (the ensemble of Cascade and ResEnc) and MedNeXt. An overview of these results is given in Table 1.
In addition to \(\text{DSC}_{\text{agg}}\), the mean DSC and STD were also calculated for each predicted label (GTVp, GTVn) across all samples for each model configuration, to further investigate model stability and robustness under different conditions. These results are shown in Table 2.
Figure 2 shows a comparison between the best-performing MedNeXt prediction, the worst-performing average ensemble of nnUNet and MedNeXt, and the ground truth segmentation.
Fig. 2. Comparison of predicted segmentations of two pre-RT samples (Cases 78 and 166) for Task 1. The left image shows the prediction from MedNeXt with the best \(\text{DSC}_{\text{agg}}\), the middle image shows the prediction from the average ensemble of nnUNet and MedNeXt, which had the lowest \(\text{DSC}_{\text{agg}}\), and the right image shows the ground truth segmentation. The green label represents GTVp (label = 1), and the yellow label represents GTVn (label = 2). (Color figure online)
Additionally, the inference performance of the pretrained models was compared with that of the fine-tuned models, see Tables 1 and 3. Interestingly, the pretrained models performed better on the original pre-RT images than the models that were fine-tuned on them.
We experimented with the BraTS dataset as external data: it was used for pretraining, either alone or combined with the mid-RT and registered pre-RT images, after which the models were fine-tuned on the original pre-RT images. The results of these experiments are presented in Table 4.
For Task 1, the final submission used the MedNeXt small model with kernel size 3. It achieved a \(\text {DSC}_{\text {agg}}\) of 0.8728 for GTVn and 0.7780 for GTVp, with an overall mean \(\text {DSC}_{\text {agg}}\) of 0.8254 in the final test phase on the 50 test patients.
Task 2:
For this task, nnUNet FullRes and MedNeXt small model with kernel size 3 were trained on four datasets with different combinations of mid-RT images, registered pre-RT images, and their segmentation masks (see Sect. 2.4).
nnUNet FullRes was successfully trained in all experiments. However, while training the MedNeXt small model with kernel size 3 on Dataset 506, the model collapsed after a few hundred epochs for some folds. Table 5 outlines the number of epochs each fold was trained for. Despite the incomplete training, we proceeded with inference for the MedNeXt small model with kernel size 3 on Dataset 506.
To address the training issues with MedNeXt on Dataset 506, we discarded the 35 samples in which either label 1, label 2, or both were zero in the segmentation mask of the registered pre-RT image. This resulted in 115 samples (Dataset 516), which were then used to train MedNeXt. The results are presented in Table 6.
After experimenting on the different datasets and identifying the best one, we used the same architectures for Task 2 as in Task 1 to compare the different models, configurations, and ensemble strategies. All configurations of nnUNet trained successfully, whereas the other MedNeXt architectures (small model with kernel size 5, large model with kernel size 3, and large model with kernel size 5) collapsed after only a few epochs. As in Task 1, we ensembled all possible combinations of nnUNet predictions using the default nnUNet ensembling (nnUNetv2_ensemble), and then ensembled the best nnUNet predictions (ensemble of Cascade and FullRes) with the MedNeXt predictions (from Dataset 506, despite incomplete training on some folds) using the average ensembling method. An overview is given in Table 7.
In addition to \(\text{DSC}_{\text{agg}}\), we also calculated the mean DSC and STD for each predicted label (GTVp, GTVn) across all samples for each model configuration, to further evaluate model robustness and performance consistency under varying conditions. These results are provided in Table 8.
Figure 3 presents a comparison between the best-performing nnUNet model (ensemble of Cascade and FullRes) and the worst-performing average ensemble of nnUNet and the MedNeXt model trained on Dataset 506.
Fig. 3. Comparison of segmentation predictions of two mid-RT samples (Cases 78 and 166) for Task 2. The left image shows the prediction from the ensemble of nnUNet Cascade and FullRes with the best \(\text{DSC}_{\text{agg}}\), the middle image shows the prediction from the average ensemble of nnUNet and MedNeXt, which had the lowest \(\text{DSC}_{\text{agg}}\), and the right image shows the ground truth segmentation. The green label represents GTVp (label = 1), and the yellow label represents GTVn (label = 2). (Color figure online)
For Task 2, the final submission used an nnUNet ensemble of FullRes and Cascade models. It achieved a \(\text {DSC}_{\text {agg}}\) of 0.8519 for GTVn and 0.5491 for GTVp, with an overall mean \(\text {DSC}_{\text {agg}}\) of 0.7005 in the final test phase on the 50 test patients.
4 Discussion
The results for Task 1 and Task 2 provide insights into the performance of various configurations of nnUNet and MedNeXt models, as well as the impact of pretraining with external datasets and ensembling strategies.
For Task 1, the MedNeXt small model with kernel size 3 achieved the best performance with the highest \(\text {DSC}_{\text {agg}}\), outperforming all configurations of nnUNet, see Table 1. As a result, we chose the MedNeXt small kernel size 3 configuration as our final submission for Task 1.
Despite the fact that the MedNeXt small model with kernel size 3 outperformed all nnUNet models, the other MedNeXt architectures faced stability issues during training. Specifically, the MedNeXt small model with kernel size 5, the large model with kernel size 3, and the large model with kernel size 5 repeatedly collapsed too early in training (after only a few epochs), so these models were not trained for enough epochs to be used effectively. Consequently, we were unable to fully compare the performance of the different MedNeXt architectures and use the mednextv1_ensemble method to aggregate predictions from various configurations of MedNeXt, as originally planned.
Interestingly, the average ensemble of nnUNet and MedNeXt predictions led to a lower \(\text {DSC}_{\text {agg}}\) than using either model independently, suggesting that while both models have strengths, averaging their predictions may have introduced inconsistencies that reduced performance. Specifically, the average ensembling approach increased the number of false negatives and false positives, while also decreasing the number of true positives. This imbalance likely contributed to the overall drop in \(\text {DSC}_{\text {agg}}\).
The comparison of mean DSC ± STD values shows that MedNeXt consistently achieved a higher mean DSC with less variability, for both GTVp and GTVn, see Table 2. This indicates greater robustness in its segmentation performance across different samples and superior overall performance. The nnUNet models had higher variability and lower mean DSC, particularly in GTVp predictions. In conclusion, MedNeXt proved to be a stronger candidate for reliable segmentation compared to nnUNet.
The models pretrained on the registered pre-RT and mid-RT images had a higher \(\text{DSC}_{\text{agg}}\) than those that were fine-tuned on the original pre-RT images, see Table 3. This observation can be attributed to the fact that the pretraining process used all available input samples, without a separate validation set. Since the original pre-RT images and the registered pre-RT images are highly similar, it can be expected that the pretrained models, which were trained on the registered images, would perform better on data that is quite similar to what was already seen. In other words, the evaluation of these pretrained models was not a reliable measure of their true performance. In contrast, the fine-tuned models likely had a lower \(\text{DSC}_{\text{agg}}\) because they were fine-tuned using a 5-fold cross-validation setup. We therefore concluded that the fine-tuned models would most likely generalize better on unseen test data.
When we experimented with pretraining nnUNet using the BraTS dataset, the results were mixed. Pretraining on BraTS alone led to poor performance, particularly for GTVn segmentation, see Table 4. Several factors likely contributed to this outcome. First, the BraTS dataset consists of images of the brain, while our challenge data includes the more anatomically complex H&N region. Second, the BraTS dataset uses T1w MRI images, whereas our challenge data is T2w, which may have caused discrepancies in the features learned during pretraining. Finally, the BraTS dataset contains only one label (tumor region), whereas our challenge requires segmentation of two distinct labels (GTVp and GTVn). These differences likely inhibited the ability of the pretrained model to generalize well to the challenge-specific data. However, when BraTS was combined with mid-RT and pre-RT registered images, there was a notable improvement, although it still did not surpass the models trained solely on challenge-specific data. This dataset was less aligned with the challenge’s specific needs, and thus its contribution to the final model performance was limited.
For Task 2, the impact of including registered pre-RT images and their segmentation masks in the multi-channel input was evaluated. It was observed that using registered pre-RT images alone, without their segmentation masks, did not provide useful information for segmenting mid-RT images, whereas using only the segmentation masks of the registered pre-RT images along with the mid-RT images resulted in a significant improvement. Including both the registered pre-RT images and their segmentation masks further improved the performance of nnUNet FullRes, which achieved the highest \(\text{DSC}_{\text{agg}}\) for the mid-RT segmentation task when trained on Dataset 506 (see Table 6).
MedNeXt faced stability issues when trained on Dataset 506, collapsing after a few hundred epochs for several folds, see Table 5. Despite the incomplete training, the MedNeXt model trained on Dataset 506 outperformed the models trained on Datasets 504 and 505. Interestingly, the MedNeXt model trained on Dataset 507 achieved the best performance among the MedNeXt models for Task 2. The observation that MedNeXt performed better on Dataset 507 than on Dataset 506, in contrast to nnUNet, is likely due to its inability to successfully complete 1000 epochs for all folds on Dataset 506. To address these training challenges, the dataset was refined by discarding samples with zero ground truth for either label, resulting in a stable training process for MedNeXt on Dataset 516. Nevertheless, MedNeXt showed its best performance when trained on Dataset 507.
Both the nnUNet and MedNeXt models trained on datasets that included segmentation masks of the registered pre-RT images (506 and 507) performed better than those trained on Datasets 504 and 505, further demonstrating the importance of including segmentation masks in the input data. This improvement can be attributed to the fact that the primary difference between mid-RT and pre-RT images lies in the size of the GTVn and GTVp, as mid-RT images are taken after some RT treatments. By incorporating pre-RT images and their segmentations into the model's input, the model can better locate the region of the GTVn and GTVp in the mid-RT images and can be more accurate in localization and segmentation.
The other MedNeXt architectures, specifically the small model with kernel size 5, the large model with kernel size 3, and the large model with kernel size 5, faced stability issues during training in Task 2 as well. They repeatedly collapsed too early in training (after only a few epochs), so these models were not trained for enough epochs to be used effectively. As a result, it was not possible to compare the performance of the different MedNeXt architectures and use the mednextv1_ensemble method to aggregate predictions from various configurations of MedNeXt, as originally planned.
The comparison of the different models highlights that nnUNet, particularly the ensemble of the FullRes and Cascade models, outperformed MedNeXt in terms of \(\text{DSC}_{\text{agg}}\), which is the primary ranking metric of the challenge, see Table 7. As a result, the nnUNet ensemble of FullRes and Cascade was chosen as the final model for Task 2.
The ensemble of nnUNet and MedNeXt predictions for Task 2, using the average ensembling method, resulted in a lower \(\text {DSC}_{\text {agg}}\) than using either model independently. This suggests that averaging predictions between models may not have fully captured their individual strengths and could have introduced inconsistencies in the final segmentation. Specifically, the average ensembling approach increased the number of false negatives and false positives, while reducing the number of true positives, ultimately lowering the \(\text {DSC}_{\text {agg}}\) and overall performance.
The comparison of mean DSC ± STD values shows that the nnUNet ensemble of all configurations generally achieved a higher mean DSC with less variability, especially for GTVn, see Table 8. This indicates greater robustness in its segmentation performance across samples. For GTVp, MedNeXt outperformed the nnUNet models; however, MedNeXt had higher variability and a lower mean DSC in its GTVn predictions. This suggests that its stability issues, variability in results, and lower \(\text{DSC}_{\text{agg}}\) made it less reliable than the more robust and stable nnUNet ensembles.
For practitioners looking to implement these approaches, the choice of model depends on the task requirements and available resources. MedNeXt is recommended for its robustness and high performance in GTVp and GTVn segmentation for Task 1; however, training stability should be monitored carefully when using larger configurations or imbalanced datasets. For Task 2, nnUNet is suggested due to its consistent performance and ability to handle multi-channel inputs effectively.
The computational requirements for the two models are also different. MedNeXt needs more GPU memory and longer training time due to its complex design. Training a MedNeXt model takes about 180 s per epoch on an NVIDIA RTX 6000 GPU. In comparison, nnUNet takes about 60 s per epoch under the same conditions. nnUNet’s lower resource usage and faster training make it better for environments with limited resources, while MedNeXt is more suitable for tasks that require detailed spatial modeling and where sufficient resources are available.
5 Conclusion
In this study, we sought to address the problem of segmenting tumor volumes in HNC using MRI data for the purpose of adaptive RT planning. Two DL models, nnUNet and MedNeXt, were tested, with an investigation of diverse architectural configurations, ensemble methodologies, and pretraining on external datasets.
In conclusion, the nnUNet model, particularly when ensemble predictions are leveraged, demonstrated high efficacy in the segmentation tasks. MedNeXt also demonstrated potential, particularly in Task 1, but encountered stability challenges during training for Task 2. Pretraining with domain-specific data proved to be a crucial step in Task 1, and the incorporation of registered pre-RT segmentation masks proved beneficial for enhancing the performance of both models in Task 2. This finding addresses the key question of this challenge, namely whether incorporating prior time point data (pre-RT and mid-RT) into segmentation algorithms can enhance performance in RT applications. The results clearly show that using both registered pre-RT images and their segmentation masks significantly improves the model's ability to accurately segment mid-RT images. Future work could focus on exploring more effective ensemble methods, such as weighted average ensembling, to better combine the strengths of both models. Additionally, addressing the stability issues faced by MedNeXt, particularly in Task 2, may involve adjusting the training process to prevent collapse. Improving the use of external datasets through more sophisticated domain adaptation could also enhance the effectiveness of pretraining.
References
Kiser, K.J., Smith, B.D., Wang, J., Fuller, C.D.: Après mois, le déluge: preparing for the coming data flood in the MRI-guided radiotherapy era. Front. Oncol. 9, 983 (2019)
Pollard, J.M., Wen, Z., Sadagopan, R., Wang, J., Ibbott, G.S.: The future of image-guided radiotherapy will be MR guided. Br. J. Radiol. 90(1073), 20160667 (2017)
Andrearczyk, V., Oreiller, V., Abobakr, M., Akhavanallaf, A., et al.: Overview of the hecktor challenge at MICCAI 2022: automatic head and neck tumor segmentation and outcome prediction in PET/CT. In: Head and Neck Tumor Segmentation and Outcome Prediction. HECKTOR 2022. LNCS, vol. 13626, pp. 1–30. Springer, Cham (2023)
Luo, X., et al.: SegRap2023: a benchmark of organs-at-risk and gross tumor volume segmentation for radiotherapy planning of nasopharyngeal carcinoma. In: MICCAI SegRap 2023 (2023)
Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18(2), 203–211 (2021)
Roy, S., et al.: MedNeXt: transformer-driven scaling of ConvNets for medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2023)
Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 1–9 (2020)
Li, R., Auer, D., Wagner, C., Chen, X.: A generic ensemble based deep convolutional neural network for semi-supervised medical image segmentation. arXiv preprint arXiv:2004.07995 (2020)
Codella, N., et al.: Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (ISIC). arXiv preprint arXiv:1902.03368 (2019)
Tschandl, P., Rosendahl, C., Kittler, H.: The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 (2018)
Astaraki, M., Bendazzoli, S., Toma-Dasu, I.: Fully automatic segmentation of gross target volume and organs-at-risk for radiotherapy planning of nasopharyngeal carcinoma. arXiv:2310.02972 (2023)
Myronenko, A., Siddiquee, M.M.R., Yang, D., He, Y., Xu, D.: Automated head and neck tumor segmentation from 3D PET/CT: HECKTOR 2022 challenge report. arXiv preprint arXiv:2209.10809 (2022)
Pieper, S., Halle, M., Kikinis, R.: 3D slicer. In: 2004 2nd IEEE International Symposium on Biomedical Imaging: Nano to Macro (IEEE Cat No. 04EX821), vol. 1, pp. 632–635 (2004)
LaBella, D., et al.: Brain tumor segmentation (BRATS) challenge 2024: meningioma radiotherapy planning automated segmentation. arXiv preprint arXiv:2405.18383 (2024)
Andrearczyk, V., Oreiller, V., Jreige, M., Castelli, J., Prior, J.O., Depeursinge, A.: Segmentation and classification of head and neck nodal metastases and primary tumors in PET/CT. In: 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 4731–4735 (2022)
Dice, L.R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
Acknowledgments
This work is supported by the Plattform für KI-Translation Essen (KITE) from the REACT-EU initiative (EFRE-0801977, https://kite.ikim.nrw/) and “NUM 2.0” (FKZ: 01KX2121) and FWF enFaced 2.0 (grant number: KLI-1044, https://enfaced2.ikim.nrw/). André Ferreira thanks the Fundação para a Ciência e Tecnologia (FCT) Portugal for the grant 2022.11928.BD.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2025 The Author(s)