Introduction

Solutions such as software and medical devices integrating artificial intelligence (AI) or machine learning (ML) have become common in healthcare1. They are no longer confined to the development stage but are gaining increasing attention for real-world applications, as evidenced by the growing number of FDA-approved medical devices using AI2.

AI-enabled medical devices for decision support, and more precisely the ML models within these devices, have shown performance comparable to—and sometimes even better than—that of experts for different tasks3,4,5. However, they are usually not assessed prospectively or under actual clinical workflow conditions6,7. Hence, their apparent performance does not take into account external sources of perturbations and variations that can affect model behaviour8,9. Moreover, some AI-based solutions are being designed for deployment outside traditional clinical settings, for instance on mobile devices such as smartphones10. This shift raises concerns about the potential introduction of additional sources of perturbations.

The resilience of ML models to variations and perturbations arising from the environment in which they operate is commonly referred to as robustness11. Robustness has also been identified as a key principle in trustworthy AI frameworks. For example, the ALTAI framework considers robustness one of the three core components of trustworthy AI12. Similarly, the FUTURE-AI framework, which proposes guidelines for trustworthy AI in healthcare, considers robustness an essential principle—on par with fairness and explainability—in achieving trustworthy AI in healthcare13. This highlights that robustness is not just a property of a machine learning model but a fundamental principle for the implementation of safe and trustworthy AI, in healthcare and beyond.

However, the practical implications of this term remain somewhat ambiguous. Robustness serves as an overarching concept encompassing various factors that can affect a model differently, depending not only on the nature of the perturbations but also on the development stage of the machine learning model (e.g. acquiring data, selecting and training a model, and evaluating it on new data). Moreover, different types of perturbation can have varying impacts on the performance of ML models. For instance, adversarial attacks—a method of injecting noise into an image while preserving its visual information—have demonstrated a remarkable ability to fool the predictions of deep learning models8. Such blind spots raise legitimate concerns in healthcare, where ML models, especially deep learning models that are often considered black boxes due to their limited interpretability, are employed for critical tasks such as diagnosis, prognosis, or patient monitoring14.

However, it is important to note that the use of “black-box” solutions for managing patient health is neither new nor exclusive to algorithms. A complete understanding of a drug’s effects, especially its adverse effects, cannot be fully achieved through clinical trials alone. Rare or unexpected adverse effects can emerge from deviations in the target population and in drug administration when the drug is used in real-life settings. Furthermore, such effects can also occur within the target population studied during clinical trials once the drug is used on a larger scale. Cases like Thalidomide show that clinical trials cannot capture all adverse events in specific populations, particularly because of ethical constraints and outcomes that remain unobserved during the trials, such as effects on newborns. Similarly, machine learning and deep learning models, and by extension the medical devices that incorporate them, may exhibit unexpected behaviours when applied to populations or contexts beyond those on which they were originally trained and tested. This raises the critical question of how to ensure that these non-deterministic solutions do not deviate or behave unpredictably once deployed in real-world clinical practice. The Thalidomide scandal was a catalyst for the creation of pharmacovigilance, and it now also sparks discussions about applying a similar monitoring framework to AI-based devices in healthcare—often referred to as algorithmovigilance15.

Moreover, the development and deployment of ML models in healthcare involve multiple stakeholders, each with their own concerns and perceptions regarding what constitutes a variation or perturbation for these models. Different stakeholders, including researchers, model developers, operators, healthcare professionals, and patients, perceive and encounter different types of variations. Given these diverse perspectives, the current understanding of robustness as simply handling “variations and perturbations” is inadequate.

Therefore, a clear and comprehensive view of the concepts of robustness is essential to ensure effective communication among all stakeholders in healthcare about what a model is robust against. By identifying the specific causes of perturbations that can occur throughout the life cycle of an ML-based solution in healthcare, it becomes possible to address these issues and mitigate their impact on the model’s performance. Given the need to clarify the concept of machine learning robustness and identify its key characteristics, we conducted a scoping review16,17,18. Specifically, the objectives of this scoping review were (1) to identify the various concepts of robustness currently used in the literature on machine learning models for decision support in healthcare; and (2) to map those concepts of robustness across types of data and predictive models.

Results

General characteristics

The search initially retrieved 8585 records, of which 6920 remained after removing duplicates. Subsequently, 6201 records were excluded based on title and abstract screening. Ten additional records were identified through other methods. After assessing the full text of 729 records, 274 were finally included in our review. Figure 1 provides a summarised overview of the screening process. The list of included studies is available in Supplementary Table 1. Among the included records, 190 (69%) were published in journals, while 81 (30%) originated from conferences, workshops, and symposiums. One record each was from a scholarly dissertation, a preprint, and a book chapter. From these 274 records, we extracted 526 combinations of medical specialty, model type, data type, and robustness concept, presented in the following sections—all summarised in Table 1.

Fig. 1: PRISMA flow diagram.

The flow chart indicates the number of records retained and excluded at each step of the screening process.

Table 1 General characteristics of included studies

The five most frequent domains of application were pulmonology (86/526, 16.3%), followed by gynaecology (85/526, 16.2%), neurology (73/526, 13.9%), dermatology (33/526, 6.3%), and gastroenterology (32/526, 6.1%).

The most common type of data used to build predictive models was image data (167/526, 31.7%), followed by omics data (105/526, 20%) and image-derived features (84/526, 16%). The latter corresponds to cases where features are derived from images before training the predictive models, as opposed to end-to-end learning using images directly. The other category (47/526, 8.9%) encompassed data types such as vocal recordings, gait data, and clinical discharge summaries extracted from electronic health records (EHRs). A complete list of all items categorised as other for each extracted domain is presented in Supplementary Table 2.

The majority of the applications used deep learning-based methods (269/526, 51.1%), followed by non-deep machine learning methods (134/526, 25.5%). The latter includes methods such as decision tree, random forest, gradient boosting, and support vector machine. Hybrid methods, which combine deep learning (for feature extraction) and non-deep learning/linear methods for predictions, were present in only 15 out of the 526 applications (2.9%). Linear regression models accounted for 12.2% (64/526). Other types of predictive models (44/526, 8.4%) comprised models such as naive Bayes, Gaussian processes, and linear/quadratic discriminant analysis.

Eight general concepts of robustness emerged, illustrated in Fig. 2 and described in Table 2: input perturbations and alterations, missing data, label noise, imbalanced data, feature extraction and selection, model specification and learning, external data and domain shift, and adversarial attacks. Examples of notions for each concept are described in Supplementary Table 3. The most frequently addressed concept was robustness to input perturbations and alterations (142/526, 27%), while robustness to imbalanced data (15/526, 3%) was the least commonly tackled. The metrics used to assess robustness with respect to each concept are described in Supplementary Table 4.

Fig. 2: The eight robustness concepts.

The radial dendrogram shows the distribution of the eight robustness concepts based on combinations of predictive models and data types from the included studies. Examples of different sources of variations for each concept are also provided. The arrow indicates the order in which these concepts can arise during the life cycle of a machine learning model.

Table 2 Description of identified concepts with examples

Robustness concepts across types of data and predictive model

We also analysed the robustness concepts based on the data and models used. The results, shown in Fig. 3, reveal that robustness concepts were addressed differently depending on the choice of data and model. We highlight some of these findings below.

Fig. 3: Stratified view of the eight concepts of robustness.

The eight robustness concepts are stratified by data type (a) and predictive model type (b). The emphasis on specific robustness concepts varies depending on the type of data or predictive model used. For instance, panel a shows that the feature extraction and selection concept is more commonly emphasised for high-dimensional tabular data, such as omics or image-derived features. Similarly, panel b shows that adversarial attacks are primarily addressed in deep learning-based applications.

Figure 3a illustrates the different concepts stratified by the type of data used to develop a model. Robustness to feature extraction and selection was mainly emphasised in applications based on image-derived data (33%) and omics data (22%). Adversarial attacks were mainly tackled in applications relying on image data (22%) and physiological signal data (7%). Robustness to missing data was mostly addressed in applications using clinical data (20%). Robustness to label noise was most frequently addressed in image-based applications (23%). Applications relying on omics data (i.e., data obtained through high-throughput measurement of biological molecules) addressed the fewest robustness concepts (five).

Figure 3b illustrates the different concepts stratified by the type of model. Robustness to adversarial attacks was addressed only in applications based on deep learning (15%). Robustness to label noise was also mostly tackled for deep learning models (16%) and hybrid models (13%). External data and domain shift, together with input perturbations and alterations, were the most frequently addressed concepts across all types of models. Applications based on hybrid models addressed the smallest number of concepts (four), while applications based on deep learning covered the most (eight).

An analysis of robustness concepts, stratified by medical specialty, is presented in Supplementary Table 5.

Moreover, an assessment of the combinations of data type and predictive model type, available in Supplementary Fig. 1, revealed that deep learning models were primarily favoured for images, while non-deep approaches, such as non-deep machine learning and linear regression models, were largely used for omics data.

Discussion

Building on the different notions extracted from the studies included in this scoping review, we identified eight general concepts to represent the robustness of machine learning models against different sources of perturbations: input perturbations and alterations, missing data, label noise, imbalanced data, feature extraction and selection, model specification and learning, external data and domain shift, and adversarial attacks. These concepts encompass perturbations and variations that can occur at different stages of a machine learning model’s life cycle. For example, input perturbations and alterations, missing data, label noise, and imbalanced data occur during data acquisition, collection, and preparation; feature extraction and selection and model specification and learning are part of model development; and external data and domain shift and adversarial attacks relate to model validation and deployment, as illustrated in Fig. 2. This classification highlights the diverse nature of robustness in ML solutions for healthcare applications.

The review also highlights the intersection between robustness and other intrinsic machine learning principles such as generalizability, fairness, and explainability. For instance, studies that evaluate the robustness to various data distributions through external validation inherently address the generalizability of the model19,20,21,22. Similarly, addressing the robustness of the model to specific minority groups directly relates to the notion of fairness23,24,25. Assessing the robustness of a model’s learned features to ensure the absence of spurious correlations is also intricately linked to model explainability26,27.
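To give a concrete sense of how external validation probes both robustness and generalizability, the minimal Python sketch below compares a fitted binary classifier's discrimination on the development-site test set with its discrimination on an external cohort; the model and cohort arrays are hypothetical placeholders, and scikit-learn is assumed to be available.

```python
from sklearn.metrics import roc_auc_score

def external_validation_gap(model, internal_test, external_cohort):
    """Compare a fitted binary classifier's AUC on the internal (development-site)
    test set with its AUC on an external cohort (both inputs are hypothetical)."""
    X_int, y_int = internal_test
    X_ext, y_ext = external_cohort
    auc_internal = roc_auc_score(y_int, model.predict_proba(X_int)[:, 1])
    auc_external = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
    # A large positive gap suggests limited robustness to the shift between settings.
    return auc_internal, auc_external, auc_internal - auc_external
```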

Our review also revealed that certain sources of perturbations were more dependent on the choice of predictive model or type of data. For instance, adversarial attacks were used mostly (90%) to assess the robustness of deep learning models trained on image data. This is explained by the nature of adversarial attacks: perturbations designed to keep the data as close as possible to the original from a human, “visual” perspective, which is better suited to images and ECGs than to tabular data, for instance28,29. Only one study addressed adversarial attacks for clinical notes, by applying transformations such as medical concept substitution, word replacement, or adverb removal30. Similarly, robustness to feature selection and extraction was mostly addressed in applications based on omics data (e.g. genomics, proteomics, metabolomics) and image-derived data (e.g. radiomics), both of which are considered high-dimensional. For applications based on such data, it is generally desirable to select only a subset of the most salient or most reliable features, which introduces additional sources of perturbation. Robustness to missing data was mainly addressed for clinical data, in which missingness is inherent. Studies addressing missing data were not limited to a particular missingness mechanism (e.g., missing completely at random, missing at random, missing not at random)31,32. One study based on multimodality discussed robustness in settings where an entire modality was missing33. Furthermore, we observed that the choice of model is influenced by the type of data. For instance, deep learning models are favoured for image-based applications, while non-deep machine learning and linear regression models are often used for high-dimensional data, such as omics. This choice consequently affects the robustness concepts that we retrieved from the literature.
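To illustrate why adversarial attacks pair so naturally with deep learning on images, the sketch below implements the basic fast gradient sign method (FGSM) in PyTorch. The classifier, image tensor, and label are hypothetical inputs, and this is only a minimal example of one attack rather than the specific methods used in the reviewed studies.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Return a copy of `image` shifted by one FGSM step; the perturbation is
    bounded by `epsilon` so the image stays visually close to the original."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Move each pixel slightly in the direction that increases the loss.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```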

Other concepts, such as input perturbations and alterations and label noise, are more general. This is particularly evident in the case of label noise, which can manifest in various forms depending on the dataset. Label noise can stem from training on proxy labels, for instance labels derived from treatment prescriptions34. It can also derive from human uncertainty when relying on the consensus of multiple annotators, often needed in the absence of a pathologically confirmed diagnosis, thereby introducing inherent uncertainty35,36,37. For large datasets, manual labelling by human annotators is often infeasible, and labels are instead inferred automatically using natural language processing (NLP) tools, creating noise in the process38. Moreover, discrepancies may also arise when a label assigned at the image level (e.g. a whole slide image (WSI) or a chest X-ray) fails to accurately capture the heterogeneity present locally in the image (e.g. in patches extracted from the WSI or specific areas of a chest X-ray)39,40.
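A simple way to probe a model's sensitivity to such labelling errors is to inject synthetic noise and measure the resulting performance drop. The NumPy sketch below, which assumes an integer-encoded label array, randomly reassigns a chosen fraction of labels to other classes; it is an illustration rather than a method drawn from the reviewed studies.

```python
import numpy as np

def flip_labels(y, noise_rate, n_classes, seed=0):
    """Return a copy of integer label array `y` in which a fraction `noise_rate`
    of entries is reassigned uniformly at random to a different class."""
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    n_flip = int(noise_rate * len(y_noisy))
    flip_idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    for i in flip_idx:
        other_classes = [c for c in range(n_classes) if c != y_noisy[i]]
        y_noisy[i] = rng.choice(other_classes)
    return y_noisy
```

Retraining or re-evaluating a model at increasing noise rates then yields a simple robustness curve against label noise.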

Our classification also shows that these concepts can be perceived by, and associated with, different stakeholders. Putting the concepts of robustness identified in our review into perspective with the actors involved in the different steps of a machine learning model's life cycle, we find that each concept relates to different stakeholders. For instance, patients’ perceptions of robustness mainly fall within the space of input perturbations and alterations, which generally occur during data acquisition, a step that involves patients. Variations related to external data and domain shift are more likely to be experienced by ML-system operators, patients, and healthcare professionals when the model is deployed and used in a setting for which it was not initially trained. Conversely, variations occurring during model specification and learning are mainly observed by model developers. Adversarial attacks, although effective at exploiting the blind spots of deep learning models, are intentionally generated samples, usually crafted by those developing or validating models. Thus, they are not encountered in real clinical settings, making them less associated with patients, healthcare professionals, or ML-system operators.

Nearly half (47%) of the applications reviewed focused on medical specialties within gynaecology, pulmonology, and neurology. This can be attributed to the availability of easily accessible benchmark datasets. For instance, the ChestX-ray8 dataset, a publicly available dataset of 108,948 chest X-rays covering eight diseases41, contributes to the high number of applications in pulmonology. Similarly, the large number of applications based on omics data can be ascribed to The Cancer Genome Atlas Program (TCGA), a publicly available collection of genomes developed mainly for cancer-related research42. The gaps observed for the least-explored medical applications, predictive models, and data types can be explained by the overall scarcity of machine learning research in these areas.

Our review has some limitations. First, while it provides valuable insights into the different dimensions of robustness of a machine learning model in healthcare, it is not meant to provide or assess methods for “robustifying” a model. This requires additional effort and the involvement of a multidisciplinary team of healthcare professionals, model developers, and AI/ML researchers. For instance, perturbations caused by label noise can be addressed in many ways: at the data collection level, by collecting additional data with more reliable ground truth labels or by correcting existing labelling errors before training the predictive model43; or from the model perspective, where strategies may include developing models that are inherently robust to such noise or adapting model architectures and training strategies to correct noisy labels during training44,45. Second, our review is mostly restricted to studies associated with the terminology ‘robust’, ‘noise’, and ‘perturbation’. This choice was motivated by the frequent use of these terms in the machine learning community and by the need to limit the number of records obtained through the search. Other terms could have been chosen, such as ‘stability’, ‘resilience’, ‘reliability’, or ‘vulnerability’, which might have yielded different studies. Third, our review is, by its nature, limited temporally: it reflects the evidence available in the literature at a given time. Therefore, it does not include studies based on more recent machine learning methods such as foundation models, which have recently demonstrated impressive performance across various tasks. While these models are currently a hot topic, their use in healthcare—and particularly their robustness—remains in its early stages. A search (29 October 2024) using terms such as ‘foundation model’, ‘generalist medical artificial intelligence’, and ‘artificial general intelligence’ returned, across the three databases, only two studies related to foundation models that were eligible based on title and abstract, both published after the search date of this work. We believe that the robustness of foundation models will become a distinct area of study due to their unique characteristics (e.g., multi-modality, specific training strategies, finetuning, different evaluation methods).
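As one concrete instance of the model-side mitigation strategies for label noise mentioned in the first limitation above, the short PyTorch sketch below enables label smoothing in the training loss; the smoothing rate of 0.1 is an arbitrary illustrative value, and this is only one of many possible approaches.

```python
import torch.nn as nn

# Softened targets make the loss less sensitive to occasional mislabelled
# examples; 0.1 is an illustrative smoothing rate to be tuned per task.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
```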

Other works have focused on the robustness of machine learning models. A study published in 2020 explored various notions of robustness in healthcare but did not examine how these concepts varied with the type of data and predictive model used46. Another study, published in 2023, proposed a general definition of robustness from a causal perspective but did not focus on healthcare47. Moreover, neither study relied on a formalised review methodology. To our knowledge, this review presents the first classification of robustness concepts specifically tailored to machine learning in healthcare. It is important to note that different choices in our approach could have resulted in a different classification48.

We believe that our review constitutes a first step toward identifying appropriate strategies and frameworks for each of the proposed concepts of robustness, in order to improve the robustness of a machine learning model both during development and after deployment, when it is crucial to ensure that the model remains robust over time. Future work based on this study can take different directions: one focus could be on identifying the various methods developed to address and mitigate robustness issues in machine learning models for each of the concepts we identified. Another direction could involve stress testing different machine learning models against the concepts identified in this study to explore their robustness and determine which factors are more likely to impact model performance.

Given the discrepancy between the ideal environment in which a model is developed and the complex real-world setting characterised by numerous sources of perturbations and variations, establishing a robust framework to identify and mitigate these potential sources in healthcare is crucial. Our review provides a comprehensive overview of the different concepts of robustness that are addressed at various stages of an ML model’s life cycle in healthcare for decision support. The perception of what constitutes a perturbation or variation can differ based on the life cycle stage, type of data, predictive model, and the various stakeholders involved. Our review may help stakeholders navigate and comprehend the robustness of models deployed in healthcare settings, improving the reliability of these models in real-world applications.

Methods

This review was conducted and reported in accordance with the Joanna Briggs Institute methodology and the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR), detailed in Supplementary Table 649,50.

A study protocol was uploaded on Open Science Framework51.

Search strategy and selection criteria

Relevant studies for our review were identified in two ways. First, we searched three electronic databases: PubMed, IEEE Xplore, and Web of Science (search date: 1 March 2023). These three databases cover a broad range of machine learning literature in healthcare, making them suitable for our review. A search equation was developed with the assistance of a librarian (CS). Search terms such as “deep learning” and “machine learning” were selected to retrieve machine learning studies, as well as terms like “statistical learning” and “prediction model” to capture studies from the field of statistics. The detailed search equations for each database can be found in Supplementary Table 79. We did not set a limit on the start date when searching for eligible studies.

Additional studies were also sought in the grey literature. We retrieved the first 200 records returned by Google Scholar for the keywords “machine learning”, “robustness”, and “healthcare”, and we manually searched records available on the websites of relevant conferences and associated workshops. These include Datasets and Benchmarks, Machine Learning for Health, and Machine Learning for Healthcare.

Additional studies identified in the reference lists of included studies were also considered.

We included a record if it described a notion of robustness for a supervised or semi-supervised machine learning model developed for healthcare applications related to decision support, such as diagnosis, prognosis, or treatment recommendation. Thus, we included records on both classification and regression tasks.

Records about treatment recommendations were excluded if the notion tackled by the study was related to so-called “doubly-robust” methods52. Since the doubly-robust property is specific to estimators developed within a causal framework, we did not include them, as this robustness property would not be transposable to non-causal settings.

Records that used the term robust but for which the notion was vague or not well defined were also excluded.

We excluded records if information regarding either the predictive model used or the data used to develop the model was insufficient to allow data extraction.

Records focusing on adversarial attacks were excluded if the paper only proposed a novel method to design adversarial samples.

Only records written in English and with an abstract were considered.

Study selection

Retrieved studies were imported into the Covidence software for duplicate removal and screening.

Two reviewers (AB and CB) independently screened 200 randomly selected records. Any uncertainty during the title and abstract screening was discussed and resolved between the two reviewers, or by consulting a third person (RP) if necessary. Then, one reviewer (AB) screened the remaining records by title and abstract. Another reviewer (FB) checked 15% of the excluded records to assess the reliability of the screening.

Next, two reviewers (AB and CB) independently screened 50 randomly selected records for full text. Any discrepancies were resolved through discussion or by consulting a third person (RP) if no agreement was reached. The remaining full-text studies were screened by one reviewer (AB). A third reviewer (FB) checked 15% of the excluded records based on the rationale for their exclusion.

The study selection process was summarised in a PRISMA flow diagram (Fig. 1).

Data extraction

A data extraction form was developed on Google Sheets to collect general characteristics from the records, including title, nature of the record, corresponding author, and year of publication. Subsequently, for each study, we extracted information regarding the medical application, predictive model, data used to train the model, the notion of robustness addressed, and, if applicable, the use of specific metrics for assessing model robustness.

If multiple items were available for one of the characteristics listed above, each was extracted accordingly. If a study proposed and compared a machine learning model with other methods, only information on the proposed method was extracted. However, if a study evaluated the robustness of various models, details for each method were recorded.

Data extraction was initially performed by two reviewers (AB & CB) on 50 randomly selected records. Then, one reviewer (AB) completed the data extraction for all records. Subsequently, two additional reviewers (FB & OK) independently verified the extracted data for 15% of randomly selected articles each.

Data synthesis and analysis

The extracted information was categorised as follows: each medical application was linked to a specific medical specialty with the assistance of a medical doctor; each predictive model was categorised into a broader class of machine learning methods; and each dataset was mapped to a type of data. One reviewer (AB) then derived general concepts to categorise the different notions of robustness extracted from each record. The concepts were chosen to encompass the different stages of machine learning model development. They were then discussed and refined with another reviewer (CB), an expert in machine learning.