Abstract
Machine learning (ML) has increasingly been applied to predict opioid-related harms owing to its ability to handle complex interactions and generate actionable predictions. This review evaluated the types and quality of ML methods in opioid safety research, identifying 44 studies using supervised ML through searches of the Ovid MEDLINE, PubMed and SCOPUS databases. Commonly predicted outcomes included postoperative opioid use (n = 15, 34%), opioid overdose (n = 8, 18%), opioid use disorder (n = 8, 18%) and persistent opioid use (n = 5, 11%), with varying definitions. Most studies (96%) originated from North America, and only 7% reported external validation. Model performance was moderate to strong, but calibration was often missing (41%). Transparent reporting of model development was often incomplete, with key aspects such as calibration, imbalance correction, and handling of missing data frequently absent. Infrequent external validation limited the generalisability of current models. Addressing these aspects is critical for transparency, interpretability, and future implementation of the results.
Introduction
Opioids, a class of medications used to treat acute and chronic pain, are associated with persistent use, adverse events, unintentional overdoses, and deaths1. In 2020, nearly 70,000 opioid-related overdose deaths were reported in the United States2,3. Although opioid mortality rates in other countries have not reached these levels, the adverse consequences of prescription opioid use are increasing in countries such as Canada, Australia, and the United Kingdom, in parallel with increasing prescription use4,5,6,7. In response to the global impact of harmful opioid use on public health, international efforts have intensified to combat the opioid crisis through the development of effective prevention and treatment strategies8.
There has been growing interest in using machine learning (ML) to improve diagnosis, prognosis, and clinical decision-making, a trend largely driven by the widespread availability of large-volume data, such as electronic health records (EHRs), and advances in technology9. ML techniques have shown promise in handling large, nonlinear, and high-dimensional datasets and modelling complex clinical scenarios, offering flexibility over traditional statistical models10. However, despite their potential, ML applications in healthcare have not consistently led to improved patient outcomes10 and often fail to achieve notable clinical impact11,12. The increasing complexity of ML models such as gradient-boosted machines, random forests, and neural networks can reduce their transparency and interpretability, limiting their application within clinical practice13,14,15. Model development is often driven more by data availability than by clinical relevance16, with performance gains not always justifying the trade-off between accuracy and interpretability17. Although several ML models have been developed recently to predict opioid-related harms, their clinical utility remains uncertain. We conducted a systematic literature review to identify and summarise available ML prediction models pertaining to opioid-related harm outcomes and to assess the strengths, limitations, and risk of bias of these studies, providing an overview of current state-of-the-art ML methods in opioid drug safety research.
Results
Study selection
The search yielded 1315 studies (Fig. 1); 347 were identified as duplicates, and 894 were excluded after title/abstract screening. Seventy-four full-text articles were retrieved for full review. Among the excluded full-text articles, 11 did not use ML methods, six did not develop predictive models as their main objective, and three relied on data sources outside the scope of the review. The flow diagram outlining the study selection process, in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), is shown in Fig. 1.
Overview of included studies
All eligible studies were published between 2017 and 2023, with most published in 2021 (n = 11, 25%; Supplementary Table 1). Studies were mainly conducted with data from the United States (n = 39, 88.6%), with some originating in Canada18,19,20 (n = 3, 6.8%), Iran (n = 1, 2.3%)21, and Switzerland (n = 1, 2.3%)22. The majority used EHRs as the main data source (n = 31, 70%) rather than administrative data, typically with a retrospective cohort study design. External validation, vital for evaluating the generalisability of the models, was rare and was identified for only five studies23,24,25,26,27. For two models, external validation was conducted within the same study that developed the model, using a separate dataset26,27; for three models, we identified independent studies that aimed to validate their performance in a different setting23,25. The performance of the machine learning models assessed in the reviewed studies varied widely, with area under the receiver operating characteristic curve (AUC) values ranging from 0.6828 to 0.9629 (Table 1). We caution against direct comparison due to differences in training and testing settings across studies, including differences in the datasets used, outcomes measured, data partition sizes, and the specific models employed.
The reliability of risk predictions (calibration metrics) was not reported for a notable proportion of the reviewed studies (n = 18, 40.9%). These metrics are essential for evaluating how well the predicted probabilities align with the actual observed outcomes (Table 1).
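To make this concrete, the sketch below illustrates how the calibration metrics reported in this review (Brier score, calibration curve, and calibration intercept/slope) can be computed for a fitted binary classifier. It is a minimal illustration on synthetic data using scikit-learn, not code from any reviewed study; the intercept and slope here come from a simple logistic recalibration, whereas the intercept is strictly defined with the slope fixed at 1.

```python
# Minimal sketch (synthetic data): computing common calibration metrics
# for a binary risk model with scikit-learn. Not from any reviewed study.
import numpy as np
from scipy.special import logit
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
p = RandomForestClassifier(random_state=0).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Brier score: mean squared difference between predicted risk and outcome (lower is better).
print("Brier score:", brier_score_loss(y_test, p))

# Calibration curve: observed event rate within bins of predicted probability.
obs_rate, mean_pred = calibration_curve(y_test, p, n_bins=10)

# Approximate calibration intercept/slope: logistic regression of the outcome
# on logit(predicted risk); ideal values are intercept 0 and slope 1.
lp = logit(np.clip(p, 1e-6, 1 - 1e-6)).reshape(-1, 1)
recal = LogisticRegression(C=1e6).fit(lp, y_test)
print("intercept:", recal.intercept_[0], "slope:", recal.coef_[0, 0])
```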
Outcomes of interest of the reviewed studies
Within clinical prediction, diagnostic prediction models are used to estimate the probability of a disease that is already present, while prognostic models aim to assess the risk of future health conditions30. The largest category of studies in this review focused on prognostic models for postoperative opioid use, with 15 studies (34%)23,31,32,33,34,35,36,37,38,39,40,41,42,43,44. The majority of these studies examined hip or knee arthroscopy31,45,46 and spine surgery32,33,34. Other primary outcomes of prediction models included opioid overdose (n = 8, 18%)26,35,36,37,38,39,40,41, opioid use disorder (n = 8, 18%)20,29,42,43,44,47,48,49 and prolonged opioid use (with varying definitions, detailed in subsequent sections) (n = 5, 11%)22,25,50,51,52. Additionally, four studies (7%) used a composite outcome that included hospitalisations, emergency department visits, substance abuse and mortality19,27,53,54. A smaller subset of studies concentrated on other opioid-related harms as their main outcomes: two studies (5%) focused on opioid dependence55,56, one (2%) on mortality57, and one (2%) on seizure after tramadol overdose21.
In the following sections, we provide a detailed explanation of each identified category and an overall summary of the methods, data collection procedures, and statistical analyses used in these studies to examine their research questions. A full summary of the included studies, predictor variables (Supplementary Table 2) and numbers of participants/outcomes (Supplementary Table 3) is provided in the online supplementary information.
Prognostic models to predict opioid use after surgery
Fifteen studies (34%)23,24,28,31,32,33,34,45,46,58,59,60,61,62,63 were identified with the primary objective of developing a prognostic predictive model using machine learning to address postoperative opioid use.
All eligible studies were conducted within the last four years, with most developed in 2022 (n = 4/15, 27%)31,33,34,46 and the earliest from 2019 (n = 3/15, 20%)24,59,60. The studies included in this systematic review used various terms to describe outcomes related to postoperative opioid use, such as “prolonged,” “chronic,” “persistent,” “sustained,” and “extended” opioid use. In the absence of a universally agreed definition, which can yield different prevalence estimates64, these definitions varied and encompassed different metrics at various time points. Examples include any opioid prescription filled between 90 and 365 days after surgery, continued opioid use beyond a 3-month postoperative period, use extending up to 6 months, continued postoperative opioid use at specific intervals (14 to 20 days, 6 weeks, 3 months, 6 months), filling at least one opioid prescription more than 90 days after surgery, uninterrupted filling of opioid prescriptions for at least 90 to 180 days, and opioid consumption continuing for at least 150 days after surgery.
Datasets and Sampling: Thirteen studies (n = 13/15, 87%) used EHRs as the main data source (two from a military data repository), and the remaining two used insurance claims (n = 2/15, 13%). All the prediction models in these studies were developed with data from patients in the United States, and external validation could be identified for only two of the developed prediction models23,24 (both in Taiwanese cohorts)65,66. External validation remains to be performed in non-US and non-Taiwanese patient groups for all these models.
Sample sizes across the included studies varied substantially, ranging from 381 patients31 to 112,898 patients28 (mean = 13,209.6, median = 5,507). The overall number of outcome events was considerably smaller for most studies, with outcome incidence ranging from 4%23 to 41%46 (mean = 13%, median = 10%). Although the data used to develop the prediction models were imbalanced (outcome frequency below 20%) for thirteen studies in this group, only eight studies explicitly acknowledged this and addressed it, using techniques such as oversampling (n = 2/15, 13%)31,33 or by reporting the area under the precision-recall curve (AUPRC) (n = 5, 33%)23,60. AUPRC provides a more informative measure of classifier performance on imbalanced datasets than more common classification metrics such as AUC and accuracy, which can be misleading in this scenario. Many of these studies used only demographic and preoperative predictors to build their models (n = 8/15, 53%).
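As a hypothetical illustration of this point (synthetic data and scikit-learn; not from any reviewed study), the sketch below shows how both metrics are computed and why AUPRC should be read against the outcome prevalence rather than against 0.5:

```python
# Sketch: ROC-AUC vs. AUPRC on a synthetic imbalanced outcome (~5% positives).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

X, y = make_classification(n_samples=20000, weights=[0.95], flip_y=0.05, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC-AUC is insensitive to class prevalence; AUPRC (average precision) is
# benchmarked against prevalence, so it drops sharply when positives are rare.
print("ROC-AUC:", round(roc_auc_score(y_te, p), 3))
print("AUPRC :", round(average_precision_score(y_te, p), 3))
print("Prevalence (AUPRC baseline):", round(y_te.mean(), 3))
```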
Regarding missing data, only eight studies explained how they were handled. Most used multiple imputation with ‘missForest’ (n = 7/15, 46%), a popular Random Forest (RF)-based missing-data imputation package for the R software environment. One used the Multivariate Imputation by Chained Equations (MICE) package in R. While only two studies explicitly described how continuous variables were handled46,58, eight suggested that variables were kept as continuous in downstream analysis; for five studies this was left unclear (33%).
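missForest itself is an R package; for illustration, the sketch below shows a commonly used Python approximation of this family of approaches: scikit-learn’s IterativeImputer (a MICE-style chained-equations imputer) combined with a random-forest estimator. This is an analogue under stated assumptions, not the implementation used in any reviewed study.

```python
# Sketch: MICE-style iterative imputation (a missForest analogue) in scikit-learn,
# shown on toy data with ~10% of values set missing at random.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # inject missingness

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print("Remaining missing values:", np.isnan(X_imputed).sum())
```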
Most studies (n = 8/15, 53%) that developed prognostic models to predict opioid use after surgery used a group of five algorithms for predictive modelling, chosen for their ability to handle complex data and produce accurate predictions: Stochastic Gradient Boosting, RF, Support Vector Machine, Neural Network, and Elastic-net Penalised Logistic Regression. Other studies also incorporated XGBoost (n = 4/15, 27%) and LASSO (n = 4/15, 27%) algorithms in their analyses. The most common algorithms were RF (n = 12/15, 80%) and Elastic-net penalised logistic regression (n = 9/15, 60%). Values reported for the area under the receiver operating characteristic curve (AUC) ranged from 0.6828 to 0.9433. Notably, logistic regression, along with its regularised form, Elastic Net, was consistently reported to perform on par with ensemble methods such as random forests and gradient boosting machines (n = 7/15, 46%). Calibration metrics were reported for most studies (n = 13/15, 87%), including calibration plots, intercept, slope and Brier Score (ranging from 0.037 to 0.136).
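For orientation, the sketch below instantiates scikit-learn analogues of these five model families and compares them by cross-validated AUC on synthetic data. The hyperparameters are illustrative defaults, not those tuned by any reviewed study.

```python
# Sketch: scikit-learn analogues of the five model families named above,
# compared by 5-fold cross-validated AUC on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=3000, weights=[0.9], random_state=0)

models = {
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "Neural network": make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0)),
    "Elastic-net LR": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000),
    ),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```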
Patient subtypes: Most of the research on opioid use after orthopaedic surgery included patients who underwent a single specific surgical procedure, mainly arthroplasty (n = 3/15, 20%)31,45,46 or spine surgery (n = 3/15, 20%)23,24,61.
Prognostic models to predict opioid use disorder
Eight studies were identified20,29,42,43,44,47,48,49. All except one Canadian study20 used data from the United States. Sample size varied considerably, ranging from 130,12029 to 5,183,56644 (mean = 1,116,761, median = 361,527), with the percentage of outcome events ranging from 1% to 4% (mean = 2%, median = 2%); data in this category of studies were therefore highly imbalanced. Two studies used oversampling techniques to handle classification imbalance20,42 and three more reported AUPRC44,47,48. No external validation was identified for any of these models. Data-driven variable selection was mentioned in only two studies, which used the Andersen Behavioural Model42 and LASSO logistic regression alongside supervised machine learning algorithms44. In terms of performance, Gradient Boosting and neural network classifiers performed best, with AUC ranging from 0.8849 to 0.95929, but showed no overall performance benefit over logistic regression; Support Vector Machine (SVM) was the worst-performing algorithm47. Very few studies (n = 2/8, 25%) reported performing calibration as part of their analysis, and only one reported it appropriately, including calibration plots, intercept, slope, and Brier Score49.
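For illustration, a minimal sketch of random oversampling with the imbalanced-learn library follows (the tooling is an assumption; the reviewed studies do not report their exact implementations). A key detail is that resampling is applied to the training split only, so the test set retains the true outcome prevalence and performance estimates are not biased.

```python
# Sketch: random oversampling of the minority class with imbalanced-learn,
# applied only to the training split to avoid leaking resampled data into evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=10000, weights=[0.98], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

X_res, y_res = RandomOverSampler(random_state=2).fit_resample(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("AUPRC:", average_precision_score(y_te, model.predict_proba(X_te)[:, 1]))
```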
Prognostic models to predict opioid overdose
Eight studies were identified26,35,36,37,38,39,40,41. All the studies in this subgroup were conducted in the United States; four (n = 4/8, 50%) used EHRs and four used administrative and insurance claims data from sources such as Medicaid or Medicare (n = 4/8, 50%), with the majority developed in 2021 (n = 3/8, 38%). The population sample size ranged from 237,259 to 7,284,389, with the percentage of outcome events ranging from 0.04%37 to 1.36%26. Although data imbalance was directly addressed with class imbalance techniques in only three studies36,37,40, it was acknowledged by all studies, and additional performance metrics were reported, including AUPRC26,38,39. Only one study was externally validated26. The models developed in this category had high c-statistics, ranging from 0.8526 to 0.9535, and good prediction performance.
Prediction models for persistent opioid use
Five studies were identified22,25,50,51,52. These included three with an explicit chronic opioid use outcome, one focusing on long-term opioid use prediction, and one predicting progression from acute to chronic opioid use51. All were retrospective studies, with the oldest dating back to 201825. The population sample size ranged from 27,705 to 418,564 patients, with the percentage of outcome events ranging from 5%25 to 30%51. Only one study addressed class imbalance, using a down-sampling approach25, which risks discarding valuable information and potentially reduces model accuracy. All studies had internal validation, but only one was externally validated, using data from two additional healthcare organisations25 (Supplementary Table 4). The top-performing algorithms in terms of AUC were logistic regression (n = 3/5, 60%)25, an RF classifier (n = 1/5, 20%)52, and XGBoost (n = 1/5, 20%)51, with SVM performing worst (AUC = 0.72)51. One study also found that class balancing did not have a significant impact on performance for most models, despite the relatively rare outcome50. Only one study in this category (n = 1/5, 20%) reported and provided details of calibration for its analysed models52. Calibration details reported across studies are presented in Supplementary Table 5.
Prediction models for opioid dependence
Two studies were identified55,56. The sample size ranged from 102,166 to 199,273, with the percentage of outcome events ranging from 0.7%55 to 3.9%56. While both studies used large samples from US EHRs, Che et al. benefited from data from multiple hospitals and from access to a clear diagnosis of “opioid dependence” in the patient records55, whereas Ellis et al. had to rely on various forms of substance dependence (not exclusive to opioids) to define their outcome56. Both studies addressed the small number of outcomes with class imbalance techniques and showed good discrimination performance of 0.855 and 0.8756. Che et al. reported that deep learning solutions achieved superior classification results and outperformed other baseline methods. Neither study reported calibration measures, and neither was found to be externally validated.
Prognostic models to predict other opioid-related harms
Only a limited number of studies investigated opioid-related harms beyond those addressed above. Among these, four developed prognostic models using administrative health datasets to predict hospitalisation, emergency department visits, and mortality19,27,53,54. In one of these studies, simpler linear models achieved higher discrimination and overall performance19; in another, although the final XGBoost model had high discrimination, the calibration plot showed a consistent overestimation of risk53. Vunikili et al. used a retrospective patient cohort to build a model predicting opioid abuse and mortality; the XGBoost algorithm outperformed logistic regression for classifying patients susceptible to “opioid risks”54. Fouladvand et al. focused on predicting whether a person experienced opioid abuse, dependence, or an overdose event (opioid-related adverse outcomes) during the 6-month period after surgery; the best predictive performance was achieved by the RF model, with an AUC of 0.87, and the model was well calibrated, had good discrimination, and was externally validated using data from other US states27. Other studies in this category developed models to predict mortality risk after nonfatal opioid overdose (using gradient boosting machines)57 and seizure due to tramadol poisoning (using machine learning models that showed no significant performance improvement over logistic regression)21.
Risks of bias
In accordance with the Prediction Model Risk of Bias Assessment Tool (PROBAST) guidance, we systematically evaluated the risk of bias in the 44 identified studies across four key domains: participants, predictors, outcomes, and analysis (Table 2). We found that 16/44 studies had a high risk of bias in at least one domain and 19/44 were unclear in at least one domain due to lack of information. Only 9/44 studies were found to have a low risk of bias across all domains. A summary of the risk of bias assessment for machine learning algorithms grouped by type of outcome is presented in Supplementary Figs. 1–4.
Participants: Ten studies had a high (9/44) or unclear (1/44) risk of bias relating to participants, primarily due to the following issues: (1) the model was built using datasets from a single health centre; (2) the inclusion and exclusion criteria were not described in enough detail to be reproducible; and/or (3) there were large differences in demographics between the target population and the population on which the model was built.
Predictors: Among the studies, 5/44 lacked sufficient information on predictors, rendering bias assessment challenging. For 36 studies, the risk of bias was deemed low due to the comprehensive description of predictor definition, selection, assessment, and handling during analysis. Conversely, 3/44 studies exhibited a high risk of bias, primarily due to insufficient details for the reproducibility of the analysis.
Outcomes: In terms of outcome variables, 36/44 studies demonstrated low risk, 3/44 were unclear, and 5/44 exhibited a high risk of bias. The key contributing factor was the inconsistency between the studies’ stated objective of predicting opioid-related harms and the inclusion of outcomes not specific to opioid-related harms.
Analysis: Only 11/44 studies were evaluated to have a low risk of bias in their analysis, either by adhering to TRIPOD guidelines, following the Guidelines for Developing and Reporting Machine Learning Models in Biomedical research, or providing sufficient detail on the analysis methodology. Conversely, 28 studies lacked adequate information, leading to classification as unclear risk, while 5 studies exhibited a high risk of bias. Common shortcomings included inadequate information for reproducibility, such as missing data handling, class imbalance handling, hyperparameter description, tool/library versions, and non-availability of code (Table 3).
Public availability of the algorithms and models: Only 5/44 studies published the code for reproducing their results, and 4/44 indicated that it was available on request (Table 3).
Discussion
This systematic review summarises the extensive efforts of the international community to address the opioid crisis by developing ML prediction models. Comparing complex models with more interpretable ones is necessary to assess whether the trade-offs between performance and interpretability are justified17. Studies in this area have primarily been published in recent years and show promise in identifying patients at risk of opioid-related harms in North America (n = 36/44, including studies from the US n = 33/44 and Canada n = 3/44). However, our findings reveal specific methodological limitations, including poor transparency, inadequate disclosure of machine learning methodology, limited reproducibility, and biases in study design. The existing literature often relies on C-statistics alone as a performance metric, which may overestimate the advantages of prediction tools for rare outcomes. In a highly imbalanced dataset, as can be the case with opioid adverse events, the ROC curve can be overly optimistic. The ROC-AUC metric is calculated from the true positive rate and the false positive rate67. The true positive rate (also known as sensitivity or recall) is the proportion of actual positives correctly identified as such; the false positive rate is the proportion of actual negatives incorrectly identified as positives68. When negatives vastly outnumber positives, even a large absolute number of false positives yields a low false positive rate (equivalently, a high specificity, since specificity = 1 − false positive rate), artificially inflating the AUC and giving a false impression of high performance (AUC close to 1) even when the model is not effectively identifying the few positive cases. Considering additional metrics, such as the precision-recall curve, the F1 score, or others, could better reflect model performance in imbalanced scenarios69. The absence of calibration reporting in a substantial proportion of the examined studies (41%) raises concerns about the reliability of the reviewed prediction models. Details of internal and external validation for each study are included in Supplementary Table 4.
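A hypothetical worked example (numbers invented for exposition) makes this inflation explicit. With the standard definitions

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

suppose 100 of 100,000 patients experience the outcome and a model flags 1,050 patients, 50 of them correctly. Then FPR = 1,000/99,900 ≈ 0.01 (specificity ≈ 0.99), which looks excellent, while precision = 50/1,050 ≈ 0.048: over 95% of flagged patients are false alarms, a failure the ROC curve largely hides but the precision-recall curve exposes.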
Our review suggests that the ML models developed to predict opioid-related harms primarily showcase ML’s potential rather than being created for clinical application, restricting them to research use. Key barriers to the practical application of these prediction models include the lack of external validation, limited accessibility, and missing infrastructure to process risk scores and ensure the algorithms’ usability and effectiveness26. Given the importance of model interpretability for government policymaking, we suggest using explanatory algorithms such as Local Interpretable Model-Agnostic Explanations (LIME)70 and Shapley Additive Explanations (SHAP)71. Explanatory algorithms could allow clinicians and patients to understand the relationships between the variables in a model, making it more transparent and interpretable; only 36% of the reviewed studies (n = 16/44) included them in their analysis. However, these methods have limitations. Both LIME and SHAP provide insights based on the correlations learned by the model but do not offer causal explanations. These methods may also struggle with collinearity; for example, in the presence of highly correlated variables, SHAP values might be high for one variable and zero or very low for another72. Furthermore, calculating SHAP values can be computationally expensive, especially for large datasets or complex models. Lastly, the SHAP method treats the model as a “black box”, meaning it does not incorporate information about the model’s internal structure73.
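As a brief illustration of how such explanations are generated in practice, the sketch below applies SHAP’s TreeExplainer to a toy gradient-boosting model on synthetic data; real studies would explain their fitted opioid-harm models rather than this placeholder.

```python
# Sketch: post-hoc explanation of a tree-based model with the `shap` library.
# Toy model and data only; not from any reviewed study.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # efficient for tree ensembles
shap_values = explainer.shap_values(X)  # one additive attribution per patient and feature

# Global view: features ranked by mean absolute contribution to predicted risk.
shap.summary_plot(shap_values, X)
```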
Only five models reached the threshold of methodological quality, reproducibility, and external validation required to support use in clinical practice. Most studies (n = 39, 89%) did not report how results could be implemented clinically or externally validated. This limitation helps explain why deployment of ML models in real-world clinical settings remains uncommon11. The generalisability of the reviewed models is also limited by the use of data from restricted sources, such as a single centre21,62,63, a single geographic area26, or sources that do not fully reflect the population they aim to study58.
Sampling bias in data collection was a common problem in studies using insurance claims datasets (which do not fully capture the demographics most at risk of opioid-related harms) and in research based on military personnel data (which may focus on a younger, predominantly male population58). Some studies (n = 4, 9%) also revealed a risk of racial bias, with differences in the racial composition of the groups used to develop versus test the model62; such differences could particularly affect under-represented subgroups of patients and are important considerations in the development of ML algorithms74.
Predictor variables varied significantly across the studies that developed clinical prediction models, with several studies not providing a complete list of the variables used, which raises concerns about reproducibility. While demographic factors such as age and sex were commonly reported, only 15 out of 44 studies (34%) explicitly included socioeconomic status as a predictor in their models. The inconsistent reporting and inclusion of variables such as socioeconomic status may weaken the robustness of the ML models and limit their generalisability.
For the identified models that aimed to predict mortality, accuracy is questionable, since the cause of death was not always available and may not necessarily be related to opioid exposure57.
Authors generally selected model predictors based on theory or previous literature. There is an opportunity to explore the benefits of a purely data-driven approach and to compare its utility against models developed with user input43. The addition of opioid dosage (e.g. as morphine milligram equivalents per day) to future prediction models intended for clinical implementation is crucial. Even though opioid dose has been found to be associated with a higher likelihood of opioid-related harms in several studies43,75, fewer than half of the reviewed studies (n = 21, 44%) considered opioid dosage, often due to lack of data availability. Additionally, given the time-varying nature of opioid use, with discrete periods on and off the drug, transparent preparation of this type of data is especially important to avoid misclassification of opioid exposure, which can considerably affect the point estimates associated with adverse outcomes76. None of the reviewed models that predicted adverse opioid outcomes employed ML for time-to-event analysis, indicating a potential area for future development.
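To illustrate the dose standardisation referred to above, a minimal sketch of a daily MME calculation follows. The conversion factors are illustrative approximations of commonly published values and are an assumption of this sketch; any analytical or clinical use should rely on locally validated factors.

```python
# Sketch: converting a daily opioid dose to morphine milligram equivalents (MME).
# Conversion factors are illustrative approximations, not authoritative values.
MME_FACTORS = {
    "morphine": 1.0,
    "oxycodone": 1.5,
    "hydrocodone": 1.0,
    "codeine": 0.15,
    "tramadol": 0.1,
}

def daily_mme(drug: str, dose_mg_per_day: float) -> float:
    """Return the morphine milligram equivalent of a daily opioid dose."""
    return dose_mg_per_day * MME_FACTORS[drug]

# Example: 20 mg oxycodone twice daily -> 40 mg/day -> 60 MME/day.
print(daily_mme("oxycodone", 40))
```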
Recent advances in deep learning have demonstrated the strong predictive capabilities of transformer-based models, which could improve the performance of existing ML models in fields such as healthcare77. Originally developed for natural language processing, their ability to capture structure in human language could generalise to life-sequences, such as socio-economic and health data, for classification tasks. However, despite their ability to provide highly accurate predictions and often outperform state-of-the-art algorithms, transformer-based models are still nascent in clinical prediction modelling and present several challenges: highly complex architectures with multiple layers of attention mechanisms, vast numbers of parameters, high computational demands, the need for domain-specific adaptation, and the need for large, high-quality, well-curated datasets78.
We acknowledge the strengths and limitations of the current study. This review systematically evaluated and summarised the various applications of machine learning models in addressing prescription opioid-related harms in adults. Some authors may have performed model calibration, handled class imbalance, or addressed missing data without reporting it; in such cases we labelled the risk of bias as “unclear” to avoid overestimating problems. Where there is insufficient information on model development, performance, and calibration, it is difficult to interpret the validity of the results. This study complements the works of Garbin et al. and Emam et al. by focusing specifically on machine learning-based prognostic models that predict opioid-related harms and by assessing risk of bias using PROBAST. Related systematic reviews addressing the opioid epidemic have been published recently79,80, mainly focusing on postoperative opioid misuse, while other opioid-related harms have not undergone similar assessment. In addition, we report on specific areas of development that could be useful for future research building on the work done in this field thus far.
The application of machine learning in predicting opioid-related harms and identifying at-risk patients has shown promising results, offering valuable insights into the complex landscape of opioid use. Currently, most of these models have not been implemented in clinical practice; instead, they serve as research tools, illustrating the potential power of machine learning in enhancing our understanding of opioid-related harms. Key limitations of the current literature are the lack of transparent reporting and the limited external validation of the developed ML prediction models. In navigating the future integration of ML in opioid-related harm prediction, researchers should prioritise comprehensive validation efforts, ensuring that these models are robust, generalisable, and reported thoroughly enough to be reproducible.
Methods
Identification of studies
To identify relevant articles published from the inception of records until 12 October 2023, an all-time search was conducted using the Ovid MEDLINE, PubMed, and SCOPUS databases. The search was performed without restrictions on publication date, language, or study design to ensure all relevant studies were captured. To improve transparency, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist81. Using the Population, Intervention, Comparison, and Outcome (PICO) framework, we defined our search around different types of prescription opioids (population), a broad definition of ML (intervention), and opioid-related harms (outcome). We used a combination of MeSH terms, keywords, and Boolean operators to construct our search string (full search details are provided in Supplementary Table 6).
Inclusion criteria
We used a three-stage process to identify the studies to be included in this review (Fig. 1). Articles were eligible for full-text review if they detailed the construction of one or more machine learning-based prediction models focusing on a primary outcome directly associated with prescription opioid harms. We specifically sought studies addressing harms such as dependence, misuse, abuse, overdose, hospitalisations, and death. It is important to note that the terminology used to describe these harms varied across studies: terms such as chronic opioid use, long-term opioid use, and persistent opioid use may have been used interchangeably or with different definitions64. Opioid dependence, whilst sometimes used synonymously in the literature, is the adaptation to repeated exposure to some drugs and medicines, usually characterised by tolerance and/or withdrawal, and may be inevitable for those on long-term opioids82. Addiction relates to dependence with a compulsive preoccupation to seek and take an opioid despite consequences82. To ensure comprehensive coverage of opioid-related harms, we included studies that used a range of these terms and their synonyms (Supplementary Table 6). Regarding methodology, we considered a study to use supervised machine learning if it reported any statistical learning technique to predict outcomes of interest or categorise cases based on a known ground truth, regardless of the terminology used by the authors83,84. We excluded studies that developed models based solely on regression techniques. In addition, eligible studies had to predominantly utilise data from EHRs or patient administrative records.
Exclusion criteria
We excluded studies from our review if they met any of the following criteria: 1) studies that primarily focused on risk factor identification without a clear emphasis on predictive modelling; 2) studies utilising ML to process and analyse human language, enhance the reading of images, or understand user-generated text; 3) studies employing genetic traits or molecular markers as predictive factors; 4) studies that were systematic reviews, letters, conference abstracts or commentaries; 5) studies relying on data sources other than EHRs or administrative patient data (e.g. public domain data such as social media, internet searches, and surveys); 6) studies not concentrating on human subjects or those specifically targeting paediatric populations (patients younger than 18 years old). Full inclusion and exclusion criteria can be found in Table 4.
Screening process used to select studies for inclusion
All abstracts were screened by C.R. with uncertainties being resolved by consensus with M.J. The full text of selected abstracts was assessed for eligibility by C.R., with the supervision of M.J.
Data extraction, quality assessment and analysis
Data were extracted using the “CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies” (CHARMS) checklist, and bias assessment was performed using the Prediction model Risk of Bias Assessment (PROBAST) tool. We followed the standardised checklist designed by B.M. Fernandez-Felix et al.85. The data collected for each study included study design, sample size, number of outcome events, main data sources, machine learning algorithms used, the best- and worst-performing algorithms based on C-statistic, outcomes measured and their definitions, any mention of class imbalance (imbalance between the frequency of outcome events and nonevents)86 and missingness, calibration measures reported, and details of internal and external validation. The full list of extraction items can be found in Supplementary Table 6. To identify external validation studies, an all-time search was conducted on October 12, 2023, using the Ovid MEDLINE, PubMed, and SCOPUS databases (full search details are provided in Supplementary Table 6).
Data availability
All data generated or analysed during this study are included in this published article and its supplementary information. All study materials are available from the corresponding author upon reasonable request.
References
Skolnick, P. The Opioid Epidemic: Crisis and Solutions. Annu Rev. Pharm. Toxicol. 58, 143–159 (2018).
NIH National Institute on Drug Abuse. Drug Overdose Deaths: Facts and Figures. https://nida.nih.gov/research-topics/trends-statistics/overdose-death-rates (2024).
Volkow, N. D. & Blanco, C. The changing opioid crisis: development, challenges and opportunities. Mol. Psychiatry 26, 218–233 (2021).
Moore, E. M., Warshawski, T., Jassemi, S., Charles, G. & Vo, D. X. Time to act: Early experience suggests stabilization care offers a feasible approach for adolescents after acute life-threatening opioid toxicity. Paediatr. Child Health 27, 260–264 (2022).
Gardner, E. A., McGrath, S. A., Dowling, D. & Bai, D. The Opioid Crisis: Prevalence and Markets of Opioids. Forensic Sci. Rev. 34, 43–70 (2022).
Alenezi, A., Yahyouche, A. & Paudyal, V. Current status of opioid epidemic in the United Kingdom and strategies for treatment optimisation in chronic pain. Int J. Clin. Pharm. 43, 318–322 (2021).
Jani, M., Birlie Yimer, B., Sheppard, T., Lunt, M. & Dixon, W. G. Time trends and prescribing patterns of opioid drugs in UK primary care patients with non-cancer pain: A retrospective cohort study. PLOS Med. 17, e1003270 (2020).
Humphreys, K. et al. Responding to the opioid crisis in North America and beyond: recommendations of the Stanford-Lancet Commission. Lancet 399, 555–604 (2022).
Roehrs, A., da Costa, C. A., Righi, R. D. & de Oliveira, K. S. Personal Health Records: A Systematic Literature Review. J. Med Internet Res 19, e13 (2017).
Dhiman, P. et al. Methodological conduct of prognostic prediction models developed using machine learning in oncology: a systematic review. BMC Med. Res. Methodol. 22, 101 (2022).
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
Wynants, L. et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 369, m1328 (2020).
Wang, F., Kaushal, R. & Khullar, D. Should Health Care Demand Interpretable Artificial Intelligence or Accept “Black Box” Medicine? Ann. Intern Med. 172, 59–60 (2020).
Lipton, Z. The mythos of model interpretability. Commun. ACM 61, 36–43 (2018).
Ciobanu-Caraus, O. et al. A critical moment in machine learning in medicine: on reproducible and interpretable learning. Acta Neurochirurgica 166, 14 (2024).
Varoquaux, G. & Cheplygina, V. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digit Med. 5, 48 (2022).
Christodoulou, E. et al. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 110, 12–22 (2019).
Sharma, B. et al. Publicly available machine learning models for identifying opioid misuse from the clinical notes of hospitalized patients. BMC Med. Inform. Decision Making 20, 79 (2020).
Sharma, V., Kulkarni, V., Eurich, D. T., Kumar, L. & Samanani, S. Safe opioid prescribing: a prognostic machine learning approach to predicting 30-day risk after an opioid dispensation in Alberta, Canada. BMJ open 11, e043964 (2021).
Liu, Y. S. et al. Individualized prospective prediction of opioid use disorder. Can. J. Psychiatry 68, 54–63 (2023).
Behnoush, B. et al. Machine learning algorithms to predict seizure due to acute tramadol poisoning. Hum. Exp. Toxicol. 40, 1225–1233 (2021).
Held, U. et al. Development and internal validation of a prediction model for long-term opioid use-an analysis of insurance claims data. Pain 165, 44–53 (2024).
Karhade, A. V. et al. Predicting prolonged opioid prescriptions in opioid-naïve lumbar spine surgery patients. Spine J. 20, 888–895 (2020).
Karhade, A. V. et al. Development of machine learning algorithms for prediction of prolonged opioid prescription after surgery for lumbar disc herniation. Spine J. 19, 1764–1771 (2019).
Calcaterra, S. L. et al. Prediction of Future Chronic Opioid Use Among Hospitalized Patients. J. Gen. Intern. Med. 33, 898–905 (2018).
Lo-Ciganic, W.-H. et al. Developing and validating a machine-learning algorithm to predict opioid overdose in Medicaid beneficiaries in two US states: a prognostic modelling study. Lancet Digi. Health 4, e455–e465 (2022).
Fouladvand, S. et al. A Comparative Effectiveness Study on Opioid Use Disorder Prediction Using Artificial Intelligence and Existing Risk Models. IEEE J. Biomed. Health Inform. 27, 3589–3598 (2023).
Hur, J. et al. Predicting postoperative opioid use with machine learning and insurance claims in opioid-naïve patients. Am. J. Surg. 222, 659–665 (2021).
Segal, Z. et al. Development of a machine learning algorithm for early detection of opioid use disorder. Pharmacol. Res. Perspect. 8, e00669 (2020).
van Smeden, M., Reitsma, J. B., Riley, R. D., Collins, G. S. & Moons, K. G. Clinical prediction models: diagnosis versus prognosis. J. Clin. Epidemiol. 132, 142–145 (2021).
Lu, Y. et al. Machine-learning model successfully predicts patients at risk for prolonged postoperative opioid use following elective knee arthroscopy. Knee Surg. Sports Traumatol. Arthrosc. 30, 762–772 (2022).
Katakam, A., Karhade, A. V., Schwab, J. H., Chen, A. F. & Bedair, H. S. Development and validation of machine learning algorithms for postoperative opioid prescriptions after TKA. J. Orthop. 22, 95–99 (2020).
Gabriel, R. A. et al. Machine learning approach to predicting persistent opioid use following lower extremity joint arthroplasty. Reg. Anesthesia pain. Med. 47, 313–319 (2022).
Klemt, C. et al. Machine learning algorithms predict extended postoperative opioid use in primary total knee arthroplasty. Knee Surg., Sports Traumatol., Arthrosc. 30, 2573–2581 (2022).
Dong, X. et al. Predicting opioid overdose risk of patients with opioid prescriptions using electronic health records based on temporal deep learning. J. Biomed. Inform. 116, 103725 (2021).
Dong, X. et al. Machine Learning Based Opioid Overdose Prediction Using Electronic Health Records. AMIA Annu. Symp. Proc. 2019, 389–398 (2019).
Gellad, W. F. et al. Development and validation of an overdose risk prediction tool using prescription drug monitoring program data. Drug Alcohol Depend. 246, 109856 (2023).
Lo-Ciganic, W.-H. et al. Integrating human services and criminal justice data with claims data to predict risk of opioid overdose among Medicaid beneficiaries: A machine-learning approach. PloS one 16, e0248360 (2021).
Lo-Ciganic, W.-H. et al. Evaluation of Machine-Learning Algorithms for Predicting Opioid Overdose Risk Among Medicare Beneficiaries With Opioid Prescriptions. JAMA Netw. Open 2, e190968 (2019).
Ripperger, M. et al. Ensemble learning to predict opioid-related overdose using statewide prescription drug monitoring program and hospital discharge data in the state of Tennessee. J. Am. Med. Inform. Assoc.: JAMIA 29, 22–32 (2021).
Sun, J. W. et al. Predicting overdose among individuals prescribed opioids using routinely collected healthcare utilization data. PloS one 15, e0241083 (2020).
Annis, I. E., Jordan, R. & Thomas, K. C. Quickly identifying people at risk of opioid use disorder in emergency departments: trade-offs between a machine learning approach and a simple EHR flag strategy. BMJ open 12, e059414 (2022).
Banks, T. J., Nguyen, T. D., Uhlmann, J. K., Nair, S. S. & Scherrer, J. F. Predicting opioid use disorder before and after the opioid prescribing peak in the United States: A machine learning tool using electronic healthcare records. Health Inform. J. 29, 14604582231168826 (2023).
Dong, X. et al. Identifying risk of opioid use disorder for patients taking opioid medications with deep learning. J. Am. Med. Inform. Assoc. : JAMIA 28, 1683–1693 (2021).
Kunze, K. N., Polce, E. M., Alter, T. D. & Nho, S. J. Machine Learning Algorithms Predict Prolonged Opioid Use in Opioid-Naïve Primary Hip Arthroscopy Patients. J. Am. Acad. Orthop. Surg. Glob. Res. Rev. 5, 00093–00098 (2021).
Grazal, C. F. et al. A machine-learning algorithm to predict the likelihood of prolonged opioid use following arthroscopic hip surgery. Arthrosc. 38, 839–847.e832 (2022).
Gao, W., Leighton, C., Chen, Y., Jones, J. & Mistry, P. Predicting opioid use disorder and associated risk factors in a medicaid managed care population. Am. J. Managed Care 27, 148–154 (2021).
Kashyap, A., Callison-Burch, C. & Boland, M. R. A deep learning method to detect opioid prescription and opioid use disorder from electronic health records. Int. J. Med. Inform. 171, 104979 (2023).
Lo-Ciganic, W.-H. et al. Using machine learning to predict risk of incident opioid use disorder among fee-for-service Medicare beneficiaries: A prognostic study. PloS one 15, e0235981 (2020).
Bjarnadottir, M. V., Anderson, D. B., Agarwal, R. & Nelson, D. A. Aiding the prescriber: developing a machine learning approach to personalized risk modeling for chronic opioid therapy amongst US Army soldiers. Health Care Manag. Sci. 25, 649–665 (2022).
Johnson, D. G. et al. Prescription quantity and duration predict progression from acute to chronic opioid use in opioid-naive Medicaid patients. PLOS Digit. Health 1, https://doi.org/10.1371/journal.pdig.0000075 (2022).
Mohl, J. T. et al. Predicting Chronic Opioid Use Among Patients With Osteoarthritis Using Electronic Health Record Data. Arthritis Care Res. 75, 1511–1518 (2023).
Sharma, V. et al. Development and Validation of a Machine Learning Model to Estimate Risk of Adverse Outcomes Within 30 Days of Opioid Dispensation. JAMA Netw. Open 5, e2248559 (2022).
Vunikili, R. et al. Predictive modelling of susceptibility to substance abuse, mortality and drug-drug interactions in opioid patients. Front. Artif. Intell. 4, 742723 (2021).
Che, Z., St Sauver, J., Liu, H. & Liu, Y. Deep Learning Solutions for Classifying Patients on Opioid Use. AMIA Annu. Symp. Proc. 2017, 525–534 (2017).
Ellis, R. J., Wang, Z., Genes, N. & Ma’ayan, A. Predicting opioid dependence from electronic health records with machine learning. BioData Min. 12, 3 (2019).
Guo, J. et al. Predicting Mortality Risk After a Hospital or Emergency Department Visit for Nonfatal Opioid Overdose. J. Gen. Intern. Med. 36, 908–915 (2021).
Anderson, A. B. et al. Can Predictive Modeling Tools Identify Patients at High Risk of Prolonged Opioid Use after ACL Reconstruction? Clin. Orthop. Relat. Res. 478, 00–1618 (2020).
Karhade, A. V., Schwab, J. H. & Bedair, H. S. Development of Machine Learning Algorithms for Prediction of Sustained Postoperative Opioid Prescriptions After Total Hip Arthroplasty. J. Arthroplast. 34, 2272–2277.e2271 (2019).
Karhade, A. V. et al. Machine learning for prediction of sustained opioid prescription after anterior cervical discectomy and fusion. Spine J. 19, 976–983 (2019).
Zhang, Y. et al. A predictive-modeling based screening tool for prolonged opioid use after surgical management of low back and lower extremity pain. Spine J. 20, 1184–1195 (2020).
Giladi, A. M. et al. Patient-Reported Data Augment Prediction Models of Persistent Opioid Use after Elective Upper Extremity Surgery. Plast. Reconstruct. Surg. 152, 358e–366e (2023).
Baxter, N. B. et al. Predicting persistent opioid use after hand surgery: a machine learning approach. Plast. Reconstr. Surg. 54, 573–580 (2024).
Huang, Y.-T., Jenkins, D. A., Peek, N., Dixon, W. G. & Jani, M. High frequency of long-term opioid use among patients with rheumatic and musculoskeletal diseases initiating opioids for the first time. Ann. Rheumatic Dis. 82, 1116–1117 (2023).
Chen, S.-F. et al. External validation of machine learning algorithm predicting prolonged opioid prescriptions in opioid-naive lumbar spine surgery patients using a Taiwanese cohort. J. Formos. Med. Assoc. https://doi.org/10.1016/j.jfma.2023.06.027 (2023).
Yen, H.-K. et al. A machine learning algorithm for predicting prolonged postoperative opioid prescription after lumbar disc herniation surgery. An external validation study using 1,316 patients from a Taiwanese cohort. Spine J. 22, 1119–1130 (2022).
Nahm, F. S. Receiver operating characteristic curve: overview and practical use for clinicians. Korean J. Anesthesiol. 75, 25–36 (2022).
Monaghan, T. F. et al. Foundational statistical principles in medical research: sensitivity, specificity, positive predictive value, and negative predictive value. Medicina (Kaunas) 57, https://doi.org/10.3390/medicina57050503 (2021).
Movahedi, F., Padman, R. & Antaki, J. F. Limitations of receiver operating characteristic curve on imbalanced data: Assist device mortality risk scores. J. Thorac. Cardiovasc Surg. 165, 1433–1442.e1432 (2023).
Shin, J. Feasibility of local interpretable model-agnostic explanations (LIME) algorithm as an effective and interpretable feature selection method: comparative fNIRS study. Biomed. Eng. Lett. 13, 689–703 (2023).
Rodríguez-Pérez, R. & Bajorath, J. Interpretation of machine learning models using shapley values: application to compound potency and multi-target activity predictions. J. Comput Aided Mol. Des. 34, 1013–1026 (2020).
Durgia, C. Using SHAP for Explainability — Understand these Limitations First, https://towardsdatascience.com/using-shap-for-explainability-understand-these-limitations-first-1bed91c9d21 (2021).
Huang, X. & Marques-Silva, J. On the failings of Shapley values for explainability. Int. J. Approx. Reason. 171, 109112 (2024).
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 447–453 (2019).
Goesling, J. et al. Trends and predictors of opioid use after total knee and total hip arthroplasty. Pain 157, 1259–1265 (2016).
Jani, M. et al. “Take up to eight tablets per day”: Incorporating free-text medication instructions into a transparent and reproducible process for preparing drug exposure data for pharmacoepidemiology. Pharmacoepidemiol Drug Saf. 32, 651–660 (2023).
Savcisens, G. et al. Using sequences of life-events to predict human lives. Nat. Comput. Sci. 1–14 (2023).
Denecke, K., May, R. & Rivera-Romero, O. Transformer Models in Healthcare: A Survey and Thematic Analysis of Potentials, Shortcomings and Risks. J. Med Syst. 48, 23 (2024).
Garbin, C., Marques, N. & Marques, O. Machine learning for predicting opioid use disorder from healthcare data: A systematic review. Comput. methods Prog. Biomed. 236, 107573 (2023).
Emam, O. S. et al. Machine learning algorithms predict long-term postoperative opioid misuse: a systematic review. Am Surg. 90, 140–151 (2024).
Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372, n71 (2021).
Taylor S. et al. Dependence and withdrawal associated with some prescribed medicines: An evidence review. Public Health England (2019) https://assets.publishing.service.gov.uk/media/5fc658398fa8f5474c800149/PHE_PMR_report_Dec2020.pdf.
Andaur Navarro, C. L. et al. Systematic review finds “spin” practices and poor reporting standards in studies on machine learning-based prediction models. J. Clin. Epidemiol. 158, 99–110 (2023).
Pruneski, J. A. et al. Supervised machine learning and associated algorithms: applications in orthopedic surgery. Knee Surg. Sports Traumatol. Arthrosc. 31, 1196–1202 (2023).
Fernandez-Felix, B. M., López-Alcalde, J., Roqué, M., Muriel, A. & Zamora, J. CHARMS and PROBAST at your fingertips: a template for data extraction and risk of bias assessment in systematic reviews of predictive models. BMC Med. Res. Methodol. 23, 44 (2023).
Japkowicz, N. & Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 6, 429–449 (2002).
Acknowledgements
This work was funded by the National Institute for Health and Care Research (NIHR) [NIHR301413]. M.J. is funded by an NIHR Advanced Fellowship [NIHR301413]. The views expressed in this publication are those of the authors and not necessarily those of the NIHR, NHS or the UK Department of Health and Social Care. The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Author information
Contributions
Study Design and Execution: M.J., and C.R. conceptualized the study. C.R, M.J. and D.A.J. collaboratively designed the study workflow. C.R. conducted formal analysis, including identifying eligible studies, performing data collection and assessing quality of the studies. Writing and Editing: C.R. drafted the initial manuscript, and M.J., D.A.J., C.R., and J.A. actively participated in review, editing, and finalization. Supervision: M.J. provided overall supervision and guidance throughout the research and writing process. All authors read and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Ramírez Medina, C.R., Benitez-Aurioles, J., Jenkins, D.A. et al. A systematic review of machine learning applications in predicting opioid associated adverse events. npj Digit. Med. 8, 30 (2025). https://doi.org/10.1038/s41746-024-01312-4