Out-of-distribution generalization from labelled and unlabelled gene expression data for drug response prediction

A preprint version of the article is available at bioRxiv.

Abstract

Data discrepancy between preclinical and clinical datasets poses a major challenge for accurate drug response prediction based on gene expression data. Different methods of transfer learning have been proposed to address such data discrepancy in drug response prediction for different cancers. These methods generally use cell lines as source domains, and patients, patient-derived xenografts or other cell lines as target domains; however, it is assumed that the methods have access to the target domain during training or fine-tuning, and they can only take labelled source domains as input. The former is a strong assumption that is not satisfied during deployment of these models in the clinic, whereas the latter means these methods rely on labelled source domains that are of limited size. To avoid these assumptions, we formulate drug response prediction in cancer as an out-of-distribution generalization problem, which does not assume that the target domain is accessible during training. Moreover, to exploit unlabelled source domain data, which tends to be much more plentiful than labelled data, we adopt a semi-supervised approach. We propose Velodrome, a semi-supervised method of out-of-distribution generalization that takes labelled and unlabelled data from different resources as input and makes generalizable predictions. Velodrome achieves this goal by introducing an objective function that combines a supervised loss for accurate prediction, an alignment loss for generalization and a consistency loss to incorporate unlabelled samples. Our experimental results demonstrate that Velodrome outperforms state-of-the-art pharmacogenomics and transfer learning baselines on cell lines, patient-derived xenografts and patients. Finally, we show that Velodrome models generalize to different tissue types that were well-represented, under-represented or completely absent in the training data. Overall, our results suggest that Velodrome may guide precision oncology more accurately.
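
As a rough illustration of how a three-term objective of this kind can be assembled, the sketch below combines a supervised loss, an alignment loss and a consistency loss into one scalar. It is a minimal sketch and not the authors' implementation: the abstract does not specify the form of each term, so the mean squared error, the covariance-matching alignment term, the prediction-disagreement consistency term and the weights `lambda_align` and `lambda_cons` are all assumptions made here for illustration.

```python
from statistics import fmean


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences of numbers."""
    return fmean((p - t) ** 2 for p, t in zip(predictions, targets))


def covariance(features):
    """Sample covariance matrix of row-wise feature vectors (list of lists)."""
    n, d = len(features), len(features[0])
    means = [fmean(row[j] for row in features) for j in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in features) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def alignment_loss(features_a, features_b):
    """Assumed alignment term: squared Frobenius distance between the
    feature covariance matrices of two source domains."""
    cov_a, cov_b = covariance(features_a), covariance(features_b)
    d = len(cov_a)
    return sum((cov_a[i][j] - cov_b[i][j]) ** 2 for i in range(d) for j in range(d))


def consistency_loss(predictions_1, predictions_2):
    """Assumed consistency term: disagreement between two predictors
    evaluated on the same unlabelled samples."""
    return mse(predictions_1, predictions_2)


def total_objective(supervised, alignment, consistency,
                    lambda_align=1.0, lambda_cons=1.0):
    """Weighted sum of the three terms; the weights are hypothetical
    hyperparameters, not values taken from the paper."""
    return supervised + lambda_align * alignment + lambda_cons * consistency
```

In this sketch the supervised term would be computed with `mse` on labelled samples only, while `consistency_loss` needs no labels, which is what lets unlabelled samples contribute to training. The alignment term is zero when two domains have identical feature covariances.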

Fig. 1: Schematic of the Velodrome method with three source domains (two labelled and one unlabelled).
Fig. 2: Comparisons between Velodrome and state-of-the-art drug response prediction methods.
Fig. 3
Fig. 4: Comparisons of Velodrome predictions to the baseline correlation in terms of Pearson and Spearman correlations.

Data availability

All the final preprocessed data used in this paper are publicly available at https://zenodo.org/record/4793442#.YK1HVqhKiUk (ref. 76). All the raw data before preprocessing are also publicly available, as follows: (1) cell-line datasets with gene expression and drug response data, including CTRPv2, GDSCv2 and gCSI, were downloaded from ORCESTRA (ref. 69); (2) TCGA cohorts with gene expression data were downloaded from Firehose (http://gdac.broadinstitute.org/) on 28 January 2016, and drug response data for the TCGA cohorts were obtained from ref. 39; (3) PDX datasets (gene expression with drug response data) were obtained from the Supplementary Information of ref. 3; (4) patient datasets (gene expression with drug response data) were obtained under the accession codes GSE25065 (Docetaxel and Paclitaxel) and GSE33072 (Erlotinib). Source data are provided with this paper.

Code availability

All the code, model objects and supplementary material used to run and reproduce our experimental results are publicly available at https://github.com/hosseinshn/Velodrome (ref. 77). We also provide a conda environment to ensure version compatibility for future users.

References

  1. Marquart, J., Chen, E. Y. & Prasad, V. Estimation of the percentage of US patients with cancer who benefit from genome-driven oncology. JAMA Oncol. 4, 1093–1098 (2018).

  2. Pal, S. K. et al. Clinical cancer advances 2019: annual report on progress against cancer from the American Society of Clinical Oncology. J. Clin. Oncol. 37, 834–849 (2019).

  3. Gao, H. et al. High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. Nat. Med. 21, 1318–1325 (2015).

  4. Garnett, M. J. et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, 570–575 (2012).

  5. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).

  6. Basu, A. et al. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell 154, 1151–1161 (2013).

  7. Seashore-Ludlow, B. et al. Harnessing connectivity in a large-scale small-molecule sensitivity dataset. Cancer Discov. 5, 1210–1223 (2015).

  8. Klijn, C. et al. A comprehensive transcriptional portrait of human cancer cell lines. Nat. Biotechnol. 33, 306–312 (2015).

  9. Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740–754 (2016).

  10. Haverty, P. M. et al. Reproducible pharmacogenomic profiling of cancer cell line panels. Nature 533, 333–337 (2016).

  11. Mourragui, S., Loog, M., van de Wiel, M. A., Reinders, M. J. T. & Wessels, L. F. A. PRECISE: a domain adaptation approach to transfer predictors of drug response from pre-clinical models to tumors. Bioinformatics 35, i510–i519 (2019).

  12. Sharifi-Noghabi, H., Peng, S., Zolotareva, O., Collins, C. C. & Ester, M. AITL: adversarial inductive transfer learning with input and output space adaptation for pharmacogenomics. Bioinformatics 36, i380–i388 (2020).

  13. Haibe-Kains, B. et al. Inconsistency in large pharmacogenomic studies. Nature 504, 389–393 (2013).

  14. Mpindi, J. P. et al. Consistency in drug response profiling. Nature 540, E5–E6 (2016).

  15. Geeleher, P., Gamazon, E. R., Seoighe, C., Cox, N. J. & Huang, R. S. Consistency in large pharmacogenomic studies. Nature 540, E1–E2 (2016).

  16. Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).

  17. Neyshabur, B., Sedghi, H. & Zhang, C. What is being transferred in transfer learning? In 34th Conference on Neural Information Processing Systems (NeurIPS, 2020).

  18. Raghu, M. et al. Transfusion: understanding transfer learning for medical imaging. In 33rd Conference on Neural Information Processing Systems (eds Wallach, H. et al.) 3347–3357 (Curran Associates, 2019).

  19. Hu, J. et al. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis. Nat. Mach. Intell. 2, 607–618 (2020).

  20. Sharifi-Noghabi, H., Zolotareva, O., Collins, C. C. & Ester, M. MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics 35, i501–i509 (2019).

  21. Snow, O. et al. Interpretable drug response prediction using a knowledge-based neural network. In Proc. 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (2021).

  22. Kuenzi, B. M. et al. Predicting drug response and synergy using a deep learning model of human cancer cells. Cancer Cell 38, 672–684.e6 (2020).

  23. Mourragui, S. et al. Predicting clinical drug response from model systems by non-linear subspace-based transfer learning. Preprint at https://www.biorxiv.org/content/10.1101/2020.06.29.177139v3 (2020).

  24. Ma, J. et al. Few-shot learning creates predictive models of drug response that translate from high-throughput screens to individual patients. Nat. Cancer 2, 233–244 (2021).

  25. Zhu, Y. et al. Ensemble transfer learning for the prediction of anti-cancer drug response. Sci. Rep. 10, 18040 (2020).

  26. Salvadores, M., Fuster-Tormo, F. & Supek, F. Matching cell lines with cancer type and subtype of origin via mutational, epigenomic, and transcriptomic patterns. Sci. Adv. 6, aba1862 (2020).

  27. Najgebauer, H. et al. CELLector: genomics-guided selection of cancer in vitro models. Cell Syst. 10, 424–432.e6 (2020).

  28. Peres da Silva, R., Suphavilai, C. & Nagarajan, N. TUGDA: task uncertainty guided domain adaptation for robust generalization of cancer drug response prediction from in vitro to in vivo settings. Bioinformatics 37, i76–i83 (2021).

  29. Warren, A. et al. Global computational alignment of tumor and cell line transcriptional profiles. Nat. Commun. 12, 22 (2021).

  30. Gulrajani, I. & Lopez-Paz, D. In search of lost domain generalization. In International Conference on Learning Representations (2021).

  31. Wang, J. et al. Generalizing to unseen domains: a survey on domain generalization. In Proc. Thirtieth International Joint Conference on Artificial Intelligence (2021).

  32. Zhou, K., Liu, Z., Qiao, Y., Xiang, T. & Loy, C. C. Domain generalization: a survey. Preprint at https://arxiv.org/abs/2103.02503 (2021).

  33. Zhang, H. et al. An empirical framework for domain generalization in clinical settings. In Proc. Conference on Health, Inference, and Learning (ACM, 2021); https://doi.org/10.1145/3450439.3451878

  34. Zhao, S., Gong, M., Liu, T., Fu, H. & Tao, D. Domain generalization via entropy regularization. In 34th Conference on Neural Information Processing Systems (NeurIPS, 2020).

  35. Wang, Z., Loog, M. & van Gemert, J. Respecting domain relations: hypothesis invariance for domain generalization. In 2020 25th International Conference on Pattern Recognition 9756–9763 (ICPR, 2021).

  36. Cancer Genome Atlas Research Network et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).

  37. Schwartz, L. H. et al. RECIST 1.1—update and clarification: from the RECIST committee. Eur. J. Cancer 62, 132–137 (2016).

  38. Hatzis, C. et al. A genomic predictor of response and survival following taxane-anthracycline chemotherapy for invasive breast cancer. JAMA 305, 1873–1881 (2011).

  39. Ding, Z., Zu, S. & Gu, J. Evaluating the molecule-based prediction of clinical drug responses in cancer. Bioinformatics 32, 2891–2895 (2016).

  40. Tarvainen, A. & Valpola, H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In 31st Conference on Neural Information Processing Systems (2017).

  41. Yang, Y. & Xu, Z. Rethinking the value of labels for improving class-imbalanced learning. In Conference on Neural Information Processing Systems (2020).

  42. Geeleher, P. et al. Discovering novel pharmacogenomic biomarkers by imputing drug response in cancer patients from large genomics studies. Genome Res. 27, 1743–1751 (2017).

  43. Noghabi, H. S. et al. Drug sensitivity prediction from cell line-based pharmacogenomics data: guidelines for developing machine learning models. Brief. Bioinform. https://doi.org/10.1093/bib/bbab294 (2021).

  44. Renner, W., Langsenlehner, U., Krenn-Pilko, S., Eder, P. & Langsenlehner, T. BCL2 genotypes and prostate cancer survival. Strahlenther. Onkol. 193, 466–471 (2017).

  45. Chaudhary, K. S., Abel, P. D. & Lalani, E. N. Role of the Bcl-2 gene family in prostate cancer progression and its implications for therapeutic intervention. Environ. Health Perspect. 107, 49–57 (1999).

  46. Paraf, F., Gogusev, J., Chrétien, Y. & Droz, D. Expression of Bcl-2 oncoprotein in renal cell tumours. J. Pathol. 177, 247–252 (1995).

  47. Bhat, K. M. R. & Setaluri, V. Microtubule-associated proteins as targets in cancer chemotherapy. Clin. Cancer Res. 13, 2849–2854 (2007).

  48. He, Z., Liu, H., Moch, H. & Simon, H.-U. Machine learning with autophagy-related proteins for discriminating renal cell carcinoma subtypes. Sci. Rep. 10, 720 (2020).

  49. Martin, S. K., Kamelgarn, M. & Kyprianou, N. Cytoskeleton targeting value in prostate cancer treatment. Am. J. Clin. Exp. Urol. 2, 15–26 (2014).

  50. Kelly, R. S. et al. The role of tumor metabolism as a driver of prostate cancer progression and lethal disease: results from a nested case-control study. Cancer Metab. 4, 22 (2016).

  51. Numakura, K. et al. Successful mammalian target of rapamycin inhibitor maintenance therapy following induction chemotherapy with gemcitabine and doxorubicin for metastatic sarcomatoid renal cell carcinoma. Oncol. Lett. 8, 464–466 (2014).

  52. Pignon, J.-C. et al. Androgen receptor controls EGFR and ERBB2 gene expression at different levels in prostate cancer cell lines. Cancer Res. 69, 2941–2949 (2009).

  53. Reid, A., Vidal, L., Shaw, H. & de Bono, J. Dual inhibition of ErbB1 (EGFR/HER1) and ErbB2 (HER2/neu). Eur. J. Cancer 43, 481–489 (2007).

  54. Gordon, M. S. et al. Phase II study of Erlotinib in patients with locally advanced or metastatic papillary histology renal cell cancer: SWOG S0317. J. Clin. Oncol. 27, 5788–5793 (2009).

  55. Chen, Y.-H. et al. No more discrimination: cross city adaptation of road scene segmenters. In Proc. IEEE International Conference on Computer Vision 1992–2001 (IEEE, 2017).

  56. Costello, J. C. et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat. Biotechnol. 32, 1202–1212 (2014).

  57. Jiang, Y., Rensi, S., Wang, S. & Altman, R. B. DrugOrchestra: jointly predicting drug response, targets, and side effects via deep multi-task learning. Preprint at https://www.biorxiv.org/content/10.1101/2020.11.17.385757v1 (2020).

  58. Pozdeyev, N. et al. Integrating heterogeneous drug sensitivity data from cancer pharmacogenomic studies. Oncotarget 7, 51619–51625 (2016).

  59. Xia, F. et al. A cross-study analysis of drug response prediction in cancer cell lines. Brief. Bioinform. (2021).

  60. Sharifi-Noghabi, H., Liu, Y., Erho, N. & Shrestha, R. Deep genomic signature for early metastasis prediction in prostate cancer. Preprint at https://www.biorxiv.org/content/10.1101/276055v2 (2019).

  61. Torrente, A. et al. Identification of cancer related genes using a comprehensive map of human gene expression. PLoS ONE 11, e0157484 (2016).

  62. Villicaña, C., Cruz, G. & Zurita, M. The basal transcription machinery as a target for cancer therapy. Cancer Cell Int. 14, 18 (2014).

  63. Bailey, M. H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell 174, 1034–1035 (2018).

  64. Joshi, S. K. et al. ERBB2/HER2 mutations are transforming and therapeutically targetable in leukemia. Leukemia 34, 2798–2804 (2020).

  65. Thomas, R. & Weihua, Z. Rethink of EGFR in cancer with its kinase independent function on board. Front. Oncol. 9, 800 (2019).

  66. Nath, S. et al. The prognostic impact of epidermal growth factor receptor (EGFR) in patients with acute myeloid leukaemia. Indian J. Hematol. Blood Transfus. 36, 749–753 (2020).

  67. Iqbal, N. & Iqbal, N. Human epidermal growth factor receptor 2 (HER2) in cancers: overexpression and therapeutic implications. Mol. Biol. Int. 2014, 1–9 (2014).

  68. Goss, G. D. et al. Association of ERBB mutations with clinical outcomes of Afatinib- or Erlotinib-treated patients with lung squamous cell carcinoma: Secondary analysis of the LUX-lung 8 randomized clinical trial. JAMA Oncol. 4, 1189–1197 (2018).

  69. Mammoliti, A. et al. Orchestrating and sharing large multimodal data for transparent and reproducible research. Nat. Commun. 12, 5797 (2021).

  70. Smirnov, P. et al. PharmacoGx: an R package for analysis of large pharmacogenomic datasets. Bioinformatics 32, 1244–1246 (2016).

  71. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Erratum: near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 888 (2016).

  72. Manica, M. et al. Toward explainable anticancer compound sensitivity prediction via multimodal attention-based convolutional encoders. Mol. Pharm. 16, 4797–4806 (2019).

  73. Sun, B. & Saenko, K. Deep CORAL: correlation alignment for deep domain adaptation. In Computer Vision – ECCV 2016 Workshops 443–450 (Springer, 2016).

  74. Sakellaropoulos, T. et al. A deep learning framework for predicting response to therapy in cancer. Cell Rep. 29, 3367–3373.e4 (2019).

  75. Smirnov, P. et al. PharmacoDB: an integrative database for mining in vitro anticancer drug screening studies. Nucl. Acids Res. 46, D994–D1002 (2018).

  76. Sharifi-Noghabi, H., Harjandi, P. A., Zolotareva, O., Collins, C. C. & Ester, M. Velodrome: Out-of-Distribution Generalization from Labeled and Unlabeled Gene Expression Data for Drug Response Prediction (Zenodo, 2021); https://doi.org/10.5281/zenodo.4793442

  77. Sharifi-Noghabi, H. Code Repository hosseinshn/Velodrome: DOI (v1.0.0) (Zenodo, 2021); https://doi.org/10.5281/zenodo.5164625

Acknowledgements

We would like to thank H. Asghari (Ocean Genomics) and S. Peng (Simon Fraser University) for their support. We would also like to thank the Vancouver Prostate Centre and Compute Canada (WestGrid) for providing the computational resources for this research. This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (to M.E.), the Canada Foundation for Innovation (33440 to C.C.C.), the Canadian Institutes of Health Research (PJT-153073 to C.C.C.), the Terry Fox Foundation (201012TFF to C.C.C.) and the Terry Fox New Frontiers Program Project Grants (1062 to C.C.C.).

Author information

Contributions

H.S.-N. and M.E. conceived the study concept and design. H.S.-N. was responsible for the deep learning design, implementation and analysis. H.S.-N. and O.Z. performed data preprocessing, analysis and interpretation. H.S.-N. and P.A.H. performed the experiments. H.S.-N., P.A.H. and O.Z. analysed and interpreted the results. C.C.C. and M.E. supervised the project.

Corresponding author

Correspondence to Martin Ester.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1

The percentage of tissue types in CTRPv2 and GDSCv2 cell line datasets combined.

Supplementary information

Supplementary Information

Supplementary Tables 1–3.

Source data

Source Data Fig. 2

Results of prediction performance for cell lines, PDXs and patients.

Source Data Fig. 3

Results of multiple runs for cell lines, PDXs and patients (Fig. 3A sheets) and the ablation study (Fig. 3B sheet).

Source Data Fig. 4

Results of prediction performance compared with baseline correlations.

Source Data Extended Data Fig. 1

The percentage of tissue types in CTRPv2 and GDSCv2 cell line datasets combined.

About this article

Cite this article

Sharifi-Noghabi, H., Harjandi, P.A., Zolotareva, O. et al. Out-of-distribution generalization from labelled and unlabelled gene expression data for drug response prediction. Nat Mach Intell 3, 962–972 (2021). https://doi.org/10.1038/s42256-021-00408-w

  • DOI: https://doi.org/10.1038/s42256-021-00408-w
