CanDLE: Illuminating Biases in Transcriptomic Pan-Cancer Diagnosis

  • Conference paper
Computational Mathematics Modeling in Cancer Analysis (CMMCA 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13574)

Abstract

Automatic cancer diagnosis from RNA-Seq profiles lies at the intersection of transcriptome analysis and machine learning. Methods developed for this task could provide valuable support in clinical practice and offer insights into the causal mechanisms of cancer. To approach this problem correctly, the largest existing resource (The Cancer Genome Atlas) must be complemented with healthy tissue samples from the Genotype-Tissue Expression project. In this work, we show empirically that previous approaches to joining these databases suffer from translation biases, and we correct them using batch z-score normalization. Moreover, we propose CanDLE, a multinomial logistic regression model that achieves state-of-the-art performance in multilabel cancer/healthy tissue type classification (\(94.1\%\) balanced accuracy) and all-vs-one cancer type detection (\(78.0\%\) average \(\max F_1\)).
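The two ingredients named in the abstract, batch z-score normalization and multinomial logistic regression, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that "batch" means the source database of each sample (for example TCGA versus GTEx) and that each gene is standardized within its own batch, which is one plausible reading of the abstract. The function names (`batch_zscore`, `fit_multinomial_logreg`, `predict`), the plain gradient-descent optimizer, and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np


def batch_zscore(x, batch_ids, eps=1e-8):
    """Standardize every gene (column) separately within each batch.

    x: array of shape (n_samples, n_genes) with expression values.
    batch_ids: array of shape (n_samples,) naming each sample's source.
    """
    x = np.asarray(x, dtype=float)
    batch_ids = np.asarray(batch_ids)
    out = np.empty_like(x)
    for b in np.unique(batch_ids):
        rows = batch_ids == b
        mean = x[rows].mean(axis=0)
        std = x[rows].std(axis=0)
        # eps guards against genes that are constant within a batch.
        out[rows] = (x[rows] - mean) / (std + eps)
    return out


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def fit_multinomial_logreg(x, y, n_classes, lr=0.1, n_steps=500):
    """Fit a softmax (multinomial logistic) regression by gradient descent.

    Minimizes the mean cross-entropy; y holds integer class labels.
    """
    n_samples, n_genes = x.shape
    w = np.zeros((n_genes, n_classes))
    b = np.zeros(n_classes)
    one_hot = np.eye(n_classes)[y]
    for _ in range(n_steps):
        probs = softmax(x @ w + b)
        # Gradient of the mean cross-entropy with respect to the logits.
        grad = (probs - one_hot) / n_samples
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b


def predict(x, w, b):
    """Return the most probable class index for each sample."""
    return softmax(x @ w + b).argmax(axis=1)
```

In this sketch, a combined expression matrix would first be passed through `batch_zscore` together with a per-sample source label, and the normalized matrix would then be given to `fit_multinomial_logreg`. Standardizing within each source removes per-source shifts in location and scale, so the classifier cannot separate samples merely by which database they came from.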

Acknowledgement

GM acknowledges the support of a UniAndes-DeepMind Scholarship 2022. We also acknowledge the valuable help of Camilo Becerra in graphics and tables preparation, and Danniel Moreno for useful discussions and feedback.

Author information

Corresponding author

Correspondence to Gabriel Mejía.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mejía, G., Bloch, N., Arbelaez, P. (2022). CanDLE: Illuminating Biases in Transcriptomic Pan-Cancer Diagnosis. In: Qin, W., Zaki, N., Zhang, F., Wu, J., Yang, F. (eds) Computational Mathematics Modeling in Cancer Analysis. CMMCA 2022. Lecture Notes in Computer Science, vol 13574. Springer, Cham. https://doi.org/10.1007/978-3-031-17266-3_7

  • DOI: https://doi.org/10.1007/978-3-031-17266-3_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17265-6

  • Online ISBN: 978-3-031-17266-3

  • eBook Packages: Computer Science (R0)