Abstract
Automatic cancer diagnosis based on RNA-Seq profiles lies at the intersection of transcriptome analysis and machine learning. Methods developed for this task could provide valuable support in clinical practice and offer insights into the causal mechanisms of cancer. To approach this problem correctly, the largest existing resource (The Cancer Genome Atlas) must be complemented with healthy tissue samples from the Genotype-Tissue Expression project. In this work, we show empirically that previous approaches to joining these databases suffer from translation biases, and we correct them using batch z-score normalization. Moreover, we propose CanDLE, a multinomial logistic regression model that achieves state-of-the-art performance in multilabel cancer/healthy tissue type classification (\(94.1\%\) balanced accuracy) and all-vs-one cancer type detection (\(78.0\%\) average \(\max F_1\)).
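As an illustration of the batch z-score idea mentioned above, the following is a minimal sketch, not the authors' implementation. It assumes that "batch" means the source dataset of each sample and that each gene is standardized within its batch; the function name `batch_zscore`, the `eps` parameter, and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def batch_zscore(samples, batches, eps=1e-8):
    """Standardize each gene within each batch (source dataset) separately.

    samples: list of expression vectors, one list of floats per sample
    batches: list of batch labels (e.g. "TCGA" or "GTEx"), one per sample
    eps:     small constant that avoids division by zero for constant genes
    """
    n_genes = len(samples[0])
    normalized = [None] * len(samples)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        # Per-gene mean and standard deviation computed inside this batch only.
        mu = [mean([samples[i][g] for i in idx]) for g in range(n_genes)]
        sd = [pstdev([samples[i][g] for i in idx]) for g in range(n_genes)]
        for i in idx:
            normalized[i] = [
                (samples[i][g] - mu[g]) / (sd[g] + eps) for g in range(n_genes)
            ]
    return normalized


# Toy data (hypothetical values): two genes, with the second batch
# shifted by a constant offset that mimics a source-specific bias.
samples = [[1.0, 10.0], [3.0, 14.0], [101.0, 110.0], [103.0, 114.0]]
batches = ["TCGA", "TCGA", "GTEx", "GTEx"]
normalized = batch_zscore(samples, batches)
```

In this toy case the constant offset between the two batches disappears after normalization, so a classifier can no longer separate samples by source from the offset alone. Any further preprocessing (for example a log transform) is outside what this sketch assumes.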
Acknowledgement
GM acknowledges the support of a UniAndes-DeepMind Scholarship 2022. We also acknowledge the valuable help of Camilo Becerra in graphics and tables preparation, and Danniel Moreno for useful discussions and feedback.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mejía, G., Bloch, N., Arbelaez, P. (2022). CanDLE: Illuminating Biases in Transcriptomic Pan-Cancer Diagnosis. In: Qin, W., Zaki, N., Zhang, F., Wu, J., Yang, F. (eds) Computational Mathematics Modeling in Cancer Analysis. CMMCA 2022. Lecture Notes in Computer Science, vol 13574. Springer, Cham. https://doi.org/10.1007/978-3-031-17266-3_7
DOI: https://doi.org/10.1007/978-3-031-17266-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17265-6
Online ISBN: 978-3-031-17266-3
eBook Packages: Computer Science, Computer Science (R0)