Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture

  • Original Article
Medical & Biological Engineering & Computing

Abstract

Predicting the stage of a cancer plays an important role in planning the course of treatment, but it has relied largely on imaging tools, which do not capture the molecular events that drive cancer progression. Gene-expression-based analyses can identify these events, which allows RNA-sequencing and microarray data to be used for cancer analysis. Breast cancer is the most common cancer worldwide and is classified into four stages: stages 1, 2, 3, and 4 [2]. Machine learning models have previously been explored for stage classification with limited success, and multi-class stage classification in particular has not seen significant progress. Improved multi-class classification models are needed, and deep learning models are one avenue to investigate. Gene-expression-based cancer data is characterised by small dataset sizes, class imbalance, and high dimensionality, so class balancing methods must be applied to the dataset. Since not all genes are necessary for stage prediction, retaining only the necessary genes can improve classification accuracy. In this work, breast cancer samples are classified into four classes corresponding to stages 1 to 4. Invasive ductal carcinoma (IDC) breast cancer samples are obtained from The Cancer Genome Atlas (TCGA) and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets and combined. Two class balancing techniques are explored: the synthetic minority oversampling technique (SMOTE), and SMOTE followed by random undersampling. A hybrid feature selection approach is proposed, and three pipelines combining filter and embedded feature selection methods are explored: Pipeline 1, minimum-redundancy maximum-relevancy (mRMR) and correlation feature selection (CFS); Pipeline 2, mRMR, mutual information (MI), and CFS; and Pipeline 3, mRMR and support vector machine–recursive feature elimination (SVM-RFE).
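The class balancing step (SMOTE followed by random undersampling) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names `smote` and `balance`, the `target` class size, the neighbour count `k`, and the toy data are all assumptions made for illustration, and a real analysis would more likely use an established library.

```python
import math
import random
from collections import Counter


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` within the same class (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, target, k=5, seed=0):
    """Oversample classes below `target` with SMOTE, then randomly
    undersample classes above `target`, so every class has `target` rows."""
    rng = random.Random(seed)
    X_out, y_out = [], []
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        if len(rows) < target:
            rows = rows + smote(rows, target - len(rows), k, rng)
        elif len(rows) > target:
            rows = rng.sample(rows, target)
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
    return X_out, y_out


# Toy data (hypothetical): two features standing in for gene-expression
# values, with imbalanced stage labels 1, 2 and 3.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.2], [0.4, 0.1],
     [1.0, 1.1], [1.2, 0.9], [1.1, 1.3],
     [2.0, 2.1], [2.2, 1.9]]
y = [1] * 6 + [2] * 3 + [3] * 2

X_bal, y_bal = balance(X, y, target=4, k=2)
class_counts = Counter(y_bal)
print(class_counts)
```

Synthetic points lie on line segments between existing samples of the same class, so oversampling adds no information from other classes. In practice, balancing should be applied to the training split only, so that synthetic samples do not leak into evaluation data.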
Classification is performed using deep learning models, namely a deep neural network (DNN), a convolutional neural network, a recurrent neural network, a modified DNN, and an AutoKeras-generated model. Classification performance after class balancing and feature selection shows marked improvement over classification prior to feature selection. The best multi-class classification was achieved by a DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving a Cohen's kappa score (CKS) of 0.303 and a classification accuracy of 53.1%. For binary classification into early-stage and late-stage cancer, the best performance was obtained by a modified DNN after SMOTE and random undersampling, with feature selection using mRMR and recursive feature elimination, giving an accuracy of 81.0% and a CKS of 0.280. This pipeline also improved multi-class classification performance on neuroblastoma data, with a best area under the receiver operating characteristic curve (auROC) of 0.872, compared with 0.71 obtained in previous work, a relative improvement of about 22.8%. The results and analysis show that feature selection techniques play a vital role in gene-expression-based classification and that the proposed hybrid feature selection pipeline improves classification performance. Multi-class classification is possible using deep learning models, though further improvement, particularly in late-stage classification, is necessary and should be explored.
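The abstract reports both accuracy and Cohen's kappa, and the two can diverge sharply on imbalanced data. The sketch below, using only the standard library, computes kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The helper name `cohen_kappa` and the label lists are hypothetical examples, not data from the study. The last lines reproduce the relative auROC improvement from the two values quoted in the abstract.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of samples where prediction matches truth.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class frequencies, summed.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical imbalanced binary case: always predicting the majority class
# gives 80% accuracy but no agreement beyond chance.
y_true_bin = [0] * 8 + [1] * 2
y_pred_bin = [0] * 10
accuracy_bin = sum(t == p for t, p in zip(y_true_bin, y_pred_bin)) / len(y_true_bin)
kappa_bin = cohen_kappa(y_true_bin, y_pred_bin)

# Hypothetical balanced four-class case with two misclassifications.
y_true_multi = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred_multi = [1, 1, 2, 3, 3, 3, 4, 1]
kappa_multi = cohen_kappa(y_true_multi, y_pred_multi)

# Relative auROC improvement from the values quoted in the abstract.
relative_gain = (0.872 - 0.71) / 0.71  # roughly 0.228

print(accuracy_bin, kappa_bin, kappa_multi, relative_gain)
```

This is why a binary accuracy of 81.0% can coexist with a modest kappa: when one class dominates, much of the raw agreement is already expected by chance.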

Data availability

The dataset used in this research work is obtained from METABRIC [6] and TCGA [31].

References

  1. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Zheng X (2015) TensorFlow: Large-scale machine learning on heterogeneous systems. Retrieved July 31, 2023, from https://www.tensorflow.org/

  2. Ahmed O, Brifcani A (2019, April) Gene expression classification based on deep learning. In 2019 4th Scientific International Conference Najaf (SICN). IEEE, pp 145–149

  3. American Cancer Society (2021, June 28) Stages of breast cancer: Understand breast cancer staging. Retrieved October 25, 2021, from https://www.cancer.org/cancer/breast-cancer/understanding-a-breast-cancer-diagnosis/stages-of-breast-cancer.html

  4. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A (2013) NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res 41:D991–D995

  5. Castillo D, Gálvez JM, Herrera LJ, Román BS, Rojas F, Rojas I (2017) Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling. BMC Bioinformatics 18(1):506

  6. cBioPortal for Cancer Genomics (2016) Breast cancer (METABRIC, Nature 2012 & Nat Commun 2016). Retrieved May 25, 2022, from http://www.cbioportal.org/study/summary?id=brca_metabric

  7. Daoud M, Mayo M (2019) A survey of neural network-based cancer prediction models from microarray data. Artif Intell Med 97:204–214

  8. Dertat A (2017, October 9) Applied deep learning — part 1: Artificial neural networks. Medium. Retrieved October 25, 2021, from https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6

  9. Ding C, Peng H (2005) Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 3(02):185–205

  10. Fathi H, AlSalman H, Gumaei A, Manhrawy II, Hussien AG, El-Kafrawy P (2021) An efficient cancer classification model using microarray and high-dimensional data. Comput Intell Neurosci 2021

  11. Goodfellow I, Bengio Y, Courville A (2016) Deep learning. Retrieved October 25, 2021, from https://www.deeplearningbook.org

  12. Google Developers (2020, Feb 11) Classification: Precision and recall | Machine learning crash course. https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall

  13. Google Developers (n.d.) Classification: ROC curve and AUC. Retrieved May 25, 2022, from https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

  14. Google (n.d.) Google Colab. Google Colaboratory. Retrieved May 25, 2022, from https://research.google.com/colaboratory/faq.html

  15. Gosain A, Sardana S (2017) Handling class imbalance problem using oversampling techniques: A review. 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI)

  16. Griffith M, Walker J, Spies N, Ainscough B, Griffith O (2015) Informatics for RNA sequencing: a Web resource for analysis on the cloud. PLoS Comput Biol 11(8):e1004393

  17. Hambali MA, Oladele TO, Adewole KS (2020) Microarray cancer feature selection: Review, challenges and research directions. Int J Cogn Comput Eng 1:78–97

  18. IBM Cloud Education (2020) What is deep learning? IBM. Retrieved October 25, 2021, from https://www.ibm.com/cloud/learn/deep-learning

  19. Jin H, Chollet F, Song Q, Hu X (2023) AutoKeras: an AutoML library for deep learning. J Mach Learn Res 24(6):1–6

  20. Liang H, Zhou G, Lv L et al (2021) KRAS expression is a prognostic indicator and associated with immune infiltration in breast cancer. Breast Cancer 28:379–386. https://doi.org/10.1007/s12282-020-01170-4

  21. Lin Z, Ou-Yang L (2023) Inferring gene regulatory networks from single-cell gene expression data via deep multi-view contrastive learning. Brief Bioinform 24(1):bbac586. https://doi.org/10.1093/bib/bbac586

  22. Mignone P, Pio G, D’Elia D, Ceci M (2020) Exploiting transfer learning for the reconstruction of the human gene regulatory network. Bioinformatics 36(5):1553–1561. https://doi.org/10.1093/bioinformatics/btz781

  23. Park A, Nam S (2019) Deep learning for stage prediction in neuroblastoma using gene expression data. Genom Inform 17(3)

  24. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Duchesnay É (2011) Scikit-learn: Machine learning in python. J Mach Learn Res 12:2825–2830

  25. Pereira B, Chin SF, Rueda O et al (2016) The somatic mutation profiles of 2,433 breast cancers refine their genomic and transcriptomic landscapes. Nat Commun 7:11479. https://doi.org/10.1038/ncomms11479

  26. Rajbhandari P, Lopez G, Capdevila C, Salvatori B et al (2018) Cross-cohort analysis identifies a TEAD4-MYCN positive feedback loop as the core regulatory element of high-risk neuroblastoma. Cancer Discov 8(5):582–599

  27. Roy S, Kumar R, Mittal V, Gupta D (2020) Classification models for invasive ductal carcinoma progression, based on gene expression data-trained supervised machine learning. Sci Rep 10(1):1–15

  28. Scitable by Nature Education (2014) Gene Expression Is Analyzed by Tracking RNA. Retrieved May 25, 2022, from https://www.nature.com/scitable/topicpage/gene-expression-is-analyzed-by-tracking-rna-6525038/

  29. Sun L, Kong X, Xu J, Zhai R, Zhang S (2019) A hybrid gene selection method based on ReliefF and ant colony optimization algorithm for tumor classification. Sci Rep 9(1):1–14

  30. Suzuki E, Sugimoto M, Kawaguchi K et al (2019) Gene expression profile of peripheral blood mononuclear cells may contribute to the identification and immunological classification of breast cancer patients. Breast Cancer 26:282–289. https://doi.org/10.1007/s12282-018-0920-2

  31. The Cancer Genome Atlas Program (n.d.) National Cancer Institute. Retrieved May 25, 2022, from https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga

  32. UICC (2022) UICC and the TNM classification of malignant tumours. UICC. Retrieved May 25, 2022, from https://www.uicc.org/who-we-are/about-uicc/uicc-and-tnm-classification-malignant-tumours

  33. Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH (2018) Relief-based feature selection: introduction and review. J Biomed Inform 85:189–203

  34. Viera AJ, Garrett JM (2005) Understanding interobserver agreement: the kappa statistic. Fam Med 37(5):360–363

  35. World Health Organization (2021) Breast cancer. World Health Organization. Retrieved October 25, 2021, from https://www.who.int/news-room/fact-sheets/detail/breast-cancer

  36. Yao F, Zhang C, Du W, Liu C, Xu Y (2015) Identification of gene-expression signatures and protein markers for breast cancer grading and staging. PLoS One 10(9):e0138213

  37. Yuan F, Lu L, Zou Q (2020) Analysis of gene expression profiles of lung cancer subtypes with machine learning algorithms. Biochimica et Biophysica Acta (BBA)-Mol Basis Dis 1866(8):165822

  38. Yang ZJ, Yu Y, Chi JR et al (2018) The combined pN stage and breast cancer subtypes in breast cancer: a better discriminator of outcome can be used to refine the 8th AJCC staging manual. Breast Cancer 25:315–324. https://doi.org/10.1007/s12282-018-0833-0

  39. Zhong L, Meng Q, Chen Y, Du L, Wu P (2021) A laminar augmented cascading flexible neural forest model for classification of cancer subtypes based on gene expression data. BMC Bioinformatics 22(1):1–17. https://doi.org/10.1186/s12859-021-04391-2

Author information

Contributions

Akash Kishore: literature survey, dataset collection, data preparation, and implementation. Lokeswari Y Venkataramana: evaluating the implementation, updating the manuscript, and reviewing the work. D Venkata Vara Prasad: reviewing the paper and suggesting different variations of the dataset. Akshaya Mohan: literature survey, implementation, and writing the manuscript. Bhavya Jha: literature survey, implementation, and preparing figures and tables.

Corresponding author

Correspondence to Lokeswari Venkataramana.

Ethics declarations

Ethics approval

This article does not contain any studies with human participants or animals performed by any of the authors. All the authors have agreed to publish this manuscript. The manuscript has not been submitted to, and is not under consideration by, any other journal.

Informed consent

Informed consent is not necessary as this article does not involve human or animal participants.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kishore, A., Venkataramana, L., Prasad, D.V. et al. Enhancing the prediction of IDC breast cancer staging from gene expression profiles using hybrid feature selection methods and deep learning architecture. Med Biol Eng Comput 61, 2895–2919 (2023). https://doi.org/10.1007/s11517-023-02892-1

  • DOI: https://doi.org/10.1007/s11517-023-02892-1
