Abstract
This paper presents a protocol for integrating the two most common types of biological data, clinical and molecular, to classify patients with cancer more effectively. In this protocol, the most informative features are identified with statistical and information-theory-based selection methods for molecular data and with the Boruta algorithm for clinical data. Predictive models are built with the Random Forest classification algorithm. Data integration consists of combining the most informative clinical features with synthetic features obtained from genetic marker models and using them together as input variables for the classifier.
We applied this classification protocol to METABRIC breast cancer samples. Clinical data, gene expression data, and somatic copy number aberration data were used to predict the clinical endpoint. We tested various methods for combining information from the different datasets. Our research shows that both types of molecular data contain features that are relevant for clinical endpoint prediction. The best model was obtained by using ten clinical features and two synthetic features obtained from biomarker models. In the examined cases, the method used to filter molecular markers had only a small impact on the predictive power of the models, even though the resulting lists of top informative biomarkers diverge.
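The integration step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the toy data are randomly generated, a Welch t-statistic filter stands in for the molecular feature selection, a nearest-centroid score stands in for the Random Forest model, and the synthetic feature is computed out-of-fold. All function names (`welch_t`, `top_features`, `synthetic_feature`) are hypothetical.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)


def top_features(X, y, k):
    """Indices of the k columns with the largest |t| between the two classes."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(pos, neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_centroids(X, y, cols):
    """Per-class mean of the selected columns (stand-in for a trained model)."""
    def centroid(label):
        rows = [r for r, l in zip(X, y) if l == label]
        return [statistics.mean(r[j] for r in rows) for j in cols]
    return centroid(0), centroid(1)


def centroid_score(x, cols, c0, c1):
    """Squared distance to class 0 minus squared distance to class 1."""
    d0 = sum((x[j] - m) ** 2 for j, m in zip(cols, c0))
    d1 = sum((x[j] - m) ** 2 for j, m in zip(cols, c1))
    return d0 - d1  # larger values mean the sample looks more like class 1


def synthetic_feature(X, y, k=5, folds=3):
    """One out-of-fold score per sample; selection and fitting use training folds only."""
    out = [0.0] * len(X)
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        test = [i for i in range(len(X)) if i % folds == f]
        Xt, yt = [X[i] for i in train], [y[i] for i in train]
        cols = top_features(Xt, yt, k)
        c0, c1 = fit_centroids(Xt, yt, cols)
        for i in test:
            out[i] = centroid_score(X[i], cols, c0, c1)
    return out


def toy_block(rng, y, n_features, n_informative, shift=2.0):
    """Random data in which the first n_informative columns depend on the class."""
    return [[rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
             for j in range(n_features)] for label in y]


rng = random.Random(0)
y = [i % 2 for i in range(60)]             # toy binary clinical endpoint
expression = toy_block(rng, y, 50, 3)      # toy "gene expression" block
cna = toy_block(rng, y, 40, 2)             # toy "copy number aberration" block
clinical = toy_block(rng, y, 3, 1)         # toy pre-selected clinical features

# One synthetic feature per molecular data type, appended to the clinical features.
s_expr = synthetic_feature(expression, y)
s_cna = synthetic_feature(cna, y)
integrated = [c + [e, a] for c, e, a in zip(clinical, s_expr, s_cna)]
```

Each row of `integrated` holds the clinical features followed by one synthetic feature per molecular data type, which is the kind of input the final classifier would receive. Computing the synthetic features out-of-fold keeps the endpoint labels of a sample from leaking into its own score; whether the original protocol uses this exact scheme is not stated in the abstract.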
Copyright information
© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Polewko-Klim, A., Rudnicki, W.R. (2020). Data Integration Strategy for Robust Classification of Biomedical Data. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S., Orovic, I., Moreira, F. (eds) Trends and Innovations in Information Systems and Technologies. WorldCIST 2020. Advances in Intelligent Systems and Computing, vol 1160. Springer, Cham. https://doi.org/10.1007/978-3-030-45691-7_56
DOI: https://doi.org/10.1007/978-3-030-45691-7_56
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45690-0
Online ISBN: 978-3-030-45691-7
eBook Packages: Intelligent Technologies and Robotics (R0)