Abstract
Type 2 diabetes mellitus (T2D) remains a critical health concern, particularly in its early stages such as prediabetes. Understanding these early stages is paramount for improving patient outcomes. Multiomics data integration tools offer promise in unraveling the underlying mechanisms of T2D. The advent of high-throughput technology and the increasing availability of multiomics data have led to the development of several statistical and network-based integration methods. However, the performance of such methods varies, so their outputs need to be evaluated in an unbiased manner. Here, we conducted a comparative analysis of three representative unsupervised multiomics integration tools, MOFA+, GFA, and iCluster, alongside an in-house supervised model, EMFR, using two complementary benchmarks. First, we assessed how well the features selected by each tool could discriminate between patient and control samples using both linear and nonlinear classification models. Second, we quantified how much the features selected from each omics data type contributed to the total variance. Comparing the unsupervised tools in this way, we observed that the features selected by MOFA+ and GFA gave the best F1 score (0.7) in the nonlinear classification model, clearly discriminating between patient and control classes. Hence, we recommend these two unsupervised integration tools for feature selection purposes. Our comparative analyses were conducted on a real biological dataset to further study prediabetes patients. Such multiomics data enabled the detection of prediabetes subtypes and provided several clinical insights that will open a path toward personalized medicine for diabetes.
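The two benchmarks named in the abstract rest on two simple quantities: the F1 score of a classifier trained on the selected features, and the share of total variance attributable to each omics data type. The sketch below illustrates both using only the Python standard library. It is a generic illustration, not the authors' pipeline: the function names, the toy labels, and the view names are hypothetical, and the summed per-feature variance is a simplified stand-in for whatever variance decomposition the study actually applied.

```python
from statistics import pvariance


def f1_score(y_true, y_pred, positive=1):
    """Return F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def variance_share(views):
    """Return the fraction of total feature variance contributed by each view.

    `views` maps a view name to a list of feature columns, where each column
    is a list of per-sample values for one selected feature.
    """
    per_view = {
        name: sum(pvariance(column) for column in columns)
        for name, columns in views.items()
    }
    total = sum(per_view.values())
    return {name: value / total for name, value in per_view.items()}


# Hypothetical toy data: 1 = patient, 0 = control.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_score(y_true, y_pred))

# Hypothetical selected features, grouped by omics view.
toy_views = {
    "transcriptome": [[0.0, 2.0, 0.0, 2.0]],
    "metabolome": [[0.0, 4.0, 0.0, 4.0]],
}
print(variance_share(toy_views))
```

In practice the predictions would come from a linear and a nonlinear classifier fitted on each tool's selected features, and the F1 scores would be compared across tools.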







Data availability
The dataset supporting the findings of this study is available in the GitHub repository at https://github.com/ahmedtariq/MultiOmic-Ensembled-Feature-Reduction and was retrieved from the Integrative Human Microbiome Project (iHMP; https://portal.hmpdacc.org; T2D iHMP Google Cloud platform).
Acknowledgements
We thank the reviewers for the time and effort they devoted to reviewing the manuscript. We appreciate their valuable comments and suggestions, which helped us improve its quality.
Funding
The main author ME acknowledges funding by the “la Caixa” Foundation (ID 100010434), within the Doctoral INPhINIT Program LCF/BQ/D122/11940015. AA was partially supported by the Strategic Funding U-IDB/04423/2020 and UIDP/04423/2020 through national funds provided by the Fundação para a Ciência e a Tecnologia (FCT) and the European Regional Development Fund (ERDF) in the framework of the program PT2020, by the European Structural and Investment Funds (ESIF) through the Competitiveness and Internationalization Operational Program—COMPETE 2020 and by National Funds through the FCT under the projects PTDC/CTA-AMB/31774/2017 (POCI-01–0145-FEDER/031774/2017). ME, AT, MH and MS were funded via two DAAD grants (1) GED-PerMED Z57546888: German-Egyptian Dialog on Tackling Precision Medicine using Artificial Intelligence, and (2) Eg-CompBio Z57587968: Empowering computational biology and bioinformatics research in Egypt, both funded by the DAAD (German Academic Exchange Service) in Germany.
Author information
Contributions
Mohamed Emam: conceptualization, methodology, formal analysis, investigation, writing—original draft, data curation. Ahmed Tarek: formal analysis, writing—review and editing, methodology. Mohamed Soudy: formal analysis, writing—review and editing, methodology. Agostinho Antunes: conceptualization, supervision, funding acquisition, software, visualization, writing—review and editing. Mohamed El Hadid: project administration, funding acquisition, supervision, methodology, writing—review and editing. Mohamed Hamed: conceptualization, project administration, funding acquisition, supervision, methodology, review and editing.
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Ethical statement
The study is based on multiomics integration analysis, and the data were retrieved from publicly available databases.
About this article
Cite this article
Emam, M., Tarek, A., Soudy, M. et al. Comparative evaluation of multiomics integration tools for the study of prediabetes: insights into the earliest stages of type 2 diabetes mellitus. Netw Model Anal Health Inform Bioinforma 13, 8 (2024). https://doi.org/10.1007/s13721-024-00442-9
DOI: https://doi.org/10.1007/s13721-024-00442-9