Abstract
Identifying genes associated with neurological disorders such as Parkinson's disease (PD) is an important task in biomedical research. In recent years, machine learning techniques have been used extensively to identify disease-associated genes. However, these methods have relied on protein–protein interaction, gene expression, and gene ontology data, and the prior knowledge used to construct features for each gene from such data may be incomplete. In this study, the physicochemical properties of amino acids, which serve as universal knowledge, are therefore used to extract features from the sequences. Several machine learning models are then used to classify genes associated with PD, and an ensemble method built from the four highest-performing classifiers is designed to improve diagnostic accuracy. The comparative analysis shows that gradient boosting performs best among the individual models, with an accuracy of 77.50% and an area under the curve of 0.774, outperforming the other six methods. The ensemble method, however, achieves an accuracy of 83.75%. The ensemble method is also evaluated against existing disease gene identification methods, and the results suggest that it is more accurate and effective for identifying PD genes. Re-sampling techniques for resolving class imbalance have been shown to increase classification accuracy by reducing the bias introduced by differences in class size. The proposed model can also be used as a prediction tool for Alzheimer’s disease protein sequences.
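The pipeline outlined in the abstract (sequence-derived physicochemical features, a vote across the strongest classifiers, and re-sampling of imbalanced classes) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state which physicochemical properties, voting rule, or re-sampling scheme were used, so the Kyte-Doolittle hydropathy scale, the hard majority vote with ties going to the positive class, and plain random oversampling are all assumptions made here for illustration, as are the function names.

```python
import random
from collections import Counter

# Kyte-Doolittle hydropathy index: one well-known physicochemical scale,
# chosen here as an assumption (the abstract does not name the properties used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Return [mean hydropathy, fraction of hydrophobic residues, length]."""
    # Non-standard residue codes (e.g. X) are skipped rather than guessed.
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    mean_hydropathy = sum(values) / len(values)
    frac_hydrophobic = sum(v > 0 for v in values) / len(values)
    return [mean_hydropathy, frac_hydrophobic, float(len(values))]


def majority_vote(predictions):
    """Hard vote over binary labels; ties go to the positive class (assumption)."""
    votes = Counter(predictions)
    return 1 if votes[1] >= votes[0] else 0


def random_oversample(X, y, seed=0):
    """Duplicate random rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out
```

In use, each protein sequence would be mapped to a feature vector with `sequence_features`, the training set balanced with `random_oversample`, and the binary predictions of the four best classifiers for a gene passed to `majority_vote`. A real study would likely use many more descriptors and might prefer probability-weighted voting.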
Data availability
Data will be made available on request.
Ethics declarations
Conflict of interest
None.
About this article
Cite this article
Arora, P., Mishra, A. & Malhi, A. An Ensemble Machine Learning Method Highlights Possible Parkinson’s Disease Genes and Accessing Performance of Re-sampling Techniques. SN COMPUT. SCI. 5, 483 (2024). https://doi.org/10.1007/s42979-024-02805-5