Abstract
Clustering genes by amino acid sequence similarity is one of the most important and persistently challenging problems in molecular biology. Most conventional methods are alignment-based. These methods cannot reliably identify and classify sequences, especially when the sequences are long and of unequal length. Therefore, to classify fungal hexosaminidase amino acid sequences and assign them to the correct taxonomic group, we evaluated the feasibility of alignment-free computational methods based on machine learning classifiers, namely SVM, KNN, SOM and an ensemble technique. The classifiers appropriately categorized a large data set of Dikarya hexosaminidase amino acid sequences according to their taxonomic groups in two phyla, the “Ascomycota” and the “Basidiomycota”. Two statistical methods, the paired t test and PCA, were used for feature selection and for reducing the dimensionality of the features, respectively. Seven classifier performance metrics, a randomized complete block design, pairwise Tukey’s honestly significant difference tests and the technique for order preference by similarity to ideal solution, together with a modified k-fold cross validation, were used to evaluate and rank the classifiers. The effect of training data size on classifier performance was also investigated. The results showed that both the rank and the performance of the classifiers depended on the training data size. For training data sizes of 80, 60, 40 and 20%, the highest average overall accuracies were 96.96, 95.81, 94.47 and 92.47%, obtained with the KNN, KNN, ensemble and ensemble classifiers, respectively.
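The ranking step named above, the technique for order preference by similarity to ideal solution (TOPSIS), can be illustrated with a minimal sketch. It assumes the common textbook variant with vector normalization and Euclidean distances; the function names, the equal weights and all numeric values in the usage note are hypothetical and are not taken from the study.

```python
import math


def topsis_scores(matrix, weights, benefit):
    """Return TOPSIS closeness scores, one per alternative (row).

    matrix  : list of rows, one row per classifier, one column per criterion
    weights : criterion weights (assumed here, not taken from the paper)
    benefit : True if larger is better for that criterion, False if smaller is
    """
    n_crit = len(weights)
    # Vector-normalize each column; fall back to 1.0 for an all-zero column.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    columns = list(zip(*weighted))
    # Ideal and anti-ideal points, taking the criterion direction into account.
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)  # distance to the ideal solution
        d_neg = math.dist(row, anti)   # distance to the anti-ideal solution
        total = d_pos + d_neg
        # If all alternatives coincide, no ranking is possible; use 0.5.
        scores.append(d_neg / total if total else 0.5)
    return scores


def rank_classifiers(names, matrix, weights, benefit):
    """Pair each name with its score and sort from best to worst."""
    scores = topsis_scores(matrix, weights, benefit)
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
```

For example, `rank_classifiers(["A", "B", "C"], [[0.9, 0.8, 0.1], [0.7, 0.6, 0.2], [0.5, 0.4, 0.3]], [1/3, 1/3, 1/3], [True, True, False])` treats the first two columns as benefit criteria (such as accuracy) and the third as a cost criterion (such as error rate), and ranks the hypothetical classifier A first because it is best on every criterion.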
Abbreviations
- ANN: Artificial neural network
- ANOVA: Analysis of variance
- ARB: Adaptive rule-based
- AUC: Area under an ROC curve
- DNA: Deoxyribonucleic acid
- Ens: Ensemble classifier
- FH: Fungal hexosaminidases
- FN: False negatives (number of positive samples classified as negative)
- FP: False positives (number of negative samples classified as positive)
- HSD: Honestly significant difference
- KNN: K-nearest neighbor
- MCC: Matthews correlation coefficient
- MCDM: Multi-criteria decision-making
- MLP: Multilayer perceptron
- NB: Naïve Bayes
- PC: Principal component
- PCA: Principal component analysis
- PNN: Probabilistic neural network
- Poly2: Polynomial degree 2
- Poly3: Polynomial degree 3
- PSO: Particle swarm optimization
- RBF: Radial basis function
- RCBD: Randomized complete block design
- RF: Random forest
- SOFM: Self-organizing feature map
- SOM: Self-organizing map
- SST: Total sum of squares
- SSW: Within-groups sum of squares
- SVM: Support vector machine
- TDS: Training data size
- TLCF: Two-layer classification framework
- TN: True negatives (number of negative samples classified as negative)
- TOPSIS: Technique for order preference by similarity to ideal solution
- TP: True positives (number of positive samples classified as positive)
- YI: Youden’s index
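The confusion-matrix abbreviations above (TP, FP, TN, FN) feed directly into measures such as MCC and YI. The following is a minimal sketch using the standard definitions of these quantities; this section does not enumerate the seven metrics used in the study, so the particular selection below (accuracy, sensitivity, specificity, YI, MCC) and the function name `binary_metrics` are assumptions made for illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common two-class metrics from confusion-matrix counts.

    Uses the standard definitions; which metrics the study actually
    reported is not stated here, so this selection is illustrative.
    """
    total = tp + fp + tn + fn
    # Sensitivity (true positive rate) and specificity (true negative rate).
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # Youden's index: sensitivity + specificity - 1.
    youden = sensitivity + specificity - 1.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_index": youden,
        "mcc": mcc,
    }
```

With the hypothetical counts `binary_metrics(tp=45, fp=5, tn=40, fn=10)`, the accuracy is 0.85 because 85 of the 100 samples are classified correctly.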
Acknowledgements
Financial support from the vice president for research and technology of Ferdowsi University of Mashhad is highly appreciated.
Cite this article
Rohani, A., Mamarabadi, M. Free alignment classification of dikarya fungi using some machine learning methods. Neural Comput & Applic 31, 6995–7016 (2019). https://doi.org/10.1007/s00521-018-3539-5
DOI: https://doi.org/10.1007/s00521-018-3539-5