Free alignment classification of dikarya fungi using some machine learning methods

Rohani, Abbas; Mamarabadi, Mojtaba

doi:10.1007/s00521-018-3539-5

Original Article
Published: 17 May 2018

Volume 31, pages 6995–7016, (2019)

Neural Computing and Applications

Abstract

Clustering genes by amino acid sequence similarity is one of the most important and persistently challenging problems in molecular biology. The most conventional methods are based on sequence alignment. These methods cannot identify and classify sequences, especially when the sequences are long and unequal in length. Therefore, to classify fungal hexosaminidase amino acid sequences and assign them to the correct taxonomic group, we evaluate the feasibility of computational alignment-free methods based on machine learning classifiers: SVM, KNN, SOM and an ensemble technique. The classifiers appropriately categorized a large data set of Dikarya hexosaminidase amino acid sequences according to their taxonomic groups in two phyla, the “Ascomycota” and the “Basidiomycota”. Two statistical methods, the paired t test and PCA, were used for feature selection and for reducing the dimensionality of the features, respectively. Seven classifier performance metrics, a randomized complete block design, pairwise Tukey’s honestly significant difference tests and the technique for order preference by similarity to ideal solution, together with modified k-fold cross validation, were used to evaluate and rank the classifiers. The effect of training data size on classifier performance was also investigated. The results showed that both the rank and the performance of the classifiers depended on the training data size. The highest average overall accuracies for training data sizes of 80, 60, 40 and 20% were 96.96, 95.81, 94.47 and 92.47%, obtained with the KNN, KNN, ensemble and ensemble classifiers, respectively.
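To make the idea of alignment-free classification concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each sequence is summarized by its 20-dimensional amino acid composition (the abstract does not state which features were used) and classifies a query with a plain Euclidean k-nearest-neighbor vote. The function names `composition` and `knn_predict` and the toy sequences are hypothetical, and the t test selection and PCA steps mentioned above are omitted.

```python
import math
from collections import Counter

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the relative frequency of each standard amino acid.

    The vector length is always 20, so sequences of unequal length
    become directly comparable without any alignment.
    """
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to `query`.

    `train` is a list of (feature_vector, label) pairs.
    """
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: made-up strings, not real hexosaminidase sequences.
toy = [
    ("MKKLLAAGGA", "Ascomycota"),
    ("MKKLAAAGGL", "Ascomycota"),
    ("WWYYFFHHCC", "Basidiomycota"),
    ("WYYFFHHCCW", "Basidiomycota"),
]
train = [(composition(s), label) for s, label in toy]
prediction = knn_predict(train, composition("MKLLAAGGAK"), k=3)
```

Because the composition vector has a fixed length, the distance computation never needs the sequences to be aligned or padded, which is the property the abstract relies on for long sequences of unequal length.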

Abbreviations

ANN: Artificial neural network
ANOVA: Analysis of variance
ARB: Adaptive rule-based
AUC: Area under an ROC curve
DNA: Deoxyribonucleic acid
Ens: Ensemble classifier
FH: Fungal hexosaminidases
FN: False negatives (number of positive samples incorrectly classified as negative)
FP: False positives (number of negative samples incorrectly classified as positive)
HSD: Honestly significant difference
KNN: K-nearest neighbor
MCC: Matthews correlation coefficient
MCDM: Multi-criteria decision-making
MLP: Multilayer perceptron
NB: Naïve Bayes
PC: Principal component
PCA: Principal component analysis
PNN: Probabilistic neural network
Poly2: Polynomial degree 2
Poly3: Polynomial degree 3
PSO: Particle swarm optimization
RBF: Radial basis function
RCBD: Randomized complete block design
RF: Random forest
SOFM: Self-organizing feature map
SOM: Self-organizing map
SST: Total sum of squares
SSW: Within-groups sum of squares
SVM: Support vector machine
TDS: Training data size
TLCF: Two-layer classification framework
TN: True negatives (number of negative samples correctly classified as negative)
TOPSIS: Technique for order preference by similarity to ideal solution
TP: True positives (number of positive samples correctly classified as positive)
YI: Youden’s index
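
The confusion-matrix counts TP, FP, TN and FN listed above are the inputs to several of the listed metrics. The sketch below uses the standard textbook definitions of accuracy, sensitivity, specificity, MCC and Youden’s index. It is an illustration only: the abstract does not name the seven metrics the study used, and the function name `binary_metrics` is hypothetical. It assumes both classes are present, so that TP + FN and TN + FP are non-zero.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common two-class metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # MCC is undefined when any marginal total is zero; report 0.0 then.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    youden = sensitivity + specificity - 1
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "youden": youden,
    }


# Example with made-up counts (not results from the study).
metrics = binary_metrics(tp=45, fp=5, tn=40, fn=10)
```

MCC and Youden’s index both use all four counts, so they are less sensitive to class imbalance than overall accuracy.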

Acknowledgements

Financial support from the Vice President for Research and Technology of Ferdowsi University of Mashhad is highly appreciated.

Author information

Authors and Affiliations

Department of Biosystems Engineering, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
Abbas Rohani
Department of Plant Protection, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
Mojtaba Mamarabadi

Corresponding author

Correspondence to Abbas Rohani.

About this article

Cite this article

Rohani, A., Mamarabadi, M. Free alignment classification of dikarya fungi using some machine learning methods. Neural Comput & Applic 31, 6995–7016 (2019). https://doi.org/10.1007/s00521-018-3539-5

Received: 13 July 2017
Accepted: 11 May 2018
Published: 17 May 2018
Issue Date: November 2019
DOI: https://doi.org/10.1007/s00521-018-3539-5
