Free alignment classification of dikarya fungi using some machine learning methods

  • Original Article
Neural Computing and Applications

Abstract

Clustering genes by amino acid sequence similarity has long been one of the most important and challenging problems in molecular biology. The most conventional methods are based on sequence alignment. These methods cannot reliably identify and classify sequences, especially when the sequences are long and of unequal length. To classify fungal hexosaminidase amino acid sequences into the correct taxonomic groups, we therefore evaluated the feasibility of computational alignment-free methods based on machine learning classifiers: SVM, KNN, SOM and an ensemble technique. The classifiers appropriately categorized a large data set of Dikarya hexosaminidase amino acid sequences according to their taxonomic groups in two phyla, Ascomycota and Basidiomycota. Two statistical methods were used: a paired t test for feature selection and PCA to reduce the dimensionality of the features. The classifiers were evaluated and ranked using seven classifier performance metrics, a randomized complete block design, pairwise Tukey’s honestly significant difference tests and the technique for order preference by similarity to ideal solution, together with a modified k-fold cross validation. The effect of training data size on classifier performance was also investigated. The results showed that the rank and performance of the classifiers depended on the training data size. For training data sizes of 80, 60, 40 and 20%, the highest average overall accuracies were 96.96, 95.81, 94.47 and 92.47%, obtained with the KNN, KNN, ensemble and ensemble classifiers, respectively.
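To make the idea of classifying sequences without alignment more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each sequence is summarized by its 20-dimensional amino acid composition (the abstract does not state which features were used), and it applies a plain k-nearest-neighbor vote with Euclidean distance. The function names, the value of k and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the amino acid frequency vector of a sequence (length 20).

    The vector has the same size for every sequence, so sequences of
    unequal length can be compared without aligning them.
    """
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training sequences."""
    q = composition(query)
    nearest = sorted(
        train, key=lambda item: math.dist(composition(item[0]), q)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy sequences of unequal length; these are NOT real
# hexosaminidase sequences and the labels are purely illustrative.
train = [
    ("AAAAGGAAAL", "Ascomycota"),
    ("AAGAAAGAAAAAV", "Ascomycota"),
    ("GAAAAAAG", "Ascomycota"),
    ("KKKKEEKKKD", "Basidiomycota"),
    ("EKKKKEKKKKKR", "Basidiomycota"),
    ("KKEKKKK", "Basidiomycota"),
]

print(knn_predict(train, "AAAGAAAAAAAAGA"))
print(knn_predict(train, "KEKKKKKKE"))
```

In this sketch the query length never matters, because only residue frequencies enter the distance. A fuller pipeline along the lines of the abstract would add feature selection, PCA and cross validation around this step.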

Abbreviations

ANN: Artificial neural network
ANOVA: Analysis of variance
ARB: Adaptive rule-based
AUC: Area under an ROC curve
DNA: Deoxyribonucleic acid
Ens: Ensemble classifier
FH: Fungal hexosaminidases
FN: False negatives (number of positive samples incorrectly classified as negative)
FP: False positives (number of negative samples incorrectly classified as positive)
HSD: Honestly significant difference
KNN: K-nearest neighbor
MCC: Matthews correlation coefficient
MCDM: Multi-criteria decision-making
MLP: Multilayer perceptron
NB: Naïve Bayes
PC: Principal component
PCA: Principal component analysis
PNN: Probabilistic neural network
Poly2: Polynomial degree 2
Poly3: Polynomial degree 3
PSO: Particle swarm optimization
RBF: Radial basis function
RCBD: Randomized complete block design
RF: Random forest
SOFM: Self-organizing feature map
SOM: Self-organizing map
SST: Total sum of squares
SSW: Within-groups sum of squares
SVM: Support vector machine
TDS: Training data size
TLCF: Two-layer classification framework
TN: True negatives (number of negative samples correctly classified as negative)
TOPSIS: Technique for order preference by similarity to ideal solution
TP: True positives (number of positive samples correctly classified as positive)
YI: Youden’s index
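Several of the abbreviations above (TP, TN, FP, FN, MCC, YI) relate to binary classification metrics. As a rough illustration, the sketch below computes some standard metrics from confusion-matrix counts using their textbook definitions. The abstract does not say which seven metrics the study used, so this selection is an assumption, and the function name and the counts are hypothetical.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute common binary classification metrics from confusion counts.

    Assumes both classes are present, i.e. tp + fn > 0 and tn + fp > 0.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        # Youden's index: sensitivity + specificity - 1
        "youden_index": sensitivity + specificity - 1,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
m = binary_metrics(tp=45, tn=40, fp=10, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these illustrative counts, sensitivity is 45/50 = 0.9, specificity is 40/50 = 0.8, accuracy is 85/100 = 0.85 and Youden’s index is 0.9 + 0.8 - 1 = 0.7.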

Acknowledgements

Financial support from the Vice President for Research and Technology of Ferdowsi University of Mashhad is highly appreciated.

Corresponding author

Correspondence to Abbas Rohani.

Cite this article

Rohani, A., Mamarabadi, M. Free alignment classification of dikarya fungi using some machine learning methods. Neural Comput & Applic 31, 6995–7016 (2019). https://doi.org/10.1007/s00521-018-3539-5

  • DOI: https://doi.org/10.1007/s00521-018-3539-5
