Abstract
Computational localization of transcription factor binding sites (TFBSs, also termed motif instances) in DNA sequences helps biologists save experimental cost and time in motif discovery. The task can be formulated as a feature-based object location identification problem, which differs markedly from traditional pattern recognition tasks. This paper develops a machine learning approach to TFBS location prediction using feature-based classifiers. Specific features are extracted to characterize TFBSs and distinguish them from random k-mers. A sampling technique is then employed to generate dummy positives in the feature space to improve prediction performance. Three learner models are examined, and a simple ensemble method is adopted in the classifier design. Experimental results on eight benchmark datasets demonstrate that the proposed techniques have good potential for conserved motif detection. Comparative studies indicate that the extreme learning machine-based ensemble classifier outperforms the other learner models in terms of overall prediction accuracy and computational complexity.
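The abstract does not say which sampling technique generates the dummy positives. As a rough illustration only, the sketch below assumes SMOTE-style interpolation between known positives in feature space; the function name `synthesize_positives`, the neighbour count `k`, and the 2-D feature vectors are all hypothetical and are not taken from the paper.

```python
import random


def synthesize_positives(positives, n_new, k=3, seed=0):
    """Create n_new synthetic feature vectors by interpolating between a
    randomly chosen positive and one of its k nearest positive neighbours.

    This is an assumed SMOTE-style scheme, not the paper's stated method.
    """
    rng = random.Random(seed)
    if len(positives) < 2:
        raise ValueError("need at least two positive samples")
    k = min(k, len(positives) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(positives))
        base = positives[i]
        # Indices of the k nearest other positives to the base sample.
        neighbours = sorted(
            (j for j in range(len(positives)) if j != i),
            key=lambda j: sq_dist(base, positives[j]),
        )[:k]
        other = positives[rng.choice(neighbours)]
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical 2-D feature vectors for four known binding sites.
positives = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.95, 0.05]]
dummy = synthesize_positives(positives, n_new=6)
```

Each synthetic vector lies on the line segment between two real positives, so the enlarged positive class stays inside the region already occupied by known binding sites while reducing the imbalance against the much larger set of random k-mers.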
Cite this article
Wang, D., Do, H.T. Computational localization of transcription factor binding sites using extreme learning machines. Soft Comput 16, 1595–1606 (2012). https://doi.org/10.1007/s00500-012-0820-x
DOI: https://doi.org/10.1007/s00500-012-0820-x