Abstract
The identification of protein-coding regions in genomic DNA sequences is a well-known problem in computational genomics, and a variety of computational algorithms have been applied to it. Rapid advances in this field have motivated the development of engineering methods that support further analysis and modeling of many processes in molecular biology. The proposed algorithm uses well-known concepts from communications theory, such as correlation, maximal ratio combining (MRC), and filtering, to create a signal whose maxima and minima indicate coding and noncoding regions, respectively. The algorithm is applied to several prokaryotic genome sequences, and two Bayesian classifiers are designed to test and evaluate its performance. The simulation results show that the algorithm detects protein-coding regions efficiently and accurately, as demonstrated by sensitivity and specificity values comparable to those of well-known gene detection methods for prokaryotes. The results further support the correctness and biological relevance of applying communications theory concepts to genomic sequence analysis.
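The abstract reports performance as sensitivity and specificity but does not state how they are computed. As a minimal sketch, assuming a per-nucleotide comparison between annotated and predicted coding labels and the standard statistical definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)), the evaluation could look as follows. The function name and the toy labels are hypothetical and are not taken from the paper; some gene-finding studies instead report specificity as TP / (TP + FP), which would change the second value.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    truth: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Per-nucleotide sensitivity and specificity (hypothetical helper).

    truth[i] is True if position i is annotated as coding;
    predicted[i] is True if the classifier labels position i as coding.
    Assumes the standard statistical definitions, not necessarily the
    ones used in the paper.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fn = fp = tn = 0
    for t, p in zip(truth, predicted):
        if t and p:
            tp += 1  # coding, predicted coding
        elif t and not p:
            fn += 1  # coding, missed
        elif not t and p:
            fp += 1  # noncoding, falsely predicted coding
        else:
            tn += 1  # noncoding, predicted noncoding

    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy example: 10 positions, 5 annotated as coding (illustrative data only).
truth = [False, True, True, True, False, False, True, True, False, False]
predicted = [False, True, True, False, False, True, True, True, False, False]
sens, spec = sensitivity_specificity(truth, predicted)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy case one coding position is missed and one noncoding position is falsely flagged, so both measures fall below 1.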
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Al Bataineh, M. (2020). Identification of Coding Regions in Prokaryotic DNA Sequences Using Bayesian Classification. In: Rojas, I., Valenzuela, O., Rojas, F., Herrera, L., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2020. Lecture Notes in Computer Science, vol 12108. Springer, Cham. https://doi.org/10.1007/978-3-030-45385-5_1
DOI: https://doi.org/10.1007/978-3-030-45385-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-45384-8
Online ISBN: 978-3-030-45385-5
eBook Packages: Computer Science (R0)