
Identification of Coding Regions in Prokaryotic DNA Sequences Using Bayesian Classification

  • Conference paper
Bioinformatics and Biomedical Engineering (IWBBIO 2020)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12108)

Abstract

Identifying protein-coding regions in genomic DNA sequences is a well-known problem in computational genomics, and a variety of computational algorithms have been applied to it. Rapid advances in the field have motivated innovative engineering methods that support further analysis and modeling of many processes in molecular biology. The proposed algorithm uses well-known concepts from communications theory, such as correlation, the maximal ratio combining (MRC) algorithm, and filtering techniques, to create a signal whose maxima and minima indicate coding and noncoding regions, respectively. The algorithm is applied to several prokaryotic genome sequences, and two Bayesian classifiers are designed to test and evaluate its performance. The simulation results show that the algorithm detects protein-coding regions efficiently and accurately, with sensitivity and specificity values comparable to those of well-known gene detection methods for prokaryotes. The results further support the correctness and biological relevance of applying communications theory concepts to genomic sequence analysis.
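As an illustration of the evaluation step only, the sketch below shows how a two-class Bayesian classifier might label positions of a detection signal as coding or noncoding, and how sensitivity and specificity could then be computed. It is not the classifier design from the paper: the one-dimensional Gaussian class model, the function names, and the toy values are all assumptions made for this example. Specificity is taken here as TN / (TN + FP); some gene-prediction evaluations report TP / (TP + FP) under the same name.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_bayes(values, labels):
    """Estimate (prior, mean, variance) of a 1-D feature for classes 0 and 1.

    Assumption for this sketch: each class is modeled by one Gaussian, and
    both classes occur in the training labels.
    """
    params = {}
    for cls in (0, 1):
        xs = [v for v, y in zip(values, labels) if y == cls]
        # Floor the variance so a constant class does not cause division by zero.
        params[cls] = (len(xs) / len(values), mean(xs), max(pvariance(xs), 1e-9))
    return params


def log_posterior(x, prior, mu, var):
    """Unnormalized log posterior: log prior plus Gaussian log likelihood."""
    return math.log(prior) - 0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def classify(x, params):
    """Return 1 (coding) if its posterior is at least that of 0 (noncoding)."""
    return int(log_posterior(x, *params[1]) >= log_posterior(x, *params[0]))


def sensitivity_specificity(predicted, actual):
    """Return (TP / (TP + FN), TN / (TN + FP)) for binary label sequences.

    Assumes both classes are present in `actual`.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy signal values (hypothetical): high values stand for coding positions.
train_values = [0.9, 0.8, 0.85, 0.1, 0.2, 0.15]
train_labels = [1, 1, 1, 0, 0, 0]
model = fit_gaussian_bayes(train_values, train_labels)

test_values = [0.7, 0.3, 0.95, 0.05]
test_labels = [1, 0, 1, 0]
predictions = [classify(v, model) for v in test_values]
sens, spec = sensitivity_specificity(predictions, test_labels)
```

In practice the feature would be the signal produced by the correlation, MRC, and filtering stages, and the labels would come from an annotated genome; both are replaced by toy numbers here.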



Author information

Corresponding author

Correspondence to Mohammad Al Bataineh.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Al Bataineh, M. (2020). Identification of Coding Regions in Prokaryotic DNA Sequences Using Bayesian Classification. In: Rojas, I., Valenzuela, O., Rojas, F., Herrera, L., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2020. Lecture Notes in Computer Science, vol 12108. Springer, Cham. https://doi.org/10.1007/978-3-030-45385-5_1

  • DOI: https://doi.org/10.1007/978-3-030-45385-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-45384-8

  • Online ISBN: 978-3-030-45385-5

  • eBook Packages: Computer Science, Computer Science (R0)
