Nucleotide distribution variance-based dynamic representation scheme for novel gene prediction

  • Original Article
Network Modeling Analysis in Health Informatics and Bioinformatics

Abstract

Predicting novel genes is an important topic in bioinformatics. De novo protein coding region prediction techniques are more powerful than homology-based techniques for annotating protein coding regions in novel DNA sequences. In this article, a new gene finding technique is proposed to predict protein coding regions in DNA sequences. The technique is based on the spectral analysis of DNA sequences and the calculation of the period-3 spectrum. The effectiveness of each nucleotide is analyzed by studying the variance in its strength in the period-3 spectrum across protein coding and non-coding regions. The proposed technique uses a dynamic representation scheme to map DNA sequences into a numerical form. This scheme differentiates coding from non-coding regions more clearly because it increases the contribution of the nucleotides that are effective in the period-3 spectrum. The technique also uses post-processing, rather than thresholding, to detect the peaks of the period-3 spectrum. The proposed technique is compared with other spectrum-based techniques by plotting receiver operating characteristic (ROC) curves and calculating the area under the curve (AUC). The results show that the proposed technique outperforms the other spectrum-based techniques. In addition, we analyze the false positive peaks that result from the prediction by scanning for stop-codon-like triplet combinations in the three possible reading frames. This analysis provides insight for future work on improving the performance of spectrum-based gene finding techniques.
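As a rough illustration of the period-3 spectrum that the abstract refers to, the sketch below computes the spectral power at frequency 1/3 over a sliding window. It uses the classic fixed binary-indicator mapping (one 0/1 sequence per nucleotide), not the dynamic representation scheme proposed in the article, whose weights are not given in the abstract. The function names, the window length, and the step size are assumptions made only for this example.

```python
import cmath

# Phase factors of the DFT at frequency 1/3; they repeat every three positions.
_W = cmath.exp(-2j * cmath.pi / 3)
_PHASES = (1 + 0j, _W, _W * _W)


def period3_power(window):
    """Total power at frequency 1/3 for one window of DNA.

    Each nucleotide gets a binary indicator sequence (1 where it occurs,
    0 elsewhere); the squared DFT magnitudes of the four indicator
    sequences at frequency 1/3 are summed.
    """
    window = window.upper()
    total = 0.0
    for base in "ACGT":
        coeff = sum(_PHASES[n % 3] for n, ch in enumerate(window) if ch == base)
        total += abs(coeff) ** 2
    return total


def period3_profile(seq, window=351, step=3):
    """Slide a window along seq; return (start, power) pairs.

    The defaults for window and step are illustrative assumptions,
    not values taken from the article.
    """
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a strictly 3-periodic stretch flanked by homopolymer runs.
    toy = "A" * 90 + "ATG" * 30 + "A" * 90
    best_start, best_power = max(period3_profile(toy, window=90), key=lambda p: p[1])
    print(best_start, round(best_power, 3))
```

On real sequences, windows with high values in such a profile are candidate coding regions. How the peaks are then detected (thresholding or the post-processing mentioned in the abstract) is outside the scope of this sketch.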

Acknowledgments

Dr. Stefan C. Kremer provided helpful feedback on the manuscript. The author of the manuscript would like to thank the authors of the other methods who provided guidelines to re-implement their methods.

Author information

Corresponding author

Correspondence to Sajid A. Marhon.

About this article

Cite this article

Marhon, S.A. Nucleotide distribution variance-based dynamic representation scheme for novel gene prediction. Netw Model Anal Health Inform Bioinforma 4, 31 (2015). https://doi.org/10.1007/s13721-015-0103-4

  • DOI: https://doi.org/10.1007/s13721-015-0103-4
