Abstract
Predicting novel genes is an important topic in bioinformatics. For annotating protein-coding regions in novel DNA sequences, de novo prediction techniques are more powerful than homology-based techniques. In this article, a new gene finding technique is proposed to predict protein-coding regions in DNA sequences. The technique is based on spectral analysis of DNA sequences and computation of the period-3 spectrum. The effectiveness of each nucleotide is analyzed to study how the strength of the nucleotides in the period-3 spectrum varies between protein-coding and non-coding regions. The proposed technique uses a dynamic representation scheme to map DNA sequences into numerical form. This scheme differentiates better between coding and non-coding regions because it enhances the participation of the nucleotides that are effective in the period-3 spectrum. The technique also uses post-processing, rather than thresholding, to detect the period-3 spectrum peaks. The proposed technique is compared with other spectrum-based techniques by plotting receiver operating characteristic (ROC) curves and calculating the area under the curve (AUC). The results show that the proposed technique outperforms the other spectrum-based techniques. In addition, we analyze the false positive peaks that result from the prediction by scanning for stop-codon-like triplet combinations in the three possible reading frames. This analysis provides insight for future work on improving the performance of spectrum-based gene finding techniques.
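As a rough illustration of what a period-3 spectrum is, the sketch below computes the sliding-window power at frequency 1/3 using the conventional fixed binary-indicator mapping (one 0/1 sequence per nucleotide). This is not the dynamic representation scheme proposed in the article; the function name `period3_spectrum`, the window length of 351, and the step of 3 are assumptions chosen only for the example.

```python
import numpy as np


def period3_spectrum(seq, window=351, step=3):
    """Sliding-window spectral power at frequency 1/3.

    Uses four fixed binary indicator sequences (A, C, G, T), which is an
    illustrative baseline mapping and not the article's dynamic scheme.
    Returns the window start positions and the summed power per window.
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3")
    seq = seq.upper()
    n = len(seq)
    if n < window:
        raise ValueError("sequence is shorter than the window")

    # DFT kernel for bin k = window / 3, i.e. a period of three bases.
    kernel = np.exp(-2j * np.pi * np.arange(window) / 3)

    # One 0/1 indicator array per nucleotide.
    indicators = [
        np.array([c == base for c in seq], dtype=float) for base in "ACGT"
    ]

    starts = np.arange(0, n - window + 1, step)
    power = np.zeros(len(starts))
    for i, s in enumerate(starts):
        for u in indicators:
            # Squared magnitude of the DFT coefficient at frequency 1/3.
            power[i] += abs(np.dot(u[s:s + window], kernel)) ** 2
    return starts, power
```

Windows that overlap a region with strong three-base periodicity should give markedly higher values than windows without it. Peak detection, whether by thresholding or by the post-processing the article describes, would operate on the returned `power` array.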
Acknowledgments
Dr. Stefan C. Kremer provided helpful feedback on the manuscript. The author thanks the authors of the other methods, who provided guidelines for re-implementing their methods.
Cite this article
Marhon, S.A. Nucleotide distribution variance-based dynamic representation scheme for novel gene prediction. Netw Model Anal Health Inform Bioinforma 4, 31 (2015). https://doi.org/10.1007/s13721-015-0103-4
DOI: https://doi.org/10.1007/s13721-015-0103-4