Abstract
Digital signal processing has recently been widely applied in genomics. One such application is the identification of protein-coding regions. Where is a protein coded? How much is encoded? Where are growth and development regulated? These questions can be answered by classifying the regions of a DNA sequence as exons or introns. Signal processing methods operate on numerical signals, whereas a DNA sequence is symbolic; the sequence must therefore be converted from a symbolic to a numeric sequence during data preprocessing, prior to analysis. The bases in a DNA sequence are represented by four letters, A, G, C and T, and each letter is assigned a numeric value. Several numerical mapping techniques exist in the literature. In this paper, a novel numerical mapping approach is proposed for converting the symbol string to numerical values. In this approach, each codon is mapped using an improved fractional derivative of the Shannon entropy equation. Three methods are used to predict exon regions: singular value decomposition (SVD), the discrete Fourier transform (DFT) and the short-time Fourier transform (STFT). The performance of the proposed mapping technique is evaluated with these three methods. Under SVD, DFT and STFT, the proposed technique identified protein-coding regions more successfully than the predominant existing mapping techniques.
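The pipeline described above (symbolic sequence, codon-wise numeric mapping, spectral analysis) can be illustrated with a minimal sketch. The abstract does not give the improved fractional-derivative formula, so the code below substitutes the plain Shannon term -p log2(p) computed from each codon's relative frequency in the sequence; this is a stand-in, not the authors' mapping. The function names `codon_entropy_map` and `period3_power`, the overlapping codon reading with a stride of one base, and the toy sequence are all assumptions made for illustration. The second function evaluates a single DFT bin at one third of a cycle per sample, which is the period-3 component commonly used in DFT-based exon prediction.

```python
import math
from collections import Counter

import numpy as np


def codon_entropy_map(seq):
    """Map each overlapping codon to the plain Shannon term -p * log2(p).

    Assumption: p is the codon's relative frequency within `seq` itself,
    and codons are read with a stride of one base. This is NOT the paper's
    fractional-derivative formula, which the abstract does not state.
    """
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(len(seq) - 2)]
    total = len(codons)
    counts = Counter(codons)
    term = {c: -(n / total) * math.log2(n / total) for c, n in counts.items()}
    return np.array([term[c] for c in codons], dtype=float)


def period3_power(signal):
    """Normalised power of the DFT component at 1/3 cycle per sample."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    if n == 0:
        return 0.0
    x = x - x.mean()  # remove the DC offset before measuring the peak
    k = np.arange(n)
    component = np.sum(x * np.exp(-2j * np.pi * k / 3))
    return float(abs(component) ** 2 / n)


if __name__ == "__main__":
    toy = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # arbitrary toy sequence
    signal = codon_entropy_map(toy)
    print(len(signal), period3_power(signal))
```

In practice the period-3 power would be computed over a sliding window along the sequence, and regions where it peaks would be treated as candidate exons; the window length and decision threshold are not given in the abstract and would need to be chosen separately.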
Ethics declarations
Conflict of interest
There is no conflict of interest.
Cite this article
Das, B., Turkoglu, I. A novel numerical mapping method based on entropy for digitizing DNA sequences. Neural Comput & Applic 29, 207–215 (2018). https://doi.org/10.1007/s00521-017-2871-5
DOI: https://doi.org/10.1007/s00521-017-2871-5