A novel gene identification algorithm with Bayesian classification
Introduction
The rapid advances in the field of computational genomics and bioinformatics have motivated the development of innovative engineering methods for data acquisition, interpretation, and analysis. Techniques from information theory [1], [2], [3], [4], [5], [6], [7], communications [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], coding theory [5], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], signal processing [31], [32], [33], [34], [35], [36], machine learning [37] and various statistical methods [38], [39], [40], [41], [42], [43], [44] have been actively researched for use in gene detection, genomic sequence analysis and alignment. The analyses made possible by these methods allow the testing of different biological aspects related to the process of gene expression. For example, they can help in determining whether certain regions of a given genome are protein-coding sequences (i.e. gene detection). These methods promote new interdisciplinary collaborations in research and education, integrating biomedical engineering, electrical engineering and life sciences. The knowledge gained can help address fundamentally important issues that cannot be explored systematically and quantitatively by experimentation alone. Moreover, it can reduce the consumption of laboratory resources, minimize time-consuming laboratory experiments and lead to a better understanding of the complex genetic processes.
A DNA sequence can be divided into two types of regions: genic and intergenic spaces. Genes are the segments of DNA that contain the coding information required for protein synthesis. A major goal of genomic research is to understand the nature and role of the coding and noncoding information embedded in the DNA sequence structure. A crucial step in attaining this goal is the detection of the gene locations in the entire DNA sequence. Several diverse methods have been proposed in the literature for gene detection in prokaryotes. Examples include probabilistic methods [45], [46], statistical methods [47], [48], [49], and other computational techniques including machine learning [50], free energy calculations [51], support vector machines [52], Bayesian methods [33], information theory [53], hidden Markov models such as GeneMark [38], [39], [40], [43], [44], [54], and interpolated Markov models such as GLIMMER [55].
The algorithm employed in GLIMMER has been described in detail in Ref. [56]. The method employs an interpolated Markov model to identify coding regions. Specifically, the algorithm first identifies open reading frames (ORFs) of sufficient length that are most likely to be coding, which gives an initial model of the coding regions of the organism. This information is subsequently used in a Markov chain in order to locate all other coding sequences. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark, on the other hand, is an HMM-like algorithm. It introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequences, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in the two DNA strands simultaneously. The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of being “protein-coding” (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or of being “non-coding”.
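As an illustration of the six reading frames and of the ORF-collection step mentioned above, the following minimal sketch enumerates candidate ORFs on both strands. It is not the GLIMMER or GeneMark implementation: the function names, the single start codon ATG, the three standard stop codons, and the `min_len` threshold are assumptions made only for this example.

```python
# Hypothetical sketch: enumerate candidate ORFs in all six reading frames.
# Assumptions (not from the paper): ATG is the only start codon, the stop
# codons are TAA/TAG/TGA, and the sequence contains only A, C, G, T.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=9):
    """Return (strand, frame, start, end) for every ORF of at least min_len bases.

    start and end index into the scanned strand (the reverse complement for
    strand "-"); end is exclusive and the ORF includes its stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Step through the strand one codon at a time in this frame.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # close the ORF and keep scanning
    return orfs
```

A gene finder would then score such candidates with a statistical model; the sketch stops at enumeration.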
The design of a general gene identification algorithm is a compelling research problem. The gene identification method presented in this work exploits a particular property of the DNA protein-coding regions, namely the period-3 property [57], [58], [59], [60], [61], [62], [63], in a novel approach using concepts from communications theory. The period-3 pattern is generally considered a strong indication of coding regions. Once DNA sequences are mapped to digital signals, standard digital signal processing (DSP) techniques such as the discrete Fourier transform (DFT) [57], [58], [62], digital filtering [64], [65], wavelet transforms [66], Markov modeling [67] and IIR filtering [63] have shown good performance in detecting this period-3 behavior and, therefore, in identifying coding regions. However, the efficiency of these methods in detecting the period-3 property and suppressing the background noise is obtained at the expense of increased computational complexity. In our previous work [35], we proposed a novel algorithm for identifying protein-coding regions in DNA sequences based on the period-3 property. The algorithm in Ref. [35] identifies protein-coding regions by applying a digital correlating and filtering process to the entire genomic sequence under study. The algorithm proposed in this work is both an enhanced and a generalized version of the work in Ref. [35] in terms of methodology, performance, and experimental validation.
This paper proposes a novel application of principles and techniques from communications theory and digital signal processing for the detection and identification of protein-coding regions in prokaryotic genomes. The proposed algorithm employs polyphase complex mapping to provide a numerical representation of the genomic sequences involved in the analysis. It then uses basic concepts from communications theory and digital signal processing, such as correlation, maximal ratio combining (MRC) algorithms and filtering, to generate a signal whose peaks signify locations of coding regions and whose troughs signify locations of noncoding regions. The proposed gene detection algorithm is applied to the complete genome sequences of several prokaryotes (e.g. MG1655 and O157H7 E. coli bacterial strains). Moreover, two Bayesian classifiers are designed to evaluate the performance of the proposed gene detection algorithm and compare it to well-known gene detection methods in prokaryotes. The simulation results show that the proposed algorithm can efficiently and accurately identify protein-coding regions with sensitivity and specificity values comparable to those of well-known gene detection methods in prokaryotes such as GLIMMER and GeneMark. This further supports the relevance of using communications theory concepts for genomic sequence analysis. Moreover, unlike the DFT method described in Ref. [57], the proposed algorithm does not require any prior information about the coding regions. It can sharply extract the period-3 component and hence effectively identify protein-coding regions in the whole genomic sequences of prokaryotes. In addition, it can effectively suppress the background noise with no added computational complexity.
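To make the mapping-then-correlation idea concrete, the sketch below maps bases to complex phases, correlates the resulting signal with repeated-codon templates, and combines the branches. Everything specific here is an assumption for illustration: the paper's actual polyphase mapping, its twenty-four subsequences, and its MRC weights are not given in this excerpt. The phase assignment in `PHASE_MAP` is a QPSK-style guess, and the combining step is plain square-law combining used as a simplified stand-in for MRC.

```python
# Hypothetical sketch of "map, correlate, combine"; not the authors' algorithm.
# Assumed mapping (not from the paper): the four bases on the unit circle.
PHASE_MAP = {"A": 1 + 0j, "C": 1j, "G": -1 + 0j, "T": -1j}


def to_signal(seq):
    """Map a DNA string to a list of unit-magnitude complex samples."""
    return [PHASE_MAP[base] for base in seq]


def sliding_correlation(signal, template):
    """Normalized correlation magnitude of the template with each window."""
    m = len(template)
    out = []
    for start in range(len(signal) - m + 1):
        acc = sum(signal[start + k] * template[k].conjugate() for k in range(m))
        out.append(abs(acc) / m)  # 1.0 means a perfect match
    return out


def combined_score(seq, codons, repeats):
    """Square-law combine one correlation branch per repeated-codon template.

    Each template is a codon repeated `repeats` times, i.e. a period-3
    sequence. Square-law combining stands in for MRC in this sketch.
    """
    x = to_signal(seq)
    branches = [sliding_correlation(x, to_signal(c * repeats)) for c in codons]
    return [sum(v * v for v in vals) for vals in zip(*branches)]
```

For example, `combined_score("ATG" * 4, ["ATG"], 2)` returns seven window scores that peak at every third offset, where the window aligns with the template.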
The rest of this paper is organized as follows. Section 2 highlights the so-called period-3 behavior and how it can be detected using digital signal processing techniques (such as the DFT) to locate protein-coding regions in the genomic structure. Section 3 gives a detailed mathematical description of the proposed gene detection algorithm. The method used to detect peaks (corresponding to coding regions) and troughs (corresponding to noncoding regions) is described in Algorithm I. Section 3 also describes two period-3 based Bayesian classifiers that are designed to evaluate the performance of the proposed algorithm. The period-3 based classification system is described in Algorithm II. Simulation results are shown and discussed in Section 4. Finally, the paper is concluded in Section 5.
Section snippets
Protein coding region identification using the period-3 property
Many studies have emphasized that the coding regions of DNA possess a period-3 property caused by codon biases in the translation of codons into amino acids. This fundamental characteristic is not detected outside the coding regions, and hence can be utilized to locate the coding regions in the entire genomic structure [58], [62]. This observation can be traced back to the work of Trifonov and Sussman [68].
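A common way to quantify the period-3 property is to evaluate the DFT at frequency index k = N/3. The sketch below does this for the four binary indicator sequences of a DNA string and sums the squared magnitudes. The indicator-sequence mapping and the function name `period3_power` are assumptions for illustration; the excerpt does not specify which mapping the cited DFT methods use.

```python
import cmath


def period3_power(seq):
    """Sum over A, C, G, T of |U_b[N/3]|^2.

    U_b is the DFT of the binary indicator sequence of base b. At k = N/3 the
    DFT kernel exp(-j*2*pi*k*n/N) reduces to exp(-j*2*pi*n/3), so only three
    distinct twiddle factors are needed.
    """
    n = len(seq) - len(seq) % 3  # truncate so that N is a multiple of 3
    w = [cmath.exp(-2j * cmath.pi * r / 3) for r in range(3)]
    total = 0.0
    for base in "ACGT":
        coeff = sum(w[i % 3] for i in range(n) if seq[i] == base)
        total += abs(coeff) ** 2
    return total
```

A sequence with perfect period 3, such as a repeated codon, yields a large value, whereas a sequence whose bases are spread evenly over the three codon positions yields a value near zero. In practice the measure is computed over a sliding window along the genome.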
The period-3 property implies a clear short-range correlation behavior in
The proposed gene detection algorithm
Fig. 3 shows a schematic system-like representation of the proposed algorithm. The input parameter of the proposed detection algorithm is the genomic sequence under study, of length . The output parameters are three sequences: ., and whose lengths are the same as the length of the input test sequence . The sequence represents the correlation of the genomic sequence with twenty-four hypothetically generated period-3 based subsequences after passing through a maximal
Simulation results and analysis
In order to demonstrate the fidelity and biological significance of the proposed gene detection algorithm, it is applied to several prokaryotic genome sequences. For example, the complete genome sequences of Escherichia coli bacterial strains MG1655 and O157:H7 are used as input test sequences. These sequences are available at the NCBI [77].
The length of each one of the twenty-four hypothetical period-3 based subsequences, , is selected to be 1950 (i.e. the number of repetitions, , is
Conclusions
This work proposes a novel application of principles and concepts from communications theory and digital signal processing for the detection of protein-coding regions in prokaryotic genomes. The proposed gene detection algorithm employs polyphase complex mapping to provide a numerical representation of the genomic sequences involved in the analysis, and then uses basic concepts from communications theory and digital signal processing such as correlation, maximal ratio combining (MRC) algorithms and
References (78)
et al. Application of information theory to DNA sequence analysis: a review. Pattern Recognit. (1996)
et al. Mapping of statistical physics to information theory with application to biological systems. J. Theor. Biol. (2000)
Information theory in molecular biology. Phys. Life Rev. (2004)
et al. An error-correcting code framework for genetic sequence analysis. J. Franklin Inst.-Eng. Appl. Math. (2004)
et al. Coding theory based models for protein translation initiation in prokaryotic organisms. Biosystems (2004)
et al. Is there an error correcting code in the base sequence in DNA? Biophys. J. (1996)
et al. Wavelet based fractal analysis of DNA sequences. Phys. D-Nonlinear Phenom. (1996)
et al. GENEMARK: parallel gene recognition for both DNA strands. Comput. Chem. (1993)
Hidden Markov models. Curr. Opin. Struct. Biol. (1996)
et al. GeneLook: a novel ab initio gene identification system suitable for automated annotation of prokaryotic sequences. Gene (2005)
Measuring molecular information. J. Theor. Biol.
Size-dependence of three-periodicity and long-range correlations in DNA sequences. Phys. Lett. A
The gene identification problem: an overview for developers. Comput. Chem.
Wavelet transforms for the characterization and detection of repeating motifs. J. Mol. Biol.
Systems and transforms with applications in optics. J. Franklin Inst.
Evaluation of gene structure prediction programs. Genomics
Information Theory and Molecular Biology
Information theory and error-correcting codes in genetics and biological evolution. Introd. to Biosemiotics New Biol. Synth
Information-theoretic bounds of evolutionary processes modeled as a protein communication system. IEEE Work. Stat. Signal Process. Proc.
Information and communication theory in molecular biology. Electr. Eng.
TFBS detection algorithm using distance metrics based on center of mass and polyphase mapping, 2012. 7th Int. Symp. Heal. Informatics Bioinforma
Analysis of Genomic Translation Using a Communications Theory Approach
Communication theory and molecular biology at the crossroads. IEEE Eng. Med. Biol. Mag.
Ribosome binding model using a codebook and exponential metric, 2007. IEEE Int. Conf. Electro/Inform. Technol.
Effect of mutations on the detection of translational signals based on a communications theory approach. Conf. Proc. IEEE Eng. Med. Biol. Soc.
Analysis of gene translation using a communications theory approach. Adv. Exp. Med. Biol.
Gene expression analysis using communications, coding and information theory based models, BIOCOMP’09–2009. Int. Conf. Bioinf. Comput. Biol.
Applying techniques from frame synchronization for biological sequence analysis. IEEE Int. Conf. Commun.
Analysis of Coding Theory Based Models for Initiating Protein Translation in Prokaryotic Organisms
On genomic coding theory. Eur. Trans. Telecommun.
Examining coding structure and redundancy in DNA. IEEE Eng. Med. Biol. Mag.
Why nature chose A, C, G and U/T: an error-coding perspective of nucleotide alphabet composition. Orig. Life Evol. Biosph.
The quest for error correction in biology. IEEE Eng. Med. Biol. Mag.
On circular coding properties of gene and protein sequences. Croat. Chem. Acta
Cited by (5)
Communications Theory-Inspired Algorithms for Detecting Protein-Coding Regions in Prokaryotic Genomes: A Comparative Study: Using Communications Theory for Gene Detection in Prokaryotic Genomes: A Comparative Analysis of Correlation-Based Algorithms and Bayesian Classifiers
2023, ACM International Conference Proceeding Series
Identification of Coding Regions in Prokaryotic DNA Sequences Using Bayesian Classification
2020, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Parallel scheduling algorithm with improved Bayesian classification algorithm
2017, Wutan Huatan Jisuan Jishu
Reweighting forest for extreme multi-label classification
2017, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Mohammad F. Al Bataineh was born in Irbid, Jordan in 1979. He received his B.S. degree in Telecommunications Engineering with high honors from Yarmouk University, Jordan, in 2003. He received his M.S. and PhD degrees in Electrical Engineering with excellent distinction from Illinois Institute of Technology (IIT) in 2006 and 2010, respectively. His research interests are focused on the application of communications, coding theory, and information theory concepts to the interpretation and understanding of information flow in biological systems such as gene expression. Since September 2010, Mohammad Al Bataineh has been with the Telecommunications Engineering Department at Yarmouk University, Jordan, where he is currently an assistant professor. He teaches undergraduate courses in Signals and Systems, Analog Communications, Digital Communications, and Probability and Random Processes, as well as graduate courses in Digital Signal Processing and in Information Theory and Coding.
Zouhair Al-qudah was born in Irbid, Jordan in 1979. He received his B.S. degree in Telecommunications Engineering from Yarmouk University, Jordan, in 2002. He received his M.S. degree in Electrical Engineering, with emphasis on digital communications and signal processing for wireless communication, from Kalmar University College, Sweden in 2006. He received his PhD degree in Electrical Engineering from Southern Methodist University at Dallas, Texas in 2013. Since August 2013, he has been with Al-Hussein Bin Talal University at Ma'an, Jordan, where he is currently an Assistant Professor. His research interests span various aspects of multipath fading channels, including multiuser information theory, interference cancellation techniques, and practical coding techniques for the dirty paper problem.