A novel gene identification algorithm with Bayesian classification

https://doi.org/10.1016/j.bspc.2016.07.002

Highlights

  • A novel gene detection algorithm for prokaryotes is proposed.

  • The algorithm applies concepts and principles in communications theory and DSP to identify coding and noncoding regions.

  • The proposed algorithm is applied to several prokaryotic genome sequences.

  • Two Bayesian classifiers are designed to evaluate the performance of the proposed gene detection algorithm.

  • The algorithm performance is comparable to well-known ab initio gene detection methods such as GLIMMER and GeneMark.

Abstract

The rapid advances in the field of computational genomics and bioinformatics have motivated the development of innovative engineering methods for data acquisition, interpretation, and analysis. With the help of these methods, many processes in molecular biology can be modeled and further analyzed. The identification of coding regions in the genomic structure using computational algorithms is a clear example of such processes. This work proposes a novel application of well-known principles and concepts from communications theory and digital signal processing for the detection of protein coding regions in prokaryotic genomes. The proposed algorithm employs a polyphase complex mapping scheme to provide a numerical representation of the genomic sequences involved in the analysis. It then utilizes concepts from communications theory such as correlation, the maximal ratio combining (MRC) algorithm, and filtering to generate a signal whose peaks and troughs signify coding and noncoding regions, respectively. The proposed algorithm is applied to several prokaryotic genome sequences. Two Bayesian classifiers are designed to evaluate the performance of the proposed algorithm. The simulation results show that the algorithm efficiently and accurately identifies protein coding regions, with sensitivity and specificity values comparable to those of well-known gene detection methods for prokaryotes such as GLIMMER and GeneMark. This further supports the relevance of communications theory concepts for genomic sequence analysis.

Introduction

The rapid advances in the field of computational genomics and bioinformatics have motivated the development of innovative engineering methods for data acquisition, interpretation, and analysis. Techniques from information theory [1], [2], [3], [4], [5], [6], [7], communications [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], coding theory [5], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], signal processing [31], [32], [33], [34], [35], [36], machine learning [37], and various statistical methods [38], [39], [40], [41], [42], [43], [44] have been actively researched for use in gene detection, genomic sequence analysis, and alignment. The analyses made possible by these methods allow the testing of different biological aspects related to the process of gene expression. For example, they can help determine whether certain regions of a given genome are protein-coding sequences (i.e., gene detection). These methods promote new interdisciplinary collaborations in research and education, integrating biomedical engineering, electrical engineering, and the life sciences. The knowledge gained can help address fundamentally important issues that cannot be explored systematically and quantitatively by experimentation alone. Moreover, it can reduce the consumption of laboratory resources, minimize time-consuming laboratory experimentation, and lead to a better understanding of complex genetic processes.

A DNA sequence can be divided into two types of regions: genic and intergenic. Genes are the segments of DNA that contain the coding information required for protein synthesis. A major goal of genomic research is to understand the nature and role of the coding and noncoding information embedded in the DNA sequence structure. A crucial step toward this goal is the detection of gene locations across the entire DNA sequence. Several diverse methods have been proposed in the literature for gene detection in prokaryotes. Examples include probabilistic methods [45], [46], statistical methods [47], [48], [49], and other computational techniques such as machine learning [50], free energy calculations [51], support vector machines [52], Bayesian methods [33], information theory [53], hidden Markov models such as GeneMark [38], [39], [40], [43], [44], [54], and interpolated Markov models such as GLIMMER [55].

The algorithm employed in GLIMMER is described in detail in Ref. [56]. The method employs an interpolated Markov model to identify coding regions. Specifically, the algorithm first identifies open reading frames (ORFs) of sufficient length that are most likely to be coding, which gives an initial model of the coding regions of the organism. This information is subsequently used in a Markov chain to locate all other coding sequences. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark, on the other hand, is an HMM-like algorithm. It introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequences, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in the two DNA strands simultaneously. The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of being “protein-coding” (carrying genetic code) in each of the six possible reading frames (including three frames in the complementary DNA strand) or being “non-coding”.
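The first step attributed to GLIMMER above, collecting sufficiently long ORFs, can be illustrated with a minimal sketch. This is not GLIMMER's implementation: the function name `find_orfs`, the restriction to the forward strand, the use of ATG as the only start codon, and the `min_len` threshold are all assumptions made for illustration.

```python
# Hypothetical helper, for illustration only (not part of GLIMMER).
START = "ATG"                      # assumption: only ATG is treated as a start codon
STOPS = {"TAA", "TAG", "TGA"}      # standard stop codons

def find_orfs(seq, min_len=90):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is inclusive, end is exclusive and includes the stop codon.
    Only ORFs spanning at least min_len nucleotides are reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:   # keep only long enough ORFs
                    orfs.append((start, i + 3, frame))
                start = None                   # close the ORF at the stop codon
    return orfs
```

Running the same scan on the reverse complement would cover the remaining three reading frames mentioned in the GeneMark description.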

The design of a general gene identification algorithm is a compelling research problem. The gene identification method presented in this work utilizes a particular property of the DNA protein coding regions, specifically the period-3 property [57], [58], [59], [60], [61], [62], [63], in a novel approach using concepts from communications theory. The period-3 pattern is generally considered a strong indication of coding regions. By mapping DNA sequences to digital signals, standard digital signal processing (DSP) techniques such as the discrete Fourier transform (DFT) [57], [58], [62], digital filtering [64], [65], wavelet transforms [66], Markov modeling [67], and IIR filtering [63] have shown good performance in detecting this period-3 behavior and, therefore, in identifying coding regions. However, the efficiency of these methods in detecting the period-3 property and suppressing background noise comes at the expense of increased computational complexity. In our previous work [35], we proposed a novel algorithm for identifying protein-coding regions in DNA sequences based on the period-3 property. The algorithm in Ref. [35] identifies protein-coding regions by applying a digital correlating and filtering process to the entire genomic sequence under study. The algorithm proposed in this work is both an enhanced and a generalized version of the work in Ref. [35] in terms of methodology, performance, and experimental validation.

This paper proposes a novel application of principles and techniques from communications theory and digital signal processing for the detection and identification of protein coding regions in prokaryotic genomes. The proposed algorithm employs polyphase complex mapping to provide a numerical representation of the genomic sequences involved in the analysis and then uses basic concepts from communications theory and digital signal processing, such as correlation, maximal ratio combining (MRC) algorithms, and filtering, to generate a signal whose peaks signify locations of coding regions and whose troughs signify locations of noncoding regions. The proposed gene detection algorithm is applied to the complete genome sequences of several prokaryotes (e.g., the E. coli strains MG1655 and O157:H7). Moreover, two Bayesian classifiers are designed to evaluate the performance of the proposed gene detection algorithm and compare it to well-known gene detection methods for prokaryotes. The simulation results show that the proposed algorithm can efficiently and accurately identify protein coding regions, with sensitivity and specificity values comparable to those of well-known gene detection methods for prokaryotes such as GLIMMER and GeneMark. This further supports the relevance of communications theory concepts for genomic sequence analysis. Moreover, the proposed algorithm does not require any prior information about the coding regions, unlike the DFT method described in Ref. [57]. It can sharply extract the period-3 component and hence effectively identify protein coding regions in the whole genomic sequences of prokaryotes. In addition, it can effectively suppress the background 1/f noise with no added computational complexity.
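To make the first stages of this pipeline concrete, the following is a rough sketch of a complex mapping followed by a sliding correlation and a smoothing filter. The specific phase assignment (A, C, G, T mapped to 1, j, -1, -j), the function names, and the moving-average filter are assumptions chosen for illustration, since this excerpt does not show the paper's exact mapping or filter. The MRC combining stage is omitted.

```python
import numpy as np

# Assumed mapping of the four bases onto four phases of the unit circle.
# The paper's actual polyphase assignment is not shown in this excerpt.
PHASE_MAP = {"A": 1 + 0j, "C": 0 + 1j, "G": -1 + 0j, "T": 0 - 1j}

def map_sequence(seq):
    """Convert a DNA string into a complex-valued numerical sequence."""
    return np.array([PHASE_MAP[b] for b in seq.upper()], dtype=complex)

def correlate_with_template(x, codon, repeats):
    """Magnitude of the sliding correlation between x and a period-3 template.

    The template is the mapped codon repeated `repeats` times (hypothetical
    stand-in for one period-3 based subsequence). np.correlate conjugates its
    second argument, so this is a proper complex correlation.
    """
    template = map_sequence(codon * repeats)
    return np.abs(np.correlate(x, template, mode="same")) / len(template)

def smooth(signal, width):
    """Moving-average filter (assumed choice) to suppress rapid fluctuations."""
    return np.convolve(signal, np.ones(width) / width, mode="same")
```

With unit-magnitude symbols the normalized correlation is at most 1, and it reaches 1 only where the input matches the template exactly, which is why peaks of such a signal can flag period-3 structure.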

The rest of this paper is organized as follows. Section 2 highlights the so-called period-3 behavior and how it can be detected using digital signal processing techniques (such as the DFT) to locate protein-coding regions in the genomic structure. Section 3 gives a detailed mathematical description of the proposed gene detection algorithm. The method used to detect peaks (corresponding to coding regions) and troughs (corresponding to noncoding regions) is described in Algorithm I. Section 3 also describes two period-3 based Bayesian classifiers that are designed to evaluate the performance of the proposed algorithm. The period-3 based classification system is described in Algorithm II. Simulation results are shown and discussed in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Protein coding region identification using the period-3 property

Many articles have emphasized that the coding regions of DNA possess a period-3 property caused by codon biases in the translation of codons into amino acids. This fundamental characteristic is not detected outside the coding regions, and hence can be utilized to locate the coding regions in the entire genomic structure [58], [62]. This observation can be traced back to the work of Trifonov and Sussman [68].
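As an illustration of how the period-3 property is commonly measured with the DFT, the sketch below evaluates the spectral power at frequency k = N/3 in a sliding window, summed over the four base-indicator sequences. This is the generic DFT-based measure, not the algorithm proposed in this paper. The function name and the default window and step sizes are assumptions.

```python
import numpy as np

def period3_power(seq, window=351, step=3):
    """Sliding-window DFT power at k = N/3, summed over the four bases.

    `window` should be a multiple of 3 so that k = N/3 is an integer bin.
    Returns the window start positions and the corresponding power values.
    """
    seq = seq.upper()
    n = np.arange(window)
    kernel = np.exp(-2j * np.pi * n / 3)   # DFT kernel at frequency N/3
    # Binary indicator sequence for each base: 1 where the base occurs.
    indicators = {b: np.array([c == b for c in seq], dtype=float) for b in "ACGT"}
    positions, power = [], []
    for start in range(0, len(seq) - window + 1, step):
        total = 0.0
        for u in indicators.values():
            total += abs(np.dot(u[start:start + window], kernel)) ** 2
        positions.append(start)
        power.append(total)
    return np.array(positions), np.array(power)
```

Windows with strong period-3 structure yield large values, while a sequence with no three-periodicity (a homopolymer, for instance) yields a value near zero.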

The period-3 property implies a clear short-range correlation behavior in

The proposed gene detection algorithm

Fig. 3 shows a schematic system-like representation of the proposed algorithm. The input parameter of the proposed detection algorithm is the genomic sequence under study, g, of length Lx. The output parameters are three sequences, f[n], p[n], and t[n], whose lengths are the same as the length of the input test sequence, Lx. The sequence f[n] represents the correlation of the genomic sequence g with twenty-four hypothetically generated period-3 based subsequences after passing through a maximal

Simulation results and analysis

In order to demonstrate the fidelity and biological significance of the proposed gene detection algorithm, it is applied to several prokaryotic genome sequences. For example, the complete genome sequences of Escherichia coli bacterial strains MG1655 and O157:H7 are used as input test sequences. Such sequences are available at the NCBI [77].
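Performance against an annotated genome is usually summarized by sensitivity and specificity. The sketch below assumes the nucleotide-level definitions common in the gene-finding literature, Sn = TP/(TP+FN) and Sp = TP/(TP+FP). The paper's exact definitions are not shown in this excerpt, and the helper names and interval format are hypothetical.

```python
def intervals_to_mask(intervals, length):
    """Turn (start, end) half-open intervals into a per-nucleotide boolean list."""
    mask = [False] * length
    for start, end in intervals:
        for i in range(max(start, 0), min(end, length)):
            mask[i] = True
    return mask

def nucleotide_metrics(predicted, annotated):
    """Return (sensitivity, specificity) from two per-nucleotide boolean masks.

    Assumed definitions: Sn = TP / (TP + FN), Sp = TP / (TP + FP).
    A ratio with a zero denominator is reported as 0.0.
    """
    tp = sum(1 for p, a in zip(predicted, annotated) if p and a)
    fp = sum(1 for p, a in zip(predicted, annotated) if p and not a)
    fn = sum(1 for p, a in zip(predicted, annotated) if a and not p)
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp
```

For example, a prediction covering positions 4 to 7 against an annotation covering positions 2 to 5 shares two nucleotides with it, so both measures equal 0.5 under these definitions.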

The length of each one of the twenty-four hypothetical period-3 based subsequences, si[n], is selected to be 1950 (i.e. the number of repetitions, Lr, is

Conclusions

This work proposes a novel application of principles and concepts from communications theory and digital signal processing for the detection of protein coding regions in prokaryotic genomes. The proposed gene detection algorithm employs polyphase complex mapping to provide a numerical representation of the genomic sequences involved in the analysis, and then uses basic concepts from communications theory and digital signal processing such as correlation, maximal ratio combining (MRC) algorithms and

References (78)

  • T.D. Schneider, Measuring molecular information, J. Theor. Biol. (1999)
  • V.R. Chechetkin et al., Size-dependence of three-periodicity and long-range correlations in DNA sequences, Phys. Lett. A (1995)
  • J.W. Fickett, The gene identification problem: an overview for developers, Comput. Chem. (1996)
  • K.B. Murray et al., Wavelet transforms for the characterization and detection of repeating motifs, J. Mol. Biol. (2002)
  • A. Lohmann, Systems and transforms with applications in optics, J. Franklin Inst. (1969)
  • M. Burset et al., Evaluation of gene structure prediction programs, Genomics (1996)
  • G. Atkins, Information Theory and Molecular Biology (1993)
  • G. Battail, Information theory and error-correcting codes in genetics and biological evolution, Introd. to Biosemiotics New Biol. Synth (2007)
  • L. Gong et al., Information-theoretic bounds of evolutionary processes modeled as a protein communication system, IEEE Work. Stat. Signal Process. Proc. (2007)
  • P. Hanus et al., Information and communication theory in molecular biology, Electr. Eng. (2007)
  • M. Al Bataineh, L. Huang, G.E. Atkin, Transcription Factor Binding Site Detection Algorithm Using Distance Metrics...
  • M. Al Bataineh et al., TFBS detection algorithm using distance metrics based on center of mass and polyphase mapping, 2012 7th Int. Symp. Heal. Informatics Bioinforma (2012)
  • M. Al Bataineh, Analysis of Genomic Translation Using a Communications Theory Approach (2010)
  • E.E. May, Communication theory and molecular biology at the crossroads, IEEE Eng. Med. Biol. Mag. (2006)
  • M. Al Bataineh et al., Ribosome binding model using a codebook and exponential metric, 2007 IEEE Int. Conf. Electro/Inform. Technol. (2007)
  • M. Al Bataineh, M. Alonso, S. Wang, G.E. Atkin, W. Zhang, An Optimized Ribosome Binding Model Using Communication...
  • M. Al Bataineh et al., Effect of mutations on the detection of translational signals based on a communications theory approach, Conf. Proc. IEEE Eng. Med. Biol. Soc. (2009)
  • M. Al Bataineh et al., Analysis of gene translation using a communications theory approach, Adv. Exp. Med. Biol. (2010)
  • L. Huang, M. Al Bataineh, G.E. Atkin, M. Parra, M. del Mar Perez, I. Mohammed, W. Zhang, M. Parra, M. Perez,...
  • M. Al Bataineh et al., Gene expression analysis using communications, coding and information theory based models, BIOCOMP’09–2009 Int. Conf. Bioinf. Comput. Biol. (2009)
  • J. Weindl et al., Applying techniques from frame synchronization for biological sequence analysis, IEEE Int. Conf. Commun. (2007)
  • E.E. May, Analysis of Coding Theory Based Models for Initiating Protein Translation in Prokaryotic Organisms (2002)
  • E.E. May, Comparative Analysis of Information Based Models for Initiating Protein Translation in Escherichia coli K-12,...
  • Z. Dawy et al., On genomic coding theory, Eur. Trans. Telecommun. (2007)
  • G.L. Rosen, Examining coding structure and redundancy in DNA, IEEE Eng. Med. Biol. Mag. (2005)
  • D.A. Mac Donaill, Why nature chose A, C, G and U/T: an error-coding perspective of nucleotide alphabet composition, Orig. Life Evol. Biosph. (2003)
  • M.K. Gupta, The quest for error correction in biology, IEEE Eng. Med. Biol. Mag. (2006)
  • G.L. Rosen, J.D. Moore, Investigation of coding structure in DNA, 2003 IEEE Int. Conf. Acoust. Speech, Signal Process....
  • N. Stambuk, On circular coding properties of gene and protein sequences, Croat. Chem. Acta (1999)

Mohammad F. Al Bataineh was born in Irbid, Jordan in 1979. He received his B.S. degree in Telecommunications Engineering with high honors from Yarmouk University, Jordan, in 2003. He received his M.S. and PhD degrees in Electrical Engineering with excellent distinction from Illinois Institute of Technology (IIT) in 2006 and 2010, respectively. His research interests are focused on the application of communications, coding theory, and information theory concepts to the interpretation and understanding of information flow in biological systems such as gene expression. Since September 2010, Mohammad Al Bataineh has been with the Telecommunications Engineering Department at Yarmouk University, Jordan, where he is currently an assistant professor. He teaches undergraduate courses in Signals and Systems, Analog Communications, Digital Communications, and Probability and Random Processes, as well as graduate courses in Digital Signal Processing and in Information Theory and Coding.

Zouhair Al-qudah was born in Irbid, Jordan in 1979. He received his B.S. degree in Telecommunications Engineering from Yarmouk University, Jordan, in 2002. He received his M.S. degree in Electrical Engineering, with emphasis on digital communications and signal processing for wireless communication, from Kalmar University College, Sweden in 2006. He received his PhD degree in Electrical Engineering from Southern Methodist University at Dallas, Texas in 2013. Since August 2013, he has been with Al-Hussein Bin Talal University at Ma'an, Jordan, where he is currently an Assistant Professor. His research interests span various aspects of multipath fading channels, including multiuser information theory, interference cancellation techniques, and practical coding techniques for the dirty paper problem.
