A multi-approaches-guided genetic algorithm with application to operon prediction

https://doi.org/10.1016/j.artmed.2007.07.010

Summary

Objective

The prediction of operons is critical to the reconstruction of regulatory networks at the whole-genome level. Multiple genome features have been used for predicting operons. In the literature, however, these features are usually all handled by a single method. The aim of this paper is to develop a combined method for operon prediction that preprocesses each genome feature with a different method, so that the unique characteristics of each feature can be exploited.

Methods

A novel multi-approach-guided genetic algorithm for operon prediction is presented. We apply different methods to intergenic distance, cluster of orthologous groups (COG) gene functions, metabolic pathways, and microarray expression data. A novel local-entropy-minimization method is proposed to partition the intergenic distances. Our program can be used for other newly sequenced genomes by transferring the knowledge obtained from Escherichia coli data. We calculate the log-likelihood for COG gene functions and the Pearson correlation coefficient for microarray expression data. The genetic algorithm is used to integrate the four types of data.

Results

The proposed method is evaluated on the E. coli K12, Bacillus subtilis, and Pseudomonas aeruginosa PAO1 genomes. The prediction accuracies for these three genomes are 85.9987%, 88.296%, and 81.2384%, respectively.

Conclusion

Simulation results demonstrate that, within the genetic algorithm, preprocessing the genome data with multiple approaches ensures the effective use of different biological characteristics. The results also show that the proposed method is applicable to operon prediction in prokaryotes.

Introduction

An operon is a string of one or more genes that is transcribed as a single unit and lies on one strand of a genomic sequence. Understanding the operon map of a whole genome is therefore critical to the reconstruction of regulatory networks and to whole-genome research. The functions of operons reveal valuable information for studying protein function and for drug design [1], [2], [3]. However, identifying operons by experimental methods alone is difficult, so it is important to develop efficient computational methods that use as much of the available biological information as possible.

In general, genes within an operon have the following properties: they have much shorter intergenic distances than genes at the borders of transcription units; they usually belong to the same functional category or have related functions; they show highly correlated expression patterns in microarray expression data [4]; and adjacent gene pairs in an operon tend to be well conserved across phylogenetically related species. These properties can be exploited for operon prediction.
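As an illustration of the co-expression property, the following minimal Python sketch computes the Pearson correlation coefficient between the expression profiles of two adjacent genes. The function name `pearson` and the treatment of flat profiles are assumptions made for this example and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant profile carries no co-expression signal.
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 for an adjacent gene pair is consistent with the two genes being co-transcribed, while a value near 0 or below gives no such support.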

Many computational algorithms for operon prediction have been developed in the past decade [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. Yada et al. [5] proposed a method to detect the promoter and terminator sequences at operon boundaries. Salgado et al. [6] developed an approach that uses the intergenic distances between gene pairs within operons and at transcription unit borders. Overbeek et al. [7] presented a method to search for conserved gene pairs across multiple genomes. Zheng et al. [8] applied graph representations to biochemical pathways to automatically detect neighboring enzyme clusters, which are candidate operons in the genome. Sabatti et al. [9] applied a Bayesian classification method to microarray gene expression data. Westover et al. [10] proposed a method that requires no training set. Edwards et al. [11] suggested a new graph-algorithmic approach to operon prediction based on comparative genomics using conserved genomic context.

With the increase in available data sources, Craven et al. [12] developed naive Bayes models that used a rich variety of data types, including sequence data, gene expression data, and functional annotations associated with genes, to estimate the probability that a given sequence of genes constitutes an operon. A dynamic programming method was then applied to construct an operon map for an entire genome or part of it. However, the data used in this method come from a single genome only, so the method is not suitable for predicting operons in other genomes.

Chen et al. [13] developed a comparative genomics approach that applied a neural network to intergenic distance, COG function, and phylogenetic profile. They first estimated a log-likelihood distribution for each type of genomic data and then used the neural network to discriminate pairs of adjacent genes within operons (WO pairs) from those across transcription unit borders (TUB pairs). This work indicated that phylogenetic profiles are useful data for predicting operons. Dam and co-workers [14] improved this method by incorporating several features of experimentally verified gene pairs in Escherichia coli, including the ratio of gene lengths, the frequency of G and TT in the intergenic regions, the phylogenetic profiles, the conservation of gene neighborhood across multiple genomes, the correlation of gene expression profiles, and the functional relationship between the genes in a pair. This approach improved the accuracy of gene pair classification by as much as 7–8%.
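The log-likelihood idea for WO and TUB pairs can be sketched as follows for a categorical feature such as whether two adjacent genes share a COG functional category. This is a minimal sketch, not the formula from [13] or from this paper, which is not given here: the score log(P(value | WO) / P(value | TUB)), the add-one smoothing, and the names `log_likelihood_scores` and `pseudocount` are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood_scores(wo_values, tub_values, pseudocount=1.0):
    """Return {value: log(P(value | WO) / P(value | TUB))} for a categorical feature.

    wo_values / tub_values: feature values observed for known WO and TUB pairs.
    pseudocount: additive smoothing so unseen values do not give log(0)
    (an assumption of this sketch).
    """
    values = set(wo_values) | set(tub_values)
    wo_counts = Counter(wo_values)
    tub_counts = Counter(tub_values)
    wo_total = len(wo_values) + pseudocount * len(values)
    tub_total = len(tub_values) + pseudocount * len(values)
    scores = {}
    for value in values:
        p_wo = (wo_counts[value] + pseudocount) / wo_total
        p_tub = (tub_counts[value] + pseudocount) / tub_total
        scores[value] = math.log(p_wo / p_tub)
    return scores


# Hypothetical training labels: "same" means both genes share a COG category.
wo_pairs = ["same"] * 8 + ["different"] * 2
tub_pairs = ["same"] * 3 + ["different"] * 7
scores = log_likelihood_scores(wo_pairs, tub_pairs)
```

With these made-up counts, the score for "same" is positive and the score for "different" is negative, so a positive score favors the WO class.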

In a recent paper, Jacob et al. [15] proposed a fuzzy-guided genetic algorithm for operon prediction. This method uses a genetic algorithm to evolve an initial population in which each individual represents a putative operon map of a genome. Each putative operon is scored using intuitive fuzzy rules, and a high prediction accuracy can be obtained.

In Jacob's approach, however, the fuzzy rules are intuitive, which makes them difficult for non-specialists to create. Moreover, the biological characteristics of the genome data cannot be explored well when the same method is used to assess every type of genome data. In this paper, we propose a novel method that assesses different features with different algorithms. We use intergenic distance, participation in the same metabolic pathway, COG gene functions, and microarray expression data to predict operons. A novel local-entropy-minimization (LEM) method is used to assess the “fitness” of each adjacent gene pair based on the intergenic distance. LEM divides the intergenic distances into several intervals and assigns a score to each interval. A COG function log-likelihood is computed for each adjacent gene pair, and the correlation coefficient of the microarray expression values is calculated. A genetic algorithm is used to evolve an initial population. Each individual is created by clustering based on intergenic distances with different thresholds.
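The creation of individuals by distance-based clustering can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the gene tuple layout, the gap definition (next start minus previous end minus one), the same-strand requirement taken from the operon definition, the use of one threshold per individual, and the names `cluster_by_distance` and `initial_population` are all hypothetical.

```python
def cluster_by_distance(genes, threshold):
    """Group genes into putative operons by intergenic distance.

    genes: list of (name, start, end, strand) tuples sorted by start position.
    A gene joins the current operon when it is on the same strand as the
    previous gene and the gap between them is at most `threshold` bases.
    """
    operons = []
    for gene in genes:
        _, start, _, strand = gene
        if operons:
            prev = operons[-1][-1]
            # Assumed gap definition; negative when the two genes overlap.
            gap = start - prev[2] - 1
            if prev[3] == strand and gap <= threshold:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


def initial_population(genes, thresholds):
    """One putative operon map (individual) per distance threshold."""
    return [cluster_by_distance(genes, t) for t in thresholds]


# Hypothetical toy genome with four genes.
toy_genes = [
    ("a", 1, 100, "+"),
    ("b", 120, 300, "+"),
    ("c", 500, 700, "+"),
    ("d", 710, 900, "-"),
]
population = initial_population(toy_genes, thresholds=[50, 200])
```

Varying the threshold yields operon maps of different granularity, which gives a genetic algorithm a diverse starting population to evolve.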

Like most predictors, we use the E. coli K12 and Bacillus subtilis genomes to examine the prediction ability of the presented method. In addition, we test our predictor on Pseudomonas aeruginosa PAO1. P. aeruginosa is a versatile Gram-negative bacterium that grows in soil, marshes, and coastal marine habitats, as well as on plant and animal tissues. It is noted for its environmental versatility, its ability to cause disease in particularly susceptible individuals, and its resistance to antibiotics. P. aeruginosa has been studied previously by many scientists [21]. Research on this genome is valuable for developing new antibacterial drugs to treat infections by bacteria such as P. aeruginosa that are resistant to many common antibiotics. Our work on operon prediction could be useful for understanding the function and regulation of prokaryotes.

Data preparation

All completed microbial genome data can be downloaded from the GenBank database (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html (accessed: 1 March 2006)). The operon databases can be obtained from RegulonDB (http://regulondb.ccg.unam.mx/index.html (accessed: 25 May 2006)) [22] and ODB (http://odb.kuicr.kyoto-u.ac.jp/ (accessed: 30 October 2006)) [23]. The related genomic information on the three databases is listed in Table 1.

We use SQL Server 2000 and a Perl program to extract gene

Experimental results

We develop a Visual C++ program to implement the proposed algorithm. To examine the performance of the method, we apply our predictor to three bacterial genomes: E. coli K12, B. subtilis, and P. aeruginosa PAO1. Because the predicted and experimental operons may differ in size, it is unreasonable to compare them operon by operon. Therefore, we use the performance measure used by many earlier researchers [13], [15], [16] to examine the proposed method. In this paper, we only

Conclusions and discussions

It has been reported that an effective way to predict operons is to use different kinds of biological information. In this paper, we apply different approaches to different kinds of genome information in order to exploit their unique biological characteristics. This differs from conventional methods, in which all kinds of biological information are handled by the same method. We develop a local-entropy-minimization-based method to evaluate the intergenic distance. The score of intervals

Acknowledgements

The authors would like to thank the members of the Bioinformatics Group of JLU and UGA for their invaluable assistance and discussions. The authors are grateful for the support of the National Natural Science Foundation of China (60433020, 60673023, 60673099), the science–technology development project of Jilin Province of China (20050705-2), the “985” project of Jilin University of China, and the Science Foundation for Young Teachers of Northeast Normal University (20070104).

References (26)

  • P. Yeh et al. Functional classification of drugs by properties of their pairwise interactions. Nat Genet (2006)
  • P. Aloy et al. Structural systems biology: modelling protein interactions. Nat Rev Mol Cell Biol (2006)
  • S. Gon et al. A novel regulatory mechanism couples deoxyribonucleotide synthesis and DNA replication in Escherichia coli. EMBO J (2006)
  • X. Chen et al. Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res (2004)
  • T. Yada et al. Modeling and predicting transcriptional units of Escherichia coli genes using Hidden Markov models. Bioinformatics (1999)
  • H. Salgado et al. Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci (2000)
  • R. Overbeek et al. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci (1999)
  • Y. Zheng et al. Computational identification of operons in microbial genomes. Genome Res (2002)
  • C. Sabatti et al. Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res (2002)
  • B.P. Westover et al. Operon prediction without a training set. Bioinformatics (2005)
  • M.T. Edwards et al. A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context. Nucleic Acids Res (2005)
  • M. Craven et al. A probabilistic learning approach to whole-genome operon prediction
  • X. Chen et al. Computational prediction of operons in Synechococcus sp. WH8102. Genome Inform (2004)