A multi-approaches-guided genetic algorithm with application to operon prediction
Introduction
An operon is a string of one or more genes that lie on the same strand of a genomic sequence and are transcribed as a single unit. Understanding the operon map of a whole genome is therefore critical to reconstructing regulatory networks and to genome-wide research. The functions of operons reveal valuable information for studying protein function and drug design [1], [2], [3]. However, identifying operons by experimental methods alone is difficult, so it is important to develop efficient computational methods that use as much of the available biological information as possible.
In general, genes within an operon have the following properties: they have much shorter intergenic distances than genes at the borders of transcription units; they usually belong to the same functional category or have related functions; they show highly correlated expression patterns in microarray expression data [4]; and adjacent gene pairs in an operon tend to be well conserved across phylogenetically related species. These properties can be exploited for operon prediction.
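As an illustration of the co-expression property, the sketch below computes the Pearson correlation coefficient between the expression profiles of two adjacent genes. It is a minimal example, not the authors' implementation: the function name `pearson`, the expression values, and the choice to return 0.0 for a flat profile are all assumptions made here for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant profile is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values across four microarray conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]   # rises with gene_a
gene_c = [4.0, 3.0, 2.0, 1.0]   # falls as gene_a rises

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

A value of `r_ab` close to 1 would count as evidence that the two genes are co-transcribed, while a strongly negative value such as `r_ac` would count against it.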
Many computational algorithms for operon prediction have been developed in the past decade [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. Yada et al. [5] proposed a method to detect the promoter and terminator sequences at operon boundaries. Salgado et al. [6] developed an approach that exploits the difference in intergenic distance between gene pairs within operons and gene pairs at transcription unit borders. Overbeek et al. [7] presented a method to search for conserved gene pairs across multiple genomes. Zheng et al. [8] applied graph representations to biochemical pathways to automatically detect neighboring enzyme clusters, which are candidate operons in the genome. Sabatti et al. [9] applied a Bayesian classification method to gene microarray expression data. Westover et al. [10] proposed a method that requires no training set. Edwards et al. [11] suggested a new graph-algorithmic approach to operon prediction based on comparative genomics using conserved genomic context.
With the increase in available data sources, Craven et al. [12] developed naive Bayes models that used a rich variety of data types, including sequence data, gene expression data, and functional annotations associated with genes, to estimate the probability that a given sequence of genes constitutes an operon. A dynamic programming method was then applied to construct an operon map for an entire genome or part of it. However, the data used in this method come from a single genome, so the method is not suitable for predicting operons in other genomes.
Chen et al. [13] developed a comparative genomics approach that applied a neural network to intergenic distance, COG function, and phylogenetic profile. They first estimated a log-likelihood distribution for each type of genomic data and then used the neural network to discriminate pairs of adjacent genes within operons (WO pairs) from those across transcription unit borders (TUB pairs). This work indicated that phylogenetic profiles are useful data for predicting operons. Dam and co-workers [14] improved this method by incorporating several features of experimentally verified gene pairs in Escherichia coli, including the ratio of gene lengths, the frequency of G and TT in the intergenic regions, the phylogenetic profiles, the conservation of gene neighborhood across multiple genomes, the correlation of gene expression profiles, and the functional relationship between the genes in a pair. This approach improved the accuracy of gene pair classification by as much as 7–8%.
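One common way to turn a categorical feature into a log-likelihood score is to compare how often each feature value occurs among known WO pairs and among known TUB pairs. The sketch below shows that general idea only. It is not taken from Chen et al. or from this paper: the function name `log_likelihood_table`, the pseudocount smoothing, and the training labels are assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood_table(wo_values, tub_values, pseudocount=1.0):
    """Map each feature value v to log(P(v | WO) / P(v | TUB)).

    A pseudocount is added to every count (an assumption made here) so that
    values unseen in one class do not produce a division by zero.
    """
    categories = set(wo_values) | set(tub_values)
    wo_counts = Counter(wo_values)
    tub_counts = Counter(tub_values)
    wo_total = len(wo_values) + pseudocount * len(categories)
    tub_total = len(tub_values) + pseudocount * len(categories)
    table = {}
    for value in categories:
        p_wo = (wo_counts[value] + pseudocount) / wo_total
        p_tub = (tub_counts[value] + pseudocount) / tub_total
        table[value] = math.log(p_wo / p_tub)
    return table


# Hypothetical training labels: do the two genes of a pair share a COG category?
wo_pairs = ["same"] * 8 + ["different"] * 2     # pairs within known operons
tub_pairs = ["same"] * 3 + ["different"] * 7    # pairs across unit borders

scores = log_likelihood_table(wo_pairs, tub_pairs)
```

With these made-up counts, `scores["same"]` is positive (the value is more typical of WO pairs) and `scores["different"]` is negative.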
In a recent paper, Jacob et al. [15] proposed a fuzzy-guided genetic algorithm for operon prediction. This method used a genetic algorithm to evolve an initial population that represents putative operon maps of a genome. Each putative operon is scored using intuitive fuzzy rules, and high prediction accuracy can be obtained.
Jacob's approach, however, relies on intuitive fuzzy rules, which are difficult for non-specialists to create. Moreover, the biological characteristics of the genome data cannot be explored well when the same method is used to assess every type of genome data. In this paper, we propose a novel method that assesses different features with different algorithms. We use intergenic distance, participation in the same metabolic pathway, COG gene functions, and microarray expression data to predict operons. A novel local-entropy-minimization (LEM) method assesses the “fitness” of each adjacent gene pair based on the intergenic distance: LEM divides the intergenic distances into several intervals and assigns a score to each interval. A COG function log-likelihood is computed for each adjacent gene pair, and the correlation coefficient of the microarray expression values is calculated. A genetic algorithm is used to evolve an initial population, each individual of which is created by clustering based on intergenic distances with different thresholds.
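The population-initialization idea, one putative operon map per distance threshold, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' Visual C++ implementation: the `Gene` record, the 1-based inclusive coordinates, the rule that genes on opposite strands are never merged, and the example thresholds are all hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based inclusive start coordinate (assumed convention)
    end: int     # 1-based inclusive end coordinate
    strand: str  # "+" or "-"


def intergenic_distance(upstream: Gene, downstream: Gene) -> int:
    """Number of bases between two adjacent genes (negative if they overlap)."""
    return downstream.start - upstream.end - 1


def cluster_by_threshold(genes: List[Gene], threshold: int) -> List[List[Gene]]:
    """Group position-sorted genes into putative operons.

    A gene joins the current operon when it is on the same strand as the
    previous gene and the gap between them is at most `threshold` bases.
    """
    operons: List[List[Gene]] = []
    for gene in genes:
        if operons:
            prev = operons[-1][-1]
            same_strand = prev.strand == gene.strand
            if same_strand and intergenic_distance(prev, gene) <= threshold:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


def initial_population(genes: List[Gene], thresholds: List[int]):
    """Create one individual (a putative operon map) per distance threshold."""
    ordered = sorted(genes, key=lambda g: g.start)
    return [cluster_by_threshold(ordered, t) for t in thresholds]


# Hypothetical genes; coordinates are made up for illustration.
genes = [
    Gene("g1", 100, 400, "+"),
    Gene("g2", 420, 900, "+"),    # 19 bases after g1
    Gene("g3", 1100, 1500, "+"),  # 199 bases after g2
    Gene("g4", 1520, 1900, "-"),  # 19 bases after g3, but opposite strand
]
population = initial_population(genes, thresholds=[50, 250])
maps = [[[g.name for g in operon] for operon in ind] for ind in population]
```

Here the stricter threshold splits `g3` into its own operon, while the looser threshold merges it with `g1` and `g2`; a genetic algorithm could then recombine and score such alternative maps.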
Like most predictors, we use the E. coli K12 and Bacillus subtilis genomes to examine the predictive ability of the presented method. We also test our predictor on Pseudomonas aeruginosa PAO1. P. aeruginosa is a versatile Gram-negative bacterium that grows in soil, marshes, and coastal marine habitats, as well as on plant and animal tissues. It is noted for its environmental versatility, its ability to cause disease in particular susceptible individuals, and its resistance to antibiotics. P. aeruginosa has been studied by many scientists [21]. Research on this genome is valuable for developing new antibacterial drugs to treat infections by bacteria such as P. aeruginosa that are resistant to many popular antibiotics. Our work on operon prediction could be useful for understanding the function and regulation of prokaryotes.
Section snippets
Data preparation
All completed microbial genome data can be downloaded from the GenBank database (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html (accessed: 1 March 2006)). The operon databases can be obtained from RegulonDB (http://regulondb.ccg.unam.mx/index.html (accessed: 25 May 2006)) [22] and ODB (http://odb.kuicr.kyoto-u.ac.jp/ (accessed: 30 October 2006)) [23]. The related genomic information on the three databases is listed in Table 1.
We use SQL Server 2000 and a Perl program to extract gene
Experimental results
We develop a Visual C++ program to implement the proposed algorithm. To examine the performance of the method, we apply our predictor to three bacterial genomes: E. coli K12, B. subtilis, and P. aeruginosa PAO1. Because the predicted and experimental operons may differ in size, it is unreasonable to compare them operon by operon. Therefore, we use the performance measurement that was used by many earlier researchers [13], [15], [16] to examine the proposed method. In this paper, we only
Conclusions and discussions
It has been reported that an effective way to predict operons is to use different kinds of biological information. In this paper, we apply different approaches to different kinds of genome information in order to exploit their unique biological characteristics. This differs from conventional methods, in which all kinds of biological information are handled by a single method. We develop a local-entropy-minimization-based method to evaluate the intergenic distance. The score of intervals
Acknowledgements
The authors would like to thank the members of the Bioinformatics Groups of JLU and UGA for their invaluable assistance and discussions. The authors are grateful for the support of the National Natural Science Foundation of China (60433020, 60673023, 60673099), the science–technology development project of Jilin Province of China (20050705-2), the “985” project of Jilin University of China, and the Science Foundation for Young Teachers of Northeast Normal University (20070104).
References (26)
- et al. Functional classification of drugs by properties of their pairwise interactions. Nat Genet (2006)
- et al. Structural systems biology: modelling protein interactions. Nat Rev Mol Cell Biol (2006)
- et al. A novel regulatory mechanism couples deoxyribonucleotide synthesis and DNA replication in Escherichia coli. EMBO J (2006)
- et al. Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res (2004)
- et al. Modeling and predicting transcriptional units of Escherichia coli genes using Hidden Markov models. Bioinformatics (1999)
- et al. Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci (2000)
- et al. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci (1999)
- et al. Computational identification of operons in microbial genomes. Genome Res (2002)
- et al. Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res (2002)
- et al. Operon prediction without a training set. Bioinformatics (2005)
- A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context. Nucleic Acids Res
- A probabilistic learning approach to whole-genome operon prediction
- Computational prediction of operons in Synechococcus sp. WH8102. Genome Inform