Abstract
We present a new method for segmenting nucleotide sequences into regions with different average composition. The sequence is modelled as a series of segments; within each segment, the sequence is treated as a random sequence of independent and identically distributed variables. The partition algorithm has two stages. In the first stage, the optimal partition is found, that is, the one that maximises the overall product of the marginal likelihoods calculated for each segment. To prevent segmentation into short segments, a border insertion penalty may be introduced. In the second stage, segments with close compositions are merged. This filtration is performed with the help of the partition function calculated for all possible subsets of the boundaries that belong to the optimal partition. Long sequences can be segmented by dividing them into parts and segmenting those parts separately. The contextual effects of repeats, genes and other genomic elements are readily visualised.
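The first stage described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a symmetric Dirichlet prior over the four nucleotide frequencies (with `alpha=1.0`, a uniform prior), so that the marginal likelihood of a segment has a closed form, and it assumes the border insertion penalty is subtracted in log space once per border. The names `log_marginal`, `optimal_partition`, `penalty` and `alpha` are hypothetical and chosen for illustration only.

```python
from math import lgamma

ALPHABET = "ACGT"


def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of an i.i.d. segment with the given letter counts.

    Assumption: letter frequencies have a symmetric Dirichlet(alpha) prior,
    which gives the Dirichlet-multinomial closed form used here.
    """
    k = len(counts)
    n = sum(counts)
    out = lgamma(k * alpha) - lgamma(n + k * alpha)
    for c in counts:
        out += lgamma(c + alpha) - lgamma(alpha)
    return out


def optimal_partition(seq, penalty=0.0, alpha=1.0):
    """Return (borders, score) maximising the sum of segment log marginal
    likelihoods minus `penalty` for each inserted border.

    Borders are positions b with 0 < b < len(seq); a border at b separates
    seq[:b] from seq[b:]. Simple O(n^2) dynamic programming.
    """
    n = len(seq)
    # prefix[i][k] = occurrences of ALPHABET[k] in seq[:i]
    prefix = [[0] * len(ALPHABET)]
    for ch in seq:
        row = prefix[-1][:]
        row[ALPHABET.index(ch)] += 1
        prefix.append(row)

    def seg_score(i, j):
        counts = [b - a for a, b in zip(prefix[i], prefix[j])]
        return log_marginal(counts, alpha)

    # best[j] = best score of any partition of seq[:j]
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            # the last segment is seq[i:j]; a border sits at i whenever i > 0
            score = best[i] + seg_score(i, j) - (penalty if i > 0 else 0.0)
            if score > best[j]:
                best[j], back[j] = score, i

    # walk the back-pointers to recover the border positions
    borders = []
    j = n
    while j > 0:
        j = back[j]
        if j > 0:
            borders.append(j)
    return sorted(borders), best[n]


if __name__ == "__main__":
    borders, score = optimal_partition("A" * 30 + "C" * 30)
    print(borders, round(score, 3))
```

Under these assumptions, the marginal likelihood itself already disfavours splitting a homogeneous stretch, and raising `penalty` suppresses borders further. The second stage (merging segments via the partition function over subsets of boundaries) is not covered by this sketch.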
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Makeev, V., Ramensky, V., Gelfand, M., Roytberg, M., Tumanyan, V. (2001). Bayesian Approach to DNA Segmentation into Regions with Different Average Nucleotide Composition. In: Gascuel, O., Sagot, MF. (eds) Computational Biology. JOBIM 2000. Lecture Notes in Computer Science, vol 2066. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45727-5_6
DOI: https://doi.org/10.1007/3-540-45727-5_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42242-6
Online ISBN: 978-3-540-45727-5
eBook Packages: Springer Book Archive