Abstract
In this paper, we develop a new approach for analyzing DNA sequences in order to detect regions with similar nucleotide composition. Our algorithm, which we call composition alignment or, more whimsically, scrambled alignment, employs the mechanisms of string matching and string comparison yet avoids the overdependence of those methods on position-by-position matching. In composition alignment, we extend the matching concept to composition matching. Two strings have a composition match if their lengths are equal and they have the same nucleotide content.
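The definition of a composition match can be illustrated with a short sketch. The function name `composition_match` is hypothetical and not taken from the paper; the sketch simply checks the two stated conditions, equal length and equal nucleotide content.

```python
from collections import Counter


def composition_match(s: str, t: str) -> bool:
    """Return True if s and t have equal length and the same nucleotide counts.

    Illustrative helper (hypothetical name): it tests the definition given
    in the abstract and nothing more.
    """
    # Equal multisets of characters imply equal length, but the length
    # check is kept explicit to mirror the definition and exit early.
    return len(s) == len(t) and Counter(s) == Counter(t)


if __name__ == "__main__":
    print(composition_match("ACGT", "TGCA"))  # same content, scrambled order
    print(composition_match("AACG", "ACGG"))  # same length, different content
```

Under this definition, an ordinary position-by-position match is a special case of a composition match, since identical strings trivially share the same content.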
We define the composition alignment problem and give a dynamic programming solution. We explore several composition match weighting functions and show that composition alignment with one class of these can be computed in O(nm) time, the same as for standard alignment. We discuss statistical properties of composition alignment scores and demonstrate the ability of the algorithm to detect regions of similar composition in eukaryotic promoter sequences in the absence of detectable similarity through standard alignment.
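One way to picture a dynamic programming formulation is the naive sketch below. It is not the paper's algorithm: it is a local-alignment recurrence, assumed here for illustration, in which a cell may be reached by a gap, a single-character mismatch, or a composition match of any length k ending at that cell. The weighting function `weight`, the penalties `mismatch` and `gap`, and the function name are all assumptions. Because it tries every k at every cell, this sketch runs in O(nm · min(n, m)) time rather than the O(nm) bound the abstract reports for one class of weighting functions.

```python
from collections import defaultdict


def composition_alignment_score(a, b, weight=lambda k: k, mismatch=1, gap=1):
    """Best local score when substrings may align by composition match.

    Naive illustration only. `weight(k)` scores a composition match of
    length k; `mismatch` and `gap` are assumed penalties.
    """
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Standard local-alignment moves: restart, gap, or mismatch.
            score = max(0, S[i - 1][j] - gap, S[i][j - 1] - gap,
                        S[i - 1][j - 1] - mismatch)
            # Extend the candidate substrings backwards one character at a
            # time, tracking count differences between a[i-k:i] and b[j-k:j].
            diff = defaultdict(int)
            nonzero = 0  # number of characters whose counts currently differ
            for k in range(1, min(i, j) + 1):
                for ch, delta in ((a[i - k], 1), (b[j - k], -1)):
                    if diff[ch] == 0:
                        nonzero += 1
                    diff[ch] += delta
                    if diff[ch] == 0:
                        nonzero -= 1
                if nonzero == 0:  # the two length-k substrings composition-match
                    score = max(score, S[i - k][j - k] + weight(k))
            S[i][j] = score
            best = max(best, score)
    return best


if __name__ == "__main__":
    print(composition_alignment_score("ACGT", "TGCA"))
```

With the default additive weight, a scrambled copy of a sequence scores as highly as an exact copy, which is the behavior the term "scrambled alignment" suggests. Other weighting functions can be passed through `weight` to penalize or reward longer composition matches differently.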
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Benson, G. (2003). Composition Alignment. In: Benson, G., Page, R.D.M. (eds) Algorithms in Bioinformatics. WABI 2003. Lecture Notes in Computer Science, vol 2812. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39763-2_32
DOI: https://doi.org/10.1007/978-3-540-39763-2_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20076-5
Online ISBN: 978-3-540-39763-2
eBook Packages: Springer Book Archive