Abstract
Populations of biased, non-random sequences may cause standard alignment algorithms to yield false positives and false negatives. A standard significance test based on shuffling the sequences is a partial solution, applicable only to populations that can be described by simple models. Masking out intervals of low information content throws information away. We describe a new and general method, modelling-alignment: population models are incorporated into the alignment process, which can (and should) change the rank order of matches between a query sequence and a collection of sequences, compared to the results from standard algorithms. The method places very few conditions on the nature of the models that can be used with it. We apply modelling-alignment to local alignment, global alignment, optimal alignment, and the relatedness problem.
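The shuffling-based significance test mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' modelling-alignment method; it only shows the baseline the abstract calls a partial solution. The function names, the scoring parameters (match +1, mismatch -1, gap -1), and the number of trials are assumptions chosen for illustration, and the plain shuffle used here preserves only single-character composition, which corresponds to the simplest kind of population model.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch style, linear gap cost).

    The scoring parameters are illustrative assumptions, not values
    taken from the paper.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_p_value(a, b, trials=200, seed=0):
    """Empirical p-value of the score of (a, b) against shuffles of b.

    Shuffling keeps the character composition of b but destroys its
    order, so it only accounts for composition bias.
    """
    rng = random.Random(seed)
    observed = global_score(a, b)
    chars = list(b)
    at_least = 0
    for _ in range(trials):
        rng.shuffle(chars)
        if global_score(a, "".join(chars)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (trials + 1)


if __name__ == "__main__":
    print(shuffle_p_value("AAAAAAAA", "AAAAAAAA"))
    print(shuffle_p_value("ACGTTGCAAGCTTACG", "ACGTTGCAAGCTTACG"))
```

For the homopolymer pair, every shuffle reproduces the same string, so the perfect score of 8 is matched in every trial and the empirical p-value is 1.0: the match carries no evidence of relatedness. The same perfect match between two compositionally diverse sequences is almost never reproduced by a shuffle. Sequences whose bias lies in their order rather than their composition (repeats, for example) are not handled by this simple shuffle, which is the limitation the abstract points to.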
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Powell, D.R., Allison, L., Dix, T.I. (2004). Modelling-Alignment for Non-random Sequences. In: Webb, G.I., Yu, X. (eds) AI 2004: Advances in Artificial Intelligence. AI 2004. Lecture Notes in Computer Science, vol 3339. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30549-1_19
DOI: https://doi.org/10.1007/978-3-540-30549-1_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24059-4
Online ISBN: 978-3-540-30549-1
eBook Packages: Computer Science, Computer Science (R0)