ABSTRACT
Protein sequence alignments become more reliable as the evolutionary distance between the sequences shrinks. Here, we align distantly related proteins using many closely spaced intermediate sequences as stepping stones. Such transitive alignments can be generated between any two proteins in a connected set, whether they are direct or indirect sequence neighbours in the underlying library of pairwise alignments. We have implemented a greedy algorithm, MaxFlow, which uses a novel consistency score to estimate the relative likelihood of alternative paths of transitive alignment. In contrast to traditional profile models of amino acid preferences, MaxFlow models the probability that two positions are structurally equivalent and retains high information content across large distances in sequence space. MaxFlow can therefore identify sparse and narrow active-site sequence signatures that are embedded in high-entropy sequence segments in the structure-based multiple alignment of large, diverse enzyme superfamilies. In a challenging benchmark, MaxFlow yields better reliability and twice the coverage of available sequence alignment software. This promises to increase information returns from functional and structural genomics, where reliable sequence alignment is a bottleneck to transferring the functional or structural characterization of model proteins to entire protein families and superfamilies.
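The stepping-stone idea can be illustrated with a minimal Python sketch. It is not the MaxFlow algorithm and does not implement the consistency score described above: it simply finds one chain of intermediates by breadth-first search and composes the pairwise residue maps along it. The data layout (a dictionary keyed by directed protein pairs, each holding a position-to-position map) and all names (`compose`, `find_path`, `transitive_alignment`, the toy proteins `P`, `Q`, `R`) are assumptions made for illustration.

```python
from collections import deque


def compose(ab, bc):
    """Compose two residue maps: position in A -> position in C via B.

    Positions of A whose partner in B is not aligned to C are dropped.
    """
    return {i: bc[j] for i, j in ab.items() if j in bc}


def find_path(library, start, goal):
    """Breadth-first search for a chain of proteins linked by pairwise alignments.

    `library` is a hypothetical dict keyed by directed pairs (a, b).
    """
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for a, b in library:
            if a == node and b not in previous:
                previous[b] = node
                queue.append(b)
    return None


def transitive_alignment(library, start, goal):
    """Align `start` to `goal` by composing alignments along one path."""
    path = find_path(library, start, goal)
    if path is None:
        return None
    mapping = None
    for a, b in zip(path, path[1:]):
        step = library[(a, b)]
        mapping = step if mapping is None else compose(mapping, step)
    return mapping


# Toy library: P and R are not directly aligned, but both align to Q.
library = {
    ("P", "Q"): {0: 1, 1: 2, 2: 3},
    ("Q", "R"): {1: 0, 2: 1, 4: 5},
}
result = transitive_alignment(library, "P", "R")
print(result)
```

In this toy case, position 2 of `P` is lost because its partner in `Q` has no counterpart in `R`. A single shortest path is used here only for brevity; the abstract describes weighing alternative paths against one another, which this sketch does not attempt.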
Index Terms
- Accurate detection of very sparse sequence motifs