Estimating Pairwise Statistical Significance of Protein Local Alignments Using a Clustering-Classification Approach Based on Amino Acid Composition

Agrawal, Ankit; Ghosh, Arka; Huang, Xiaoqiu

doi:10.1007/978-3-540-79450-9_7

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4983)

Included in the following conference series: International Symposium on Bioinformatics Research and Applications

Abstract

A central question in pairwise sequence comparison is how to assess the statistical significance of an alignment. For ungapped alignments scored with a single substitution matrix, the alignment score is known to follow an extreme value distribution whose parameters K and λ can be calculated analytically. No comparable statistical theory is currently available for gapped alignments or for alignments that use multiple scoring matrices, although their scores are known to follow an extreme value distribution closely, and the corresponding parameters can be estimated by simulation. Ideally, the parameters would be estimated by a separate simulation for each sequence pair, which is impractical. In this paper, we present a simple clustering-classification approach based on amino acid composition to estimate K and λ for a given sequence pair and scoring scheme, including schemes that use multiple parameter sets. The resulting values of K and λ vary widely across cluster pairs even under the same scoring scheme, underscoring how heavily K and λ depend on amino acid composition. The proposed approach is an attempt to separate out the influence of amino acid composition when estimating the statistical significance of pairwise protein alignments. Experiments, together with an analysis of other approaches to estimating the statistical parameters, indicate that the methods used in this work estimate statistical significance with good accuracy.
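The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each sequence is assigned to the nearest of a set of precomputed composition centroids (Euclidean distance is an assumption here), that K and λ have already been estimated by simulation for every unordered cluster pair, and that the P-value is then obtained from the standard extreme value form P(S ≥ x) ≈ 1 − exp(−K·m·n·e^(−λx)), where m and n are the sequence lengths. All function names, the toy centroids, and the K and λ values below are hypothetical placeholders, not results from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of seq."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_cluster(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance, an assumption)."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))


def evd_pvalue(score, m, n, K, lam):
    """P(S >= score) = 1 - exp(-K * m * n * exp(-lam * score))."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-K * m * n * math.exp(-lam * score))


def pairwise_pvalue(seq1, seq2, score, centroids, params):
    """Classify both sequences, look up (K, lambda) for the cluster pair,
    and return the P-value of the given alignment score."""
    c1 = nearest_cluster(composition(seq1), centroids)
    c2 = nearest_cluster(composition(seq2), centroids)
    # Assumption: parameters are stored per unordered cluster pair.
    K, lam = params[(min(c1, c2), max(c1, c2))]
    return evd_pvalue(score, len(seq1), len(seq2), K, lam)


# Toy data: two hypothetical centroids and placeholder (K, lambda) values.
CENTROIDS = [
    [0.05] * 20,           # cluster 0: uniform composition
    [0.62] + [0.02] * 19,  # cluster 1: alanine-rich composition
]
PARAMS = {
    (0, 0): (0.04, 0.27),
    (0, 1): (0.03, 0.25),
    (1, 1): (0.02, 0.22),
}

if __name__ == "__main__":
    p = pairwise_pvalue("ACDEFGHIKLMNPQRSTVWY", "AAAAAAAAAC", 40, CENTROIDS, PARAMS)
    print(f"P-value: {p:.3e}")
```

In a real setting the centroids would come from clustering a large set of protein composition vectors, and the parameter table would be filled by simulation once per cluster pair and scoring scheme, so that no simulation is needed when a new sequence pair is compared.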

Author information

Authors and Affiliations

Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, IA 50011-1041, USA
Ankit Agrawal & Xiaoqiu Huang
Department of Statistics, Iowa State University, 303 Snedecor Hall, Ames, IA, 50011-1210, USA
Arka Ghosh

Editor information

Ion Măndoiu, Raj Sunderraman, Alexander Zelikovsky

About this paper

Cite this paper

Agrawal, A., Ghosh, A., Huang, X. (2008). Estimating Pairwise Statistical Significance of Protein Local Alignments Using a Clustering-Classification Approach Based on Amino Acid Composition. In: Măndoiu, I., Sunderraman, R., Zelikovsky, A. (eds) Bioinformatics Research and Applications. ISBRA 2008. Lecture Notes in Computer Science, vol 4983. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-79450-9_7

DOI: https://doi.org/10.1007/978-3-540-79450-9_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-79449-3
Online ISBN: 978-3-540-79450-9
eBook Packages: Computer Science, Computer Science (R0)
