Banishing bias from consensus sequences

Ben-Dor, Amir; Lancia, Giuseppe; Ravi, R.; Perone, Jennifer

doi:10.1007/3-540-63220-4_63

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1264)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

With the exploding size of genome databases, it is becoming increasingly important to devise search procedures that extract relevant information from them. One such procedure is particularly effective in finding new, distant members of a given family of related sequences: start with a multiple alignment of the given members of the family and use an integral or fractional consensus sequence derived from the alignment to further probe the database. However, the multiple alignment constructed to begin with may be biased due to skew in the sample of sequences used to construct it.
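As a minimal illustration of the two kinds of consensus mentioned above, the following Python sketch derives a fractional consensus (a profile of per-column symbol frequencies) and an integral consensus (the majority symbol per column) from a toy gap-free alignment. The function names, the toy alignment, and the identity-based scoring are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import Counter


def column_profile(alignment):
    """Fractional consensus: per-column symbol frequencies (unweighted)."""
    n = len(alignment)
    return [
        {sym: count / n for sym, count in Counter(col).items()}
        for col in zip(*alignment)
    ]


def majority_consensus(alignment):
    """Integral consensus: the most common symbol in each column."""
    return "".join(
        Counter(col).most_common(1)[0][0] for col in zip(*alignment)
    )


def profile_score(profile, seq):
    """Identity score: sum of the profile frequency of each symbol of seq."""
    return sum(col.get(sym, 0.0) for col, sym in zip(profile, seq))


# Hypothetical toy alignment (equal-length, gap-free sequences).
alignment = ["ACGT", "ACGA", "ACGA", "TCGA"]

profile = column_profile(alignment)
consensus = majority_consensus(alignment)
print(consensus, profile_score(profile, consensus))
```

Either object could then serve as the query when probing a database; a sample dominated by near-identical sequences would pull both of them toward that subgroup.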

We suggest strategies to overcome the problem of bias in building consensus sequences. When the intention is to build a fractional consensus sequence (often termed a profile), we propose assigning weights to the sequences such that the resulting fractional sequence has roughly the same similarity score against each of the sequences in the family. We call such fractional consensus sequences balanced profiles. On the other hand, when only regular sequences can be used in the search, we propose choosing a consensus sequence that minimizes the maximum distance to any sequence in the family, which avoids bias toward over-represented members. Such sequences are NP-hard to compute exactly, so we present an approximation algorithm with a very good performance ratio, based on randomized rounding of an integer programming formulation of the problem. We also mention applications of the rounding method to the selection of probes for disease detection and to the construction of consensus maps.
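The two objectives can be made concrete on a deliberately skewed toy family. The Python sketch below is not the paper's method: `balanced_weights` is a simple multiplicative reweighting heuristic assumed here for illustration, and `minimax_consensus` is an exhaustive search that is only feasible for very short sequences, whereas the paper uses randomized rounding of an integer program. Hamming distance, identity scoring, and all names are likewise assumptions.

```python
from itertools import product
from math import exp


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def scores(seqs, weights):
    """Identity score of each sequence against the weighted profile."""
    profile = []
    for j in range(len(seqs[0])):
        col = {}
        for s, w in zip(seqs, weights):
            col[s[j]] = col.get(s[j], 0.0) + w
        profile.append(col)
    return [sum(profile[j][s[j]] for j in range(len(s))) for s in seqs]


def balanced_weights(seqs, eta=0.1, rounds=500):
    """Heuristic: shift weight toward sequences scoring below average."""
    w = [1.0 / len(seqs)] * len(seqs)
    for _ in range(rounds):
        sc = scores(seqs, w)
        mean = sum(sc) / len(sc)
        w = [wi * exp(-eta * (si - mean)) for wi, si in zip(w, sc)]
        total = sum(w)
        w = [wi / total for wi in w]
    return w


def minimax_consensus(seqs, alphabet="ACGT"):
    """Exhaustive search for a string minimizing the maximum distance."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(seqs[0])))
    return min(candidates, key=lambda c: max(hamming(c, s) for s in seqs))


# Skewed toy family: three identical members and one distant member.
family = ["AAAA", "AAAA", "AAAA", "TTTT"]

weights = balanced_weights(family)
center = minimax_consensus(family)
print(weights, scores(family, weights))
print(center, max(hamming(center, s) for s in family))
```

On this family the majority consensus would be the over-represented sequence itself, at distance 4 from the outlier, while a minimax string sits at distance 2 from every member. The reweighting heuristic carries no convergence guarantee in general; it is meant only to show what "roughly the same similarity score against each sequence" looks like.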

Author information

Authors and Affiliations

Dept. of Computer Science, Technion, 32000, Haifa, Israel
Amir Ben-Dor
GSIA, Carnegie Mellon University, 15213, Pittsburgh, PA
Giuseppe Lancia & R. Ravi
New York University School of Medicine, New York
Jennifer Perone

Editor information

Alberto Apostolico, Jotun Hein

About this paper

Cite this paper

Ben-Dor, A., Lancia, G., Ravi, R., Perone, J. (1997). Banishing bias from consensus sequences. In: Apostolico, A., Hein, J. (eds) Combinatorial Pattern Matching. CPM 1997. Lecture Notes in Computer Science, vol 1264. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-63220-4_63

DOI: https://doi.org/10.1007/3-540-63220-4_63
Published: 08 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-63220-7
Online ISBN: 978-3-540-69214-0
eBook Packages: Springer Book Archive
