An Approximate de Bruijn Graph Approach to Multiple Local Alignment and Motif Discovery in Protein Sequences

Patwardhan, Rupali; Tang, Haixu; Kim, Sun; Dalkilic, Mehmet

doi:10.1007/11960669_14


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4316)

Included in the following conference series:

VLDB Workshop on Data Mining and Bioinformatics


Abstract

Motif discovery is an important problem in protein sequence analysis. Computationally, it can be viewed as an application of the more general multiple local alignment problem, which becomes expensive in computing time when many sequences are aligned. We introduce a new algorithm for multiple local alignment of protein sequences, based on the de Bruijn graph approach first proposed by Zhang and Waterman for aligning DNA sequences. We generalize their approach to protein sequences by building an approximate de Bruijn graph that allows similar but not identical amino acids to be glued together. We implement this algorithm and test it on motif discovery in 100 sets of protein sequences. The results show that our method achieves results comparable to those of other popular motif discovery programs, while offering advantages in speed.
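The gluing idea can be illustrated with a minimal sketch. The abstract does not state the actual similarity criterion or graph construction, so everything below is an assumption made for illustration: the residue grouping in GROUPS is hypothetical, and the function names reduce_kmer, build_approx_graph and shared_nodes are invented here. Two k-mers are glued into one node when they agree position by position at the level of residue classes, and nodes visited by several sequences are reported as candidate motif seeds.

```python
from collections import defaultdict

# Hypothetical similarity classes (an assumption for illustration, not taken
# from the paper): residues in the same group are treated as "similar".
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
CLASS_OF = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def reduce_kmer(kmer):
    """Map a k-mer to its class string; similar k-mers share one key.

    Residues outside GROUPS all map to "X", so they are glued to each other.
    """
    return "".join(CLASS_OF.get(aa, "X") for aa in kmer)


def build_approx_graph(seqs, k):
    """Glue similar k-mers into shared nodes and link consecutive k-mers.

    Returns (nodes, edges): nodes maps a reduced k-mer to the set of
    (sequence index, position) occurrences; edges maps (node, next node)
    to the number of times that transition is observed.
    """
    nodes = defaultdict(set)
    edges = defaultdict(int)
    for si, seq in enumerate(seqs):
        prev = None
        for pos in range(len(seq) - k + 1):
            key = reduce_kmer(seq[pos:pos + k])
            nodes[key].add((si, pos))
            if prev is not None:
                edges[(prev, key)] += 1
            prev = key
    return nodes, edges


def shared_nodes(nodes, min_seqs):
    """Keep nodes visited by at least min_seqs distinct sequences."""
    return {key: occ for key, occ in nodes.items()
            if len({si for si, _ in occ}) >= min_seqs}


if __name__ == "__main__":
    nodes, edges = build_approx_graph(["MKVLDE", "AAILDD", "GGSTKR"], k=3)
    for key, occ in sorted(shared_nodes(nodes, min_seqs=2).items()):
        print(key, sorted(occ))
```

In this toy input, VLD and ILD differ in one residue but fall into the same classes, so they share a node. A fixed grouping is only the simplest way to make gluing approximate; a criterion based on substitution scores would need a different node-merging step.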


References

Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28–36. AAAI Press, Menlo Park (1994)
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., Wootton, J.: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214 (1993)
Henikoff, S., Henikoff, J.G., Alford, W.J., Pietrokovski, S.: Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene 163, GC17–GC26 (1995)
Zhang, Y., Waterman, M.S.: An Eulerian path approach to local multiple alignment for DNA sequences. PNAS 102, 1285–1290 (2005)
Zhang, Y., Waterman, M.S.: An Eulerian path approach to global multiple alignment for DNA sequences. Journal of Computational Biology 10, 803–819 (2003)
Dayhoff, M., Schwartz, R., Orcutt, B.: A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, vol. 5(3), pp. 345–352 (1978)
Henikoff, S., Henikoff, J.: Amino acid substitution matrices from protein blocks. PNAS 89, 10915–10919 (1992)
Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C., Hofmann, K., Bairoch, A.: The PROSITE database, its status in 2002. Nucleic Acids Res. 30, 235–238 (2002)
Jonassen, I.: Efficient discovery of conserved patterns using a pattern graph. CABIOS 13, 509–522 (1997)
van Lint, J., Wilson, R.: A Course in Combinatorics, 2nd edn. Cambridge University Press, Cambridge (2001)
Myers, E.W., Miller, W.: Optimal alignments in linear space. CABIOS 4, 11–17 (1988)
Smith, T., Waterman, M.: Identification of common molecular subsequences. Journal of Molecular Biology 147, 195–197 (1981)
Hart, R., Royyuru, A., Stolovitzky, G., Califano, A.: Systematic and fully automated identification of protein sequence patterns. Journal of Computational Biology 7(3-4), 585–600 (2000)


Author information

Authors and Affiliations

Center for Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington, IN, 47405, USA
Rupali Patwardhan, Haixu Tang, Sun Kim & Mehmet Dalkilic
School of Informatics, Indiana University, 901 E. 10th Street, Bloomington, IN, 47408, USA
Haixu Tang, Sun Kim & Mehmet Dalkilic


Editor information

Editors and Affiliations

School of Informatics, Indiana University, 901 E. 10th Street, Bloomington, IN, 47408, USA
Mehmet M. Dalkilic & Sun Kim
EECS Department, Case Western Reserve Univ., 10900 Euclid Ave, Cleveland, OH, 44106, USA
Jiong Yang


About this paper

Cite this paper

Patwardhan, R., Tang, H., Kim, S., Dalkilic, M. (2006). An Approximate de Bruijn Graph Approach to Multiple Local Alignment and Motif Discovery in Protein Sequences. In: Dalkilic, M.M., Kim, S., Yang, J. (eds) Data Mining and Bioinformatics. VDMB 2006. Lecture Notes in Computer Science, vol 4316. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11960669_14


DOI: https://doi.org/10.1007/11960669_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68970-6
Online ISBN: 978-3-540-68971-3
eBook Packages: Computer Science, Computer Science (R0)
