Protein Sequence Motif Discovery on Distributed Supercomputer

Challa, Santan; Thulasiraman, Parimala

doi:10.1007/978-3-540-68083-3_24

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5036)

Included in the following conference series:

International Conference on Grid and Pervasive Computing

Abstract

The motif discovery problem has gained a lot of significance in biological science over the past decade. Recently, various approaches have been used successfully to discover motifs, some based on probabilistic methods and others on combinatorial methods. We follow a graph-based approach to this problem, in particular one built on de Bruijn graphs. The de Bruijn graph has been successfully adopted in the past to solve problems such as local multiple alignment and DNA fragment assembly. The proposed algorithm harnesses the power of the de Bruijn graph to discover conserved regions, such as motifs, in a protein sequence. The sequential algorithm achieves a 70% match with the motifs found by MEME and a 65% pattern match with the Gibbs motif sampler. The motif discovery problem is data intensive, requires substantial computational resources, and cannot be solved on a single system. In this paper, we use the distributed supercomputers available on the Western Canada Research Grid (WestGrid) to implement the distributed graph-based approach to the motif discovery problem and analyze its performance. We use the available resources efficiently to distribute data among the multicore nodes of the machine and redesign the algorithm to suit the architecture. We show that a pure distributed implementation is not efficient for this problem. We therefore develop a hybrid algorithm that uses fine-grained parallelism within the nodes and coarse-grained parallelism across the nodes. Experiments show that this hybrid algorithm runs 3 times faster than the pure distributed-memory implementation.
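
To illustrate the general idea behind a de Bruijn graph view of conserved regions, the following is a minimal sketch and not the authors' algorithm, which this page does not describe in detail. It assumes that every k-mer of a sequence becomes an edge between its (k-1)-length prefix and suffix, and that an edge shared by at least min_support input sequences marks a candidate conserved region. The function names, the parameters k and min_support, and the toy sequences are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Build a de Bruijn graph over the k-mers of all sequences.

    Each k-mer contributes one edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix. The returned dict maps every edge to the set of
    sequence indices in which that k-mer occurs.
    """
    edges = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(idx)
    return edges


def conserved_kmers(sequences, k, min_support):
    """Return, in sorted order, the k-mers whose edge is present in at
    least min_support distinct sequences (an assumed conservation
    criterion chosen for this illustration)."""
    edges = build_de_bruijn(sequences, k)
    return sorted(
        prefix + suffix[-1]  # rebuild the k-mer from its edge
        for (prefix, suffix), support in edges.items()
        if len(support) >= min_support
    )


if __name__ == "__main__":
    # Made-up toy sequences sharing the fragment "FGCAG".
    toy = ["MKFGCAG", "TTFGCAGL", "FGCAGPP"]
    print(conserved_kmers(toy, k=4, min_support=3))  # ['FGCA', 'GCAG']
```

The two reported k-mers overlap by three residues, so following the shared edges in order spells out the common fragment FGCAG. Because each sequence contributes its edges independently, graph construction of this kind can be split across sequences, which is one reason such approaches lend themselves to distributed and multicore execution.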

Author information

Authors and Affiliations

Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada, R3T 2N2
Santan Challa & Parimala Thulasiraman

Editor information

Song Wu, Laurence T. Yang, Tony Li Xu

About this paper

Cite this paper

Challa, S., Thulasiraman, P. (2008). Protein Sequence Motif Discovery on Distributed Supercomputer. In: Wu, S., Yang, L.T., Xu, T.L. (eds) Advances in Grid and Pervasive Computing. GPC 2008. Lecture Notes in Computer Science, vol 5036. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68083-3_24

DOI: https://doi.org/10.1007/978-3-540-68083-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68081-9
Online ISBN: 978-3-540-68083-3
eBook Packages: Computer Science, Computer Science (R0)
