ABSTRACT
Profile Hidden Markov models (HMMs) are a powerful approach to describing biologically significant functional units, or motifs, in protein sequences. Entire databases of such models are regularly compared to large collections of proteins to recognize motifs in them. Exponentially increasing rates of genome sequencing have caused both protein and model databases to explode in size, placing an ever-increasing computational burden on users of these systems.

Here, we describe an accelerated search system that exploits parallelism in a number of ways. First, the application is functionally decomposed into a pipeline, with distinct compute resources executing each pipeline stage. Second, the first pipeline stage is deployed on a systolic array, which yields significant fine-grained parallelism. Third, for some instantiations of the design, parallel copies of the first pipeline stage are used, further increasing the level of coarse-grained parallelism.

A naïve parallelization of the first stage computation has serious repercussions for the sensitivity of the search. We present a pair of remedies to this dilemma and quantify the regions of interest within which each approach is most effective. Analytic performance models are used to assess the overall speedup that can be attained relative to a single-processor software solution. Performance improvements of 1 to 2 orders of magnitude are predicted.
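To make the idea of an analytic performance model for a pipelined design concrete, the following is a minimal sketch, not the model used in the paper. It assumes that sustained pipeline throughput is bounded by the slowest stage and that replicating a stage scales its rate linearly. The function names (`pipeline_throughput`, `speedup`) and all numeric rates are hypothetical and chosen only for illustration.

```python
def pipeline_throughput(stage_rates, copies=None):
    """Sustained throughput of a pipeline, assuming it is limited by the
    slowest stage and that n parallel copies of a stage give n times its rate.

    stage_rates: per-copy processing rate of each stage (arbitrary units/sec).
    copies: number of parallel copies of each stage (defaults to 1 each).
    """
    if copies is None:
        copies = [1] * len(stage_rates)
    if len(copies) != len(stage_rates):
        raise ValueError("copies and stage_rates must have the same length")
    return min(rate * n for rate, n in zip(stage_rates, copies))


def speedup(baseline_rate, stage_rates, copies=None):
    """Speedup of the pipeline relative to a single-processor baseline rate."""
    return pipeline_throughput(stage_rates, copies) / baseline_rate


# Hypothetical numbers, not measurements from the paper:
# the software baseline runs at 1.0 unit/sec, one copy of the first stage
# at 40.0 units/sec, and the second stage at 100.0 units/sec.
baseline = 1.0
rates = [40.0, 100.0]

one_copy = speedup(baseline, rates, copies=[1, 1])     # first stage is the bottleneck
two_copies = speedup(baseline, rates, copies=[2, 1])   # still bounded by the first stage
four_copies = speedup(baseline, rates, copies=[4, 1])  # now bounded by the second stage
print(one_copy, two_copies, four_copies)
```

Under these assumed rates, adding copies of the first stage helps only until the second stage becomes the bottleneck, which is the kind of trade-off such a model is meant to expose.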