Using Bloom Filters for Large Scale Gene Sequence Analysis in Haskell

  • Conference paper
Practical Aspects of Declarative Languages (PADL 2009)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5418)

Abstract

Analysis of biological data often involves large data sets and computationally expensive algorithms. Databases of biological data continue to grow, leading to an increasing demand for improved algorithms and data structures. Despite having many advantages over more traditional indexing structures, the Bloom filter is almost unused in bioinformatics. Here we present a robust and efficient Bloom filter implementation in Haskell, and implement a simple bioinformatics application for indexing and matching sequence data. We use this to index the chromosomes that make up the human genome, and map all available gene sequences to it. Our experiences with developing and tuning our application suggest that for bioinformatics applications, Haskell offers a compelling combination of rapid development, quality assurance, and high performance.
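
To make the idea of indexing and matching sequence data with a Bloom filter concrete, here is a minimal sketch in Haskell. It is not the implementation presented in the paper: the names (`Bloom`, `hashWith`, `positions`, `kmers`), the toy string hashes, the use of double hashing to derive bit positions, the `Integer`-backed bit vector, and all parameter values are assumptions chosen for brevity, not for performance.

```haskell
module Main where

import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl', tails)

-- | A toy Bloom filter: a bit count and the bits themselves, packed into
-- an Integer (illustrative only; a real filter would use an unboxed array).
data Bloom = Bloom Int Integer

emptyBloom :: Int -> Bloom
emptyBloom m = Bloom m 0

-- | A simple polynomial string hash (hypothetical, not tuned).
hashWith :: Int -> String -> Int
hashWith seed = foldl' (\h c -> (h * seed + ord c) `mod` 1000000007) 7

-- | Derive k bit positions from two base hashes (double hashing).
positions :: Int -> Int -> String -> [Int]
positions k m s = [ (h1 + i * h2) `mod` m | i <- [0 .. k - 1] ]
  where
    h1 = hashWith 31 s
    h2 = hashWith 131 s

-- | Set the k bits belonging to a word.
insert :: Int -> String -> Bloom -> Bloom
insert k s (Bloom m b) = Bloom m (foldl' setBit b (positions k m s))

-- | A word may be present only if all of its k bits are set.
member :: Int -> String -> Bloom -> Bool
member k s (Bloom m b) = all (testBit b) (positions k m s)

-- | All overlapping words of length n in a sequence.
kmers :: Int -> String -> [String]
kmers n s = [ take n t | t <- tails s, length (take n t) == n ]

main :: IO ()
main = do
  let k = 3        -- number of bit positions per word
      n = 8        -- word length
      m = 4096     -- number of bits in the filter
      reference = "ACGTTGCATGCCGATAGCTAGGCTTAACG"
      query     = "GCCGATAGCTAG"
      bf = foldr (insert k) (emptyBloom m) (kmers n reference)
  -- Does every word of the query occur (probably) in the reference?
  print (all (\w -> member k w bf) (kmers n query))
```

Because a Bloom filter never produces false negatives, every word of a query that is a true substring of the indexed sequence tests positive. A word that is absent can still test positive with a small probability that depends on the filter size, the number of bit positions per word, and the number of words inserted.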

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Malde, K., O’Sullivan, B. (2008). Using Bloom Filters for Large Scale Gene Sequence Analysis in Haskell. In: Gill, A., Swift, T. (eds) Practical Aspects of Declarative Languages. PADL 2009. Lecture Notes in Computer Science, vol 5418. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92995-6_13

  • DOI: https://doi.org/10.1007/978-3-540-92995-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-92994-9

  • Online ISBN: 978-3-540-92995-6

  • eBook Packages: Computer Science, Computer Science (R0)
