Distributed String Mining for High-Throughput Sequencing Data

  • Conference paper
Algorithms in Bioinformatics (WABI 2012)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7534)

Abstract

The goal of frequency-constrained string mining is to extract substrings that discriminate between two (or more) datasets. Known solutions to the problem range from an optimal-time algorithm to different time–space tradeoffs. However, all existing algorithms have been designed to run sequentially and require that the whole input fit in main memory. Due to these limitations, the existing algorithms are practical only up to a few gigabytes of input. We introduce a distributed algorithm that has a novel time–space tradeoff and, in practice, achieves a significant reduction in both memory and time compared to state-of-the-art methods. To demonstrate the feasibility of the new algorithm, our study includes comprehensive tests on large-scale metagenomics data. We also study the cost of renting the required infrastructure from, e.g., Amazon EC2. Our distributed algorithm is shown to be practical on terabyte-scale inputs and affordable on rented infrastructure.
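
To make the problem statement concrete, the following is a minimal brute-force sketch of frequency-constrained string mining on two tiny datasets. It is not the distributed algorithm of the paper, and it makes several assumptions that the abstract does not state: the frequency of a substring is taken to be the number of strings in a dataset that contain it, a substring is reported when that frequency is at least `min_pos` in the first dataset and at most `max_neg` in the second, and substring length is capped at `max_len`. The function and parameter names are hypothetical.

```python
from collections import Counter


def substring_support(dataset, max_len):
    """For every substring of length <= max_len, count how many strings contain it."""
    support = Counter()
    for s in dataset:
        seen = set()  # substrings of this one string, so each string counts once
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                seen.add(s[i:j])
        support.update(seen)
    return support


def mine_discriminative(pos, neg, min_pos, max_neg, max_len=8):
    """Return substrings frequent in `pos` (>= min_pos) and rare in `neg` (<= max_neg).

    Illustrative brute force only: it enumerates all substrings explicitly and
    keeps both support tables in memory, which is exactly what does not scale.
    """
    pos_support = substring_support(pos, max_len)
    neg_support = substring_support(neg, max_len)
    return sorted(
        p for p, count in pos_support.items()
        if count >= min_pos and neg_support[p] <= max_neg  # Counter gives 0 if absent
    )


if __name__ == "__main__":
    positives = ["ACGT", "ACGA"]
    negatives = ["TTGT", "CGTT"]
    print(mine_discriminative(positives, negatives, min_pos=2, max_neg=0))
```

On this toy input the sketch reports the substrings that occur in both positive strings and in neither negative string. Because it materializes every substring, its memory use grows quickly with input size, which illustrates why in-memory sequential approaches become impractical beyond a few gigabytes.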

Funded by the Academy of Finland grant 118653 (ALGODAN), and Helsinki Doctoral Programme in Computer Science (HECSE).

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Välimäki, N., Puglisi, S.J. (2012). Distributed String Mining for High-Throughput Sequencing Data. In: Raphael, B., Tang, J. (eds) Algorithms in Bioinformatics. WABI 2012. Lecture Notes in Computer Science, vol 7534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33122-0_35

  • DOI: https://doi.org/10.1007/978-3-642-33122-0_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33121-3

  • Online ISBN: 978-3-642-33122-0

  • eBook Packages: Computer Science, Computer Science (R0)
