Distributed String Mining for High-Throughput Sequencing Data

Välimäki, Niko; Puglisi, Simon J.

doi:10.1007/978-3-642-33122-0_35

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7534)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

The goal of frequency-constrained string mining is to extract substrings that discriminate two (or more) datasets. Known solutions to the problem range from an optimal-time algorithm to various time–space tradeoffs. However, all of the existing algorithms are designed to run sequentially and require the whole input to fit in main memory. Due to these limitations, the existing algorithms are practical only up to a few gigabytes of input. We introduce a distributed algorithm that has a novel time–space tradeoff and, in practice, achieves a significant reduction in both memory and time compared to state-of-the-art methods. To demonstrate the feasibility of the new algorithm, our study includes comprehensive tests on large-scale metagenomics data. We also study the cost of renting the required infrastructure from a provider such as Amazon EC2. Our distributed algorithm is shown to be practical on terabyte-scale inputs and affordable on rented infrastructure.
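
To make the problem statement concrete, the following is a minimal brute-force sketch of frequency-constrained mining on two small datasets. It is not the distributed algorithm of the paper: the function names, the `max_len` cap, and the thresholds `min_pos` and `max_neg` are assumptions chosen for illustration, and the frequency of a substring is taken here to be the fraction of strings in a dataset that contain it.

```python
from collections import Counter


def substring_support(dataset, max_len):
    """Count, for every substring of length <= max_len, how many strings contain it."""
    support = Counter()
    for s in dataset:
        seen = set()  # each string contributes at most once per substring
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                seen.add(s[i:j])
        support.update(seen)
    return support


def mine_discriminative(positive, negative, min_pos, max_neg, max_len=8):
    """Return substrings frequent in `positive` and rare in `negative`.

    Hypothetical thresholds: a substring qualifies if it occurs in at least
    a fraction `min_pos` of the positive strings and in at most a fraction
    `max_neg` of the negative strings.
    """
    pos = substring_support(positive, max_len)
    neg = substring_support(negative, max_len)
    return sorted(
        sub
        for sub, count in pos.items()
        if count / len(positive) >= min_pos
        and neg.get(sub, 0) / len(negative) <= max_neg
    )


if __name__ == "__main__":
    positive = ["ACGT", "ACGA", "TACG"]
    negative = ["GTTT", "CGTA", "TTAC"]
    print(mine_discriminative(positive, negative, min_pos=1.0, max_neg=0.0))
```

This enumeration is quadratic in the length of each string and holds every candidate substring in memory, which is the kind of cost that makes a naive approach unsuitable for the gigabyte-to-terabyte inputs the abstract refers to.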

Funded by the Academy of Finland grant 118653 (ALGODAN), and Helsinki Doctoral Programme in Computer Science (HECSE).

Author information

Authors and Affiliations

Helsinki Institute for Information Technology, Finland
Niko Välimäki
Department of Computer Science, University of Helsinki, Finland
Niko Välimäki & Simon J. Puglisi

Editor information

Editors and Affiliations

Department of Computer Science, Brown University, P.O. Box 1910, 02912, Providence, RI, USA
Ben Raphael
Department of Computer Science and Engineering, University of South Carolina, 301 Main Street, 29208, Columbia, SC, USA
Jijun Tang

About this paper

Cite this paper

Välimäki, N., Puglisi, S.J. (2012). Distributed String Mining for High-Throughput Sequencing Data. In: Raphael, B., Tang, J. (eds) Algorithms in Bioinformatics. WABI 2012. Lecture Notes in Computer Science, vol 7534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33122-0_35

DOI: https://doi.org/10.1007/978-3-642-33122-0_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33121-3
Online ISBN: 978-3-642-33122-0
eBook Packages: Computer Science, Computer Science (R0)
