Abstract
The goal of frequency constrained string mining is to extract substrings that discriminate two (or more) datasets. Known solutions to the problem range from an optimal-time algorithm to different time–space tradeoffs. However, all of the existing algorithms have been designed to run sequentially and require that the whole input fit in main memory. Due to these limitations, the existing algorithms are practical only up to a few gigabytes of input. We introduce a distributed algorithm that has a novel time–space tradeoff and, in practice, achieves a significant reduction in both memory and time compared to state-of-the-art methods. To demonstrate the feasibility of the new algorithm, our study includes comprehensive tests on large-scale metagenomics data. We also study the cost of renting the required infrastructure from, e.g., Amazon EC2. Our distributed algorithm is shown to be practical on terabyte-scale inputs and affordable on rented infrastructure.
Funded by the Academy of Finland grant 118653 (ALGODAN), and Helsinki Doctoral Programme in Computer Science (HECSE).
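To make the problem statement concrete, the following is a minimal brute-force sketch of frequency constrained string mining on two toy datasets. It is not the distributed algorithm of the paper. It assumes that the frequency of a substring is the number of strings in a dataset that contain it, and the function names and the thresholds `min_pos` and `max_neg` are hypothetical choices made for illustration. It enumerates every substring explicitly, so it is only usable on tiny inputs.

```python
from collections import Counter


def substring_support(dataset):
    """For every substring, count how many strings of `dataset` contain it."""
    support = Counter()
    for s in dataset:
        # Collect distinct substrings so each string contributes at most once.
        seen = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        support.update(seen)
    return support


def discriminating_substrings(pos, neg, min_pos, max_neg):
    """Substrings found in at least `min_pos` strings of `pos`
    and in at most `max_neg` strings of `neg` (assumed constraint form)."""
    pos_support = substring_support(pos)
    neg_support = substring_support(neg)
    return sorted(
        p for p, count in pos_support.items()
        if count >= min_pos and neg_support.get(p, 0) <= max_neg
    )


# Toy data, invented for this illustration only.
pos = ["ACGT", "ACGA", "TACG"]
neg = ["GGTT", "CGTA"]
result = discriminating_substrings(pos, neg, min_pos=3, max_neg=0)
print(result)  # ['AC', 'ACG']
```

Each string of length n contributes on the order of n² substrings here, which illustrates why inputs beyond toy size call for index-based or distributed methods such as those the abstract discusses.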
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Välimäki, N., Puglisi, S.J. (2012). Distributed String Mining for High-Throughput Sequencing Data. In: Raphael, B., Tang, J. (eds) Algorithms in Bioinformatics. WABI 2012. Lecture Notes in Computer Science, vol 7534. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33122-0_35
DOI: https://doi.org/10.1007/978-3-642-33122-0_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33121-3
Online ISBN: 978-3-642-33122-0
eBook Packages: Computer Science, Computer Science (R0)