Lightweight BWT Construction for Very Large String Collections

Bauer, Markus J.; Cox, Anthony J.; Rosone, Giovanna

doi:10.1007/978-3-642-21458-5_20

Markus J. Bauer¹⁸,
Anthony J. Cox¹⁸ &
Giovanna Rosone¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 6661))

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

1395 Accesses
14 Citations
1 Altmetric

Abstract

A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the BWT in compression and indexing are well known, but the computational demands of creating the BWT of datasets this large have prevented its applications from being widely explored in this context.

We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. The algorithms are lightweight in that the first needs O(m logm) bits of memory to process m strings and the memory requirements of the second are constant with respect to m.

We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. Although our tests were on collections of DNA sequences of uniform length, the algorithms themselves apply to any string collection over any alphabet.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Adjeroh, D., Bell, T., Mukherjee, A.: The Burrows-Wheeler Transform: Data Compression, Suffix Arrays, and Pattern Matching, 1st edn. Springer, Heidelberg (2008)
Book Google Scholar
Bentley, D.R., et al.: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218), 53–59 (2008)
Article Google Scholar
Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. In: López-Ortiz, A. (ed.) LATIN 2010. LNCS, vol. 6034, pp. 697–710. Springer, Heidelberg (2010)
Chapter Google Scholar
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Washington, DC, USA, pages 390. IEEE Computer Society, Los Alamitos (2000)
Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52, 552–581 (2005)
Article MathSciNet MATH Google Scholar
National Center for Biotechnology Information. Sequence Read Archive, http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?
Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: PAT Trees and PAT arrays, pp. 66–82. Prentice-Hall, Inc., Upper Saddle River (1992)
Google Scholar
Hon, W.K., Lam, T.W., Sadakane, K., Sung, W.K., Yiu, S.M.: A space and time efficient algorithm for constructing compressed suffix arrays. Algorithmica 48, 23–36 (2007)
Article MathSciNet MATH Google Scholar
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53, 918–936 (2006)
Article MathSciNet MATH Google Scholar
Kim, D., Sim, J., Park, H., Park, K.: Linear-time construction of suffix arrays. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 186–199. Springer, Heidelberg (2003)
Chapter Google Scholar
Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms 3(2-4), 143–156 (2005)
Article MathSciNet MATH Google Scholar
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the burrows wheeler transform and applications to sequence comparison and data compression. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 178–189. Springer, Heidelberg (2005)
Chapter Google Scholar
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci. 387(3), 298–312 (2007)
Article MathSciNet MATH Google Scholar
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: A new combinatorial approach to sequence comparison. Theory Comput. Syst. 42(3), 411–429 (2008)
Article MathSciNet MATH Google Scholar
Metzker, M.L.: Sequencing technologies – the next generation. Nature Reviews Genetics 11(1), 31–46 (2009)
Article Google Scholar
Nong, G., Zhang, S., Chan, W.H.: Linear time suffix array construction using d-critical substrings. In: Kucherov, G., Ukkonen, E. (eds.) CPM 2009 Lille. LNCS, vol. 5577, pp. 54–67. Springer, Heidelberg (2009)
Chapter Google Scholar
Puglisi, S.J., Smyth, W.F., Turpin, A.H.: A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39 (July 2007)
Google Scholar
Walenz, B.P., Lippert, R.A., Mobarry, C.M.: A Space-Efficient Construction of the Burrows-Wheeler Transform for Genomic Data. Journal of Computational Biology 12(7), 943–951 (2005)
Article Google Scholar
Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26(12), i367–i373 (2010)
Article Google Scholar
Sirén, J.: Compressed suffix arrays for massive data. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 63–74. Springer, Heidelberg (2009)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Illumina Cambridge Ltd., United Kingdom
Markus J. Bauer & Anthony J. Cox
Dipartimento di Matematica e Informatica, University of Palermo, Via Archirafi 34, 90123, Palermo, Italy
Giovanna Rosone

Authors

Markus J. Bauer
View author publications
You can also search for this author in PubMed Google Scholar
Anthony J. Cox
View author publications
You can also search for this author in PubMed Google Scholar
Giovanna Rosone
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Mathematics, Università degli Studi di Palermo, Via Archirafi 34, 90123, Palermo, Italy
Raffaele Giancarlo
Department of Computer Science, University of ’Piemonte Orientale’, Viale T. Michel 11, 15121, Alessandria, Italy
Giovanni Manzini

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bauer, M.J., Cox, A.J., Rosone, G. (2011). Lightweight BWT Construction for Very Large String Collections. In: Giancarlo, R., Manzini, G. (eds) Combinatorial Pattern Matching. CPM 2011. Lecture Notes in Computer Science, vol 6661. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21458-5_20

Download citation

DOI: https://doi.org/10.1007/978-3-642-21458-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21457-8
Online ISBN: 978-3-642-21458-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics