Lightweight BWT Construction for Very Large String Collections

  • Conference paper
Combinatorial Pattern Matching (CPM 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6661)

Abstract

A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the BWT in compression and indexing are well known, but the computational demands of creating the BWT of datasets this large have prevented its applications from being widely explored in this context.

We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. The algorithms are lightweight in that the first needs O(m log m) bits of memory to process m strings and the memory requirements of the second are constant with respect to m.

We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. Although our tests were on collections of DNA sequences of uniform length, the algorithms themselves apply to any string collection over any alphabet.
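For readers unfamiliar with the BWT of a string collection, the following is a minimal sketch of the definition only, not of either algorithm from the paper. It assumes the common convention that each string gets its own end marker, that every end marker sorts below all alphabet symbols, and that the marker of an earlier string sorts below that of a later one. The function name `naive_collection_bwt` and the use of `$` for every marker in the output are illustrative choices. The sketch sorts all suffixes in memory, which is exactly the cost that lightweight methods aim to avoid.

```python
def naive_collection_bwt(strings, end_marker="$"):
    """Return the BWT of a string collection by explicitly sorting all suffixes.

    Illustrative only: memory grows with the total length of the input,
    so this is unsuitable for large collections.
    """
    entries = []
    for i, s in enumerate(strings):
        if end_marker in s:
            raise ValueError("end marker must not occur in the input strings")
        t = s + end_marker
        for j in range(len(t)):
            # Sort key for the suffix starting at position j of string i.
            # Alphabet symbols are tagged 1 and the end marker is tagged 0,
            # so the marker ranks below every symbol; markers of different
            # strings are ordered by the string index i.
            key = tuple((1, c) for c in s[j:]) + ((0, i),)
            # Symbol that precedes the suffix; for j == 0 this wraps around
            # to the end marker of the same string.
            preceding = t[j - 1]
            entries.append((key, preceding))
    entries.sort(key=lambda entry: entry[0])
    return "".join(preceding for _, preceding in entries)


if __name__ == "__main__":
    print(naive_collection_bwt(["banana"]))    # annb$aa
    print(naive_collection_bwt(["AB", "AA"]))  # BAA$$A
```

With a single input string the result coincides with the usual BWT of that string followed by an end marker, which gives a quick sanity check for the multi-string case.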

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bauer, M.J., Cox, A.J., Rosone, G. (2011). Lightweight BWT Construction for Very Large String Collections. In: Giancarlo, R., Manzini, G. (eds) Combinatorial Pattern Matching. CPM 2011. Lecture Notes in Computer Science, vol 6661. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21458-5_20

  • DOI: https://doi.org/10.1007/978-3-642-21458-5_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21457-8

  • Online ISBN: 978-3-642-21458-5

  • eBook Packages: Computer Science, Computer Science (R0)
