ABSTRACT
The rising throughput of biological sequencing technologies has created a need for compressed yet accessible sequence formats. Recently, the Multi-String Burrows-Wheeler Transform (MSBWT) has become increasingly common as a method for transforming sequence data to improve compression while still providing access to the reads through an auxiliary FM-index. Although many algorithms exist for building the MSBWT of a collection of strings, they do not scale well as the length of those strings increases.
We propose a new method for constructing the MSBWT of a collection of strings that builds on previous work on merging two or more MSBWTs. For a collection of m strings composed of N symbols, where LCPavg is the average longest common prefix of all suffixes, the method requires O(N * LCPavg * log(m)) time and O(N) bits of memory. We evaluate the speed of the algorithm on multiple datasets that vary in both the number of strings and the string length.
Availability: https://code.google.com/p/msbwt/source/browse/MUSCython/MultimergeCython.pyx
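
The idea of building an MSBWT by merging can be illustrated with a small sketch. The code below is not the linked Cython implementation; it is a simplified pairwise variant written for illustration, and the function names (naive_msbwt, merge_msbwts, build_by_merging) are hypothetical. It assumes reads over an alphabet such as A, C, G, T, with '$' as a terminator that sorts before every other symbol, and it uses a rotation-sorting construction only as a reference for single strings and for checking results.

```python
from collections import Counter


def naive_msbwt(strings):
    """Reference MSBWT: sort every suffix of every '$'-terminated string.

    Ties between equal suffixes are broken by the whole string, which
    mimics each string wrapping around onto itself after its '$'.
    Assumes no input symbol sorts before '$'.
    """
    entries = []
    for s in strings:
        t = s + "$"
        for i in range(len(t)):
            # (suffix, owning string, symbol preceding the suffix cyclically)
            entries.append((t[i:], t, t[i - 1]))
    entries.sort()
    return "".join(prev for _, _, prev in entries)


def merge_msbwts(bwt0, bwt1):
    """Merge two MSBWTs by iteratively refining an interleave bit vector."""
    total = Counter(bwt0) + Counter(bwt1)
    starts, offset = {}, 0
    for c in sorted(total):          # first row of each symbol's bucket
        starts[c] = offset
        offset += total[c]

    sources = (bwt0, bwt1)
    # Start with every row of bwt0 ahead of every row of bwt1.
    interleave = [0] * len(bwt0) + [1] * len(bwt1)
    while True:
        refined = [0] * len(interleave)
        cursor = dict(starts)
        pos = [0, 0]
        for bit in interleave:
            c = sources[bit][pos[bit]]
            pos[bit] += 1
            # Stable bucket sort of the source bits by their BWT symbol.
            refined[cursor[c]] = bit
            cursor[c] += 1
        if refined == interleave:    # fixed point: interleave is final
            break
        interleave = refined

    pos = [0, 0]
    merged = []
    for bit in interleave:
        merged.append(sources[bit][pos[bit]])
        pos[bit] += 1
    return "".join(merged)


def build_by_merging(strings):
    """Build an MSBWT with about log2(m) rounds of pairwise merges."""
    bwts = [naive_msbwt([s]) for s in strings]
    while len(bwts) > 1:
        paired = [merge_msbwts(bwts[i], bwts[i + 1])
                  for i in range(0, len(bwts) - 1, 2)]
        if len(bwts) % 2:            # odd one out moves to the next round
            paired.append(bwts[-1])
        bwts = paired
    return bwts[0] if bwts else ""


if __name__ == "__main__":
    reads = ["ACGT", "ACG", "TTAC", "GATTACA"]
    print(build_by_merging(reads) == naive_msbwt(reads))
```

Each pass of the merge loop orders the rows by one more leading symbol, so the number of passes grows with the length of the prefixes shared between the two inputs. The pairwise rounds in build_by_merging give the logarithmic number of merge levels, which is one informal way to read the LCPavg and log(m) factors in the stated bound. This sketch stores the interleave as a Python list and makes no attempt to match the O(N)-bit memory figure.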
Index Terms
- Constructing burrows-wheeler transforms of large string collections via merging