DOI: 10.1145/2649387.2649431
short-paper

Constructing Burrows-Wheeler transforms of large string collections via merging

Published: 20 September 2014

Abstract

The throughput of biological sequencing technologies has led to the necessity for compressed and accessible sequencing formats. Recently, the Multi-String Burrows-Wheeler Transform (MSBWT) has risen in prevalence as a method for transforming sequence data to improve compression while providing access to the reads through an auxiliary FM-index. While there are many algorithms for building the MSBWT for a collection of strings, they do not scale well as the length of those strings increases.
We propose a new method for constructing the MSBWT for a collection of strings based on previous work for merging two or more MSBWTs. It requires O(N * LCP_avg * log(m)) time and O(N) bits of memory for a collection of m strings composed of N symbols, where LCP_avg is the average longest common prefix of all suffixes. We evaluate the speed of the algorithm on multiple datasets that vary in both quantity of strings and string length.
Availability: https://code.google.com/p/msbwt/source/browse/MUSCython/MultimergeCython.pyx
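
To make the idea of building an MSBWT by merging more concrete, the following is a minimal Python sketch of a two-way merge that repeatedly refines an interleave vector until it stops changing. It is an illustration only, not the authors' Cython implementation: the function names `naive_msbwt` and `merge_msbwt` are hypothetical, the rotation-sorting construction is a stand-in suitable only for tiny inputs, and the sketch assumes equal-length strings over an alphabet in which '$' sorts first.

```python
from collections import Counter
from itertools import accumulate


def naive_msbwt(strings):
    """Reference construction for tiny inputs (illustrative helper).

    Appends '$' to every string, sorts all rotations, and returns the
    last column. Assumes all strings have the same length.
    """
    rotations = []
    for s in strings:
        t = s + "$"
        rotations.extend(t[i:] + t[:i] for i in range(len(t)))
    rotations.sort()
    return "".join(r[-1] for r in rotations)


def merge_msbwt(bwt0, bwt1):
    """Merge two MSBWTs by iterating an interleave vector to a fixed point."""
    bwts = (bwt0, bwt1)
    counts = Counter(bwt0) + Counter(bwt1)
    alphabet = sorted(counts)
    # starts[c] = first row of the merged order whose rotation begins with c
    prefix_sums = accumulate([0] + [counts[c] for c in alphabet[:-1]])
    starts = dict(zip(alphabet, prefix_sums))

    # Initial guess: every row of bwt0 precedes every row of bwt1.
    interleave = [0] * len(bwt0) + [1] * len(bwt1)
    while True:
        nxt = [0] * len(interleave)
        offsets = dict(starts)
        pos = [0, 0]
        for src in interleave:
            c = bwts[src][pos[src]]   # symbol preceding the current row
            pos[src] += 1
            nxt[offsets[c]] = src     # row is placed in the bucket for c
            offsets[c] += 1
        if nxt == interleave:         # no change: rows are fully ordered
            break
        interleave = nxt

    # Read the merged BWT by following the final interleave.
    pos = [0, 0]
    merged = []
    for src in interleave:
        merged.append(bwts[src][pos[src]])
        pos[src] += 1
    return "".join(merged)


reads_a = ["ACGT", "ACCA"]
reads_b = ["GATT", "ACGT"]
merged = merge_msbwt(naive_msbwt(reads_a), naive_msbwt(reads_b))
assert merged == naive_msbwt(reads_a + reads_b)
```

Each pass orders the rows by one more leading symbol, so the number of passes grows with the length of the prefixes shared between suffixes, which is consistent with the LCP-dependent term in the stated running time. The interleave needs one entry per symbol, stored here as a Python list for readability rather than as a packed bit vector.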

        Information & Contributors

        Information

        Published In

        BCB '14: Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
        September 2014
        851 pages
        ISBN:9781450328944
        DOI:10.1145/2649387
        General Chairs: Pierre Baldi, Wei Wang

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Burrows-Wheeler transform
        2. multi-string Burrows-Wheeler transform
        3. next generation sequencing
        4. text indices

        Qualifiers

        • Short-paper

        Conference

        BCB '14
        Sponsor:
        BCB '14: ACM-BCB '14
        September 20 - 23, 2014
        Newport Beach, California

        Acceptance Rates

        Overall Acceptance Rate 254 of 885 submissions, 29%

        Bibliometrics & Citations

        Article Metrics

        • Downloads (last 12 months): 10
        • Downloads (last 6 weeks): 1
        Reflects downloads up to 17 Jan 2025

        Cited By

        • (2021) Frontier. Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 1-10. DOI: 10.1145/3459930.3469545. Online publication date: 1-Aug-2021
        • (2021) A k-mer query tool for assessing population diversity in pangenomes. Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 1-9. DOI: 10.1145/3459930.3469537. Online publication date: 1-Aug-2021
        • (2021) Space Efficient Merging of de Bruijn Graphs and Wheeler Graphs. Algorithmica 84:3, pp. 639-669. DOI: 10.1007/s00453-021-00855-2. Online publication date: 16-Jul-2021
        • (2020) Space-efficient construction of compressed suffix trees. Theoretical Computer Science. DOI: 10.1016/j.tcs.2020.11.024. Online publication date: Nov-2020
        • (2019) External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology 14:1. DOI: 10.1186/s13015-019-0140-0. Online publication date: 8-Mar-2019
        • (2019) ELITE. Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 183-189. DOI: 10.1145/3307339.3342182. Online publication date: 4-Sep-2019
        • (2019) Lightweight merging of compressed indices based on BWT variants. Theoretical Computer Science. DOI: 10.1016/j.tcs.2019.11.001. Online publication date: Nov-2019
        • (2019) Space-Efficient Merging of Succinct de Bruijn Graphs. String Processing and Information Retrieval, pp. 337-351. DOI: 10.1007/978-3-030-32686-9_24. Online publication date: 3-Oct-2019
        • (2017) Lightweight BWT and LCP Merging via the Gap Algorithm. String Processing and Information Retrieval, pp. 176-190. DOI: 10.1007/978-3-319-67428-5_15. Online publication date: 6-Sep-2017
        • (2016) XBWT Tricks. String Processing and Information Retrieval, pp. 80-92. DOI: 10.1007/978-3-319-46049-9_8. Online publication date: 21-Sep-2016
