Efficient Compression and Indexing for Highly Repetitive DNA Sequence Collections

Abstract:

In this paper, we focus upon the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection $\mathcal{G}$ of $t$ sequences $\mathcal{S}_{i}$ of length $n$ each, we can represent $\mathcal{G}$ succinctly in $2n\mathcal{H}_{k}(\mathcal{T}) + \mathcal{O}(n^{\prime} \log \log n) + o(q n^{\prime}) + o(tn)$ bits using $\mathcal{O}(t n^{2} + q n^{\prime})$ time, where $\mathcal{H}_{k}(\mathcal{T})$ is the $k$th-order empirical entropy of the sequence $\mathcal{T} \in \mathcal{G}$ that is used as the reference sequence, $n^{\prime}$ is the total number of variations between $\mathcal{T}$ and the sequences in $\mathcal{G}$, and $q$ is a small fixed constant. We can restore any length ${len}$ substring $\mathcal{S}[{sp}, \dots, {sp} + {len} - 1]$ of $\mathcal{S} \in \mathcal{G}$ in $\mathcal{O}\bigl(n_{s}^{\prime} + {len}(\log n)^{2} / \log \log n\bigr)$ time and report all positions where $P$ occurs in $\mathcal{G}$ in $\mathcal{O}\bigl(m \cdot t + {occ} \cdot t \cdot (\log n)^{2} / \log \log n\bigr)$ time. In addition, we propose a dynamic programming method to find the variations between $\mathcal{T}$ and the sequences in $\mathcal{G}$ in a space-efficient way, with which we can build succinct structures to enable efficient search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods. The source code is available online.
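The idea of storing one reference sequence plus per-sequence variations, and restoring a substring $\mathcal{S}[{sp}, \dots, {sp} + {len} - 1]$ from them, can be illustrated with a minimal Python sketch. This is not the paper's method: it assumes substitution-only variations between equal-length sequences, finds them by position-wise comparison rather than the dynamic programming approach mentioned above, and uses plain Python lists instead of succinct structures. The names `find_substitutions`, `ToyCollection`, `restore`, and `total_variations` are hypothetical and chosen only for this illustration.

```python
from bisect import bisect_left
from typing import List, Tuple


def find_substitutions(reference: str, sequence: str) -> List[Tuple[int, str]]:
    """Return (position, base) pairs where `sequence` differs from `reference`.

    Assumption: both strings have the same length and differ only by
    substitutions; insertions and deletions are not modelled here.
    """
    if len(reference) != len(sequence):
        raise ValueError("this toy model handles equal-length sequences only")
    return [(i, c) for i, (r, c) in enumerate(zip(reference, sequence)) if r != c]


class ToyCollection:
    """Stores a reference sequence plus a sorted variation list per sequence."""

    def __init__(self, reference: str, sequences: List[str]) -> None:
        self.reference = reference
        # One list of (position, base) pairs per sequence, sorted by position.
        self.variations = [find_substitutions(reference, s) for s in sequences]

    def total_variations(self) -> int:
        """Total number of stored variations (the role played by n' in the abstract)."""
        return sum(len(v) for v in self.variations)

    def restore(self, i: int, sp: int, length: int) -> str:
        """Rebuild the substring of sequence i starting at sp with the given length."""
        out = list(self.reference[sp:sp + length])
        var = self.variations[i]
        # Binary search for the first variation at or after position sp.
        k = bisect_left(var, (sp, ""))
        while k < len(var) and var[k][0] < sp + length:
            pos, base = var[k]
            out[pos - sp] = base  # overwrite the reference base with the variant
            k += 1
        return "".join(out)
```

In this sketch, restoring a substring costs a binary search plus one step per requested character and per variation inside the window, which mirrors only the general shape of the restore bound quoted in the abstract, not its actual terms.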

Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics ( Volume: 18, Issue: 6, 01 Nov.-Dec. 2021)

Page(s): 2394 - 2408

Date of Publication: 22 January 2020

PubMed ID: 31985436

DOI: 10.1109/TCBB.2020.2968323
