Loading [a11y]/accessibility-menu.js
Probabilistic estimation of overlap graphs for large sequence datasets | IEEE Conference Publication | IEEE Xplore

Probabilistic estimation of overlap graphs for large sequence datasets


Abstract:

Sequence overlap graphs, constructed based on suffix-prefix relationships between pairs of sequences, are an important data structure in computational biology. High throu...Show More

Abstract:

Sequence overlap graphs, constructed based on suffix-prefix relationships between pairs of sequences, are an important data structure in computational biology. High throughput sequencers can read several million to a few billion DNA fragments in a single experiment, making the construction of overlap graphs for such datasets compute-intensive. In this paper, we present a Locality-Sensitive Hashing based parallel heuristic algorithm to construct overlap graphs for large genomic datasets. With reasonable assumptions on the characteristics of input sequences, we establish probabilistic bounds on the quality of the overlap graphs so produced. We demonstrate the validity and efficiency of our approach by comparing against true overlap graphs using datasets derived from small (E. coli) and large (H. sapiens) genomes.
Date of Conference: 13-16 November 2017
Date Added to IEEE Xplore: 18 December 2017
ISBN Information:
Conference Location: Kansas City, MO, USA

References

References is not available for this document.