research-article

Efficient DNA sequence partitioning using probabilistic subsets and hypergraphs

Authors:
Mahdi Belcaid, University of Hawaii at Manoa
Cedric Arisdakessian, University of Hawaii at Manoa
Yuliia Kravchenko, University of Hawaii at Manoa
SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing, March 2021, Pages 4–9. https://doi.org/10.1145/3412841.3441851

Published: 22 April 2021

ABSTRACT

Sequence clustering is an important computational step in numerous bioinformatics applications, such as high-throughput immune system characterization, marker-based diversity profiling, and reduced-complexity population genetic experiments. However, clustering the very large datasets produced by high-throughput sequencers is CPU- and memory-intensive. As a result, analyzing large datasets is challenging and often requires trading clustering sensitivity for speed. To address this limitation, we propose a new probabilistic data partitioning technique that divides large datasets of equal-length sequences into smaller, non-overlapping subsets that can subsequently be clustered using higher-sensitivity algorithms at lower computational cost. In addition to substantially lowering resource requirements, this partitioning step makes DNA sequence clustering straightforward to parallelize and allows each subset to be processed independently using more accurate algorithms. Our results show that our algorithm, which we implemented in a program called SLYMFAST, can cluster in a few hours datasets that would otherwise take weeks to cluster without prior partitioning.
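The abstract does not spell out the partitioning procedure, so the following is only an illustrative sketch of one way a probabilistic partition of equal-length sequences could work, not the SLYMFAST implementation. It assumes that random subsets of positions are sampled, that sequences agreeing on all positions of a subset are treated as members of one hyperedge, and that the connected components of the resulting hypergraph (tracked with a union-find structure) form the non-overlapping subsets. The function names and the parameters `n_subsets`, `subset_size`, and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def find(parent, i):
    # Union-find lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def partition(seqs, n_subsets=8, subset_size=6, seed=0):
    """Split equal-length sequences into non-overlapping groups of indices.

    Assumed scheme (not taken from the paper): two sequences end up in the
    same group if they are linked, directly or transitively, by agreeing
    on every position of at least one randomly sampled position subset.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    rng = random.Random(seed)
    parent = list(range(len(seqs)))
    for _ in range(n_subsets):
        positions = sorted(rng.sample(range(length), min(subset_size, length)))
        # Each bucket plays the role of one hyperedge over sequence indices.
        buckets = defaultdict(list)
        for idx, s in enumerate(seqs):
            buckets["".join(s[p] for p in positions)].append(idx)
        for members in buckets.values():
            root = find(parent, members[0])
            for other in members[1:]:
                parent[find(parent, other)] = root
    # Connected components of the hypergraph are the final partitions.
    groups = defaultdict(list)
    for idx in range(len(seqs)):
        groups[find(parent, idx)].append(idx)
    return sorted(groups.values())


if __name__ == "__main__":
    reads = ["AAAAAAAA", "AAAAAAAA", "CCCCCCCC"]
    print(partition(reads))
```

Under these assumptions, each returned group could be passed to a more sensitive clustering tool independently, for example one worker process per group. Sampling more subsets makes it less likely that similar sequences are separated, at the cost of larger groups.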

Index Terms

Efficient DNA sequence partitioning using probabilistic subsets and hypergraphs
1. Applied computing
  1. Life and medical sciences
    1. Computational biology
      1. Computational genomics
      2. Molecular sequence analysis
2. Computing methodologies
  1. Modeling and simulation
    1. Simulation types and techniques
      1. Data assimilation

Published in
SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
March 2021
2075 pages
ISBN: 9781450381048
DOI: 10.1145/3412841
Conference Chairs:
Chih-Cheng Hung, Kennesaw State University
Jiman Hong, Soongsil University, South Korea
Program Chairs:
Alessio Bechini, University of Pisa, Italy
Eunjee Song, Baylor University
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ACM proceedings
LATEX
text tagging
Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%