DOI: 10.1145/3412841.3441851
research-article

Efficient DNA sequence partitioning using probabilistic subsets and hypergraphs

Published: 22 April 2021

ABSTRACT

Sequence clustering is an important computational step in numerous bioinformatics applications, such as high-throughput immune system characterization, marker-based diversity profiling, and reduced-complexity population genetic experiments. However, clustering the very large datasets produced by high-throughput sequencers is CPU- and memory-intensive. As a result, analyzing large datasets is challenging and often requires trading clustering sensitivity for speed. To address this limitation, we propose a new probabilistic data partitioning technique that divides large datasets of equal-length sequences into smaller, non-overlapping subsets that can subsequently be clustered using higher-sensitivity algorithms at lower computational cost. In addition to substantially lowering resource requirements, this partitioning step makes it straightforward to parallelize DNA sequence clustering and to process each subset independently using more accurate algorithms. Our results show that our algorithm, which we implemented in a program called SLYMFAST, can cluster in a few hours datasets that would otherwise take weeks to cluster without partitioning first.
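
The abstract does not describe the partitioning algorithm itself, so the following is only a rough sketch of how a probabilistic partitioning of equal-length sequences into non-overlapping subsets could look. It is not the SLYMFAST implementation. It assumes a scheme in which each round samples a random subset of positions, groups sequences that agree on all sampled positions (each group can be read as a hyperedge), and merges the groups with a disjoint-set forest so that the final subsets are the connected components. The function name `partition` and the parameters `n_rounds`, `n_positions`, and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def find(parent, i):
    # Disjoint-set find with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    # Merge the sets containing a and b.
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def partition(seqs, n_rounds=8, n_positions=6, seed=0):
    """Split equal-length sequences into disjoint subsets.

    Hypothetical scheme (an assumption, not the published method):
    sequences that agree on a randomly sampled subset of positions
    in any round end up in the same subset.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    rng = random.Random(seed)
    parent = list(range(len(seqs)))

    for _ in range(n_rounds):
        # Sample the positions that form this round's key.
        k = min(n_positions, length)
        positions = sorted(rng.sample(range(length), k))

        # Bucket sequence indices by their characters at those positions.
        buckets = defaultdict(list)
        for idx, s in enumerate(seqs):
            buckets["".join(s[p] for p in positions)].append(idx)

        # Every bucket links all of its members together.
        for members in buckets.values():
            for other in members[1:]:
                union(parent, members[0], other)

    # Connected components are the non-overlapping subsets.
    groups = defaultdict(list)
    for idx, s in enumerate(seqs):
        groups[find(parent, idx)].append(s)
    return list(groups.values())
```

Under these assumptions, each returned subset could be passed to a separate, more sensitive clustering run, and the runs could execute in parallel because no sequence appears in more than one subset. Near-identical sequences are likely, but not guaranteed, to share a subset, which is what makes such a scheme probabilistic.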

Published in

SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
March 2021
2075 pages
ISBN: 9781450381048
DOI: 10.1145/3412841

Copyright © 2021 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%