ABSTRACT
Sequence clustering is an important computational step in numerous bioinformatics applications such as high-throughput immune system characterization, marker-based diversity profiling, and reduced-complexity population genetic experiments. However, clustering the very large datasets produced by high-throughput sequencers is CPU- and memory-intensive. As a result, analyzing large datasets is challenging and often requires trading clustering sensitivity for speed. To address this limitation, we propose a new probabilistic data partitioning technique that divides large datasets of equal-length sequences into smaller, non-overlapping subsets that can subsequently be clustered using higher-sensitivity algorithms at lower computational cost. In addition to substantially lowering resource requirements, this partitioning step makes DNA sequence clustering straightforward to parallelize and allows each subset to be processed independently using more accurate algorithms. Our results show that our algorithm, which we implemented in a program called SLYMFAST, can cluster in a few hours datasets that would take weeks to cluster without prior partitioning.
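The abstract does not spell out the partitioning procedure, so the following is only a minimal sketch of one way equal-length sequences could be split into non-overlapping subsets before clustering. It is not SLYMFAST's algorithm. The scheme (link sequences that agree on a randomly sampled subset of positions, then merge linked sequences with a disjoint-set forest) and every name and parameter in it (`partition_sequences`, `num_masks`, `mask_size`, `seed`) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def find(parent, i):
    # Path-halving find for the disjoint-set forest.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def partition_sequences(seqs, num_masks=4, mask_size=8, seed=0):
    """Split equal-length sequences into disjoint subsets of indices.

    Hypothetical scheme, not the published method: two sequences are
    linked when they agree on every position of at least one randomly
    sampled position subset ("mask"); linked sequences share a subset.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    rng = random.Random(seed)
    parent = list(range(len(seqs)))

    for _ in range(num_masks):
        # Sample the positions that this mask inspects.
        positions = sorted(rng.sample(range(length), min(mask_size, length)))
        buckets = defaultdict(list)
        for idx, s in enumerate(seqs):
            key = "".join(s[p] for p in positions)
            buckets[key].append(idx)
        # Merge every sequence in a bucket into the same subset.
        for members in buckets.values():
            root = find(parent, members[0])
            for other in members[1:]:
                parent[find(parent, other)] = root

    groups = defaultdict(list)
    for idx in range(len(seqs)):
        groups[find(parent, idx)].append(idx)
    return sorted(groups.values())
```

Each returned subset is a list of indices into `seqs`. Because the subsets do not overlap, each one could be handed to a separate worker and clustered independently with a more sensitive tool.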
Index Terms
- Efficient DNA sequence partitioning using probabilistic subsets and hypergraphs