Abstract
Detecting traces of positive selection in genomes carries theoretical significance and has practical applications that range from shedding light on the forces that drive adaptive evolution to designing more effective drug treatments. Genomic datasets are currently growing at an unprecedented pace, fueled by continuous advances in DNA sequencing technologies, which leads to ever-increasing compute and memory requirements for meaningful genomic analyses. Most existing methods for detecting positive selection are either not designed to handle whole genomes or scale poorly with the sample size; they inevitably resort to a tradeoff between runtime and accuracy, raising serious concerns about the feasibility of future large-scale scans. To this end, we present RAiSD-X, a high-performance system that relies on a decoupled access-execute processing paradigm for efficient FPGA acceleration. It couples a sliding-window algorithm for the recently introduced μ statistic, novel to our knowledge, with a mutation-driven hashing technique to rapidly detect patterns in the data. RAiSD-X achieves up to three orders of magnitude faster processing than widely used software implementations and, more importantly, can exhaustively scan thousands of human chromosomes in minutes, yielding a scalable full-system solution for future studies of positive selection in species of flora and fauna.
Index Terms
- RAiSD-X: A Fast and Accurate FPGA System for the Detection of Positive Selection in Thousands of Genomes