ABSTRACT
Genome analysis is increasingly important in forensic science, medicine, and history. Sequencing technologies such as High Throughput Sequencing (HTS) and Third Generation Sequencing (TGS) have greatly accelerated genome sequencing, but genome read mapping remains significantly slower than sequencing. Because read mapping processes an enormous amount of data, its execution speed is limited by the speed of data transfer between the memory and the processing unit. In-memory computing can ease this memory-bandwidth bottleneck by minimizing data transfers. Ternary Content Addressable Memories (TCAMs) have been used in accelerators for seed-and-extend, a popular read mapping approach, because of their fast search capability. Seed-and-vote, another read mapping approach, is faster than seed-and-extend but less accurate on very short reads. As sequencing technology moves toward longer reads, the seed-and-vote approach is becoming more viable. We propose a genome read mapping accelerator that uses approximate TCAM to execute the Fast Seed and Vote algorithm (FSVA) that can map both short and long reads. We achieve 400X acceleration compared to the seed-and-extend approach BWA-MEM on a CPU, and 115X acceleration with a 30X energy improvement compared to a state-of-the-art in-memory accelerator that uses the seed-and-extend approach, at 98.75% accuracy for 100bp reads.
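To make the seed-and-vote idea concrete, the following is a minimal software sketch of the general technique, not the FSVA implementation or the proposed TCAM hardware. The function names `build_index` and `seed_and_vote`, the seed length `k`, and the use of non-overlapping seeds with an exact-match hash index are all assumptions made for illustration. An approximate TCAM would additionally tolerate mismatches during the seed search, which this sketch does not model.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_vote(read, index, k):
    """Return (best_start, votes) for the read, or (None, 0) if no seed hits.

    The read is cut into non-overlapping k-mer seeds (an illustrative choice).
    Each exact seed hit votes for the read start position it implies, and
    the candidate start with the most votes wins.
    """
    votes = Counter()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            # A hit at reference position `pos` for a seed taken at `offset`
            # implies the read starts at `pos - offset`.
            votes[pos - offset] += 1
    if not votes:
        return None, 0
    return votes.most_common(1)[0]


if __name__ == "__main__":
    # Toy reference and a read copied from position 10 with one substitution.
    reference = "ACGTACGGTCAATGCCTAGGATCCGATTACAGGCTTAACG"
    read = reference[10:15] + "G" + reference[16:30]
    index = build_index(reference, 4)
    print(seed_and_vote(read, index, 4))
```

In this toy case a single substitution spoils one seed, but the remaining seeds still agree on the same start position, which is why voting tolerates a few errors without a full extension step. It also suggests why very short reads are harder: they yield fewer seeds, so fewer votes are available to separate the true location from spurious hits.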
Index Terms
- Seed-and-vote based in-memory accelerator for DNA read mapping