ABSTRACT
In this paper, we propose ReD-LUT, a reconfigurable processing-in-DRAM architecture that leverages the high density of commodity main memory to enable flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication and division) using only memory read operations. In addition, ReD-LUT enables bulk bit-wise in-memory logic by exploiting the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line, which is beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally intensive applications: low-precision deep learning acceleration and Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that, for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× higher energy efficiency over the best in-DRAM bit-wise accelerators. For AES data encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.
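The idea of replacing an arithmetic operation with a table read can be illustrated with a small software model. The sketch below is only an analogy under stated assumptions: the 4-bit operand width, the address layout (operands concatenated into one index), and the names `build_mul_lut` and `lut_mul` are hypothetical and are not taken from the ReD-LUT design, whose in-DRAM table organization is not described in the abstract.

```python
# Illustrative software model of LUT-based multiplication.
# Assumptions (not from the paper): 4-bit unsigned operands and an
# address formed by concatenating the two operands.
BITS = 4


def build_mul_lut(bits=BITS):
    """Precompute the product of every pair of unsigned `bits`-wide operands."""
    size = 1 << bits
    # Entry at index a * size + b holds a * b.
    return [a * b for a in range(size) for b in range(size)]


def lut_mul(lut, a, b, bits=BITS):
    """Return a * b using a single table read; no multiplier is involved."""
    return lut[(a << bits) | b]


MUL_LUT = build_mul_lut()
product = lut_mul(MUL_LUT, 7, 9)
```

The table for two 4-bit operands holds 256 entries. Each additional operand bit quadruples the table size, which is one reason LUT-based arithmetic pairs naturally with low-precision workloads such as quantized neural networks.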
Index Terms
- ReD-LUT: Reconfigurable In-DRAM LUTs Enabling Massive Parallel Computation