DOI: 10.1145/3508352.3549469
research-article
Public Access

ReD-LUT: Reconfigurable In-DRAM LUTs Enabling Massive Parallel Computation

Published: 22 December 2022

ABSTRACT

In this paper, we propose ReD-LUT, a reconfigurable processing-in-DRAM architecture that leverages the high density of commodity main memory to enable flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication and division) using only memory read operations. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line, which is beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally intensive applications: low-precision deep learning acceleration and Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that, for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with the GPU and achieves ~37.8× speedup and 2.1× higher energy efficiency than the best in-DRAM bit-wise accelerators. For AES data encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.
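The idea of replacing arithmetic with a memory read can be illustrated in software. The sketch below is not the ReD-LUT hardware design or the authors' code; it is a minimal model that assumes 4-bit unsigned operands and a flat table indexed by the concatenated operand bits. The names `build_mul_lut`, `lut_mul`, and `lut_dot` are hypothetical, and the accumulation in `lut_dot` uses ordinary host addition purely for illustration.

```python
from itertools import product

BITS = 4  # assumed operand width (low-precision example, not from the paper)


def build_mul_lut(bits=BITS):
    """Precompute every product of two unsigned `bits`-wide operands.

    Entry (a << bits) | b holds a * b, so one "row" per operand pair.
    """
    size = 1 << bits
    # product() iterates a in the outer loop and b in the inner loop,
    # which matches the (a << bits) | b index layout.
    return [a * b for a, b in product(range(size), repeat=2)]


def lut_mul(lut, a, b, bits=BITS):
    """Multiply by a single table read instead of an arithmetic multiply."""
    return lut[(a << bits) | b]


def lut_dot(lut, xs, ws, bits=BITS):
    """Dot product of quantized vectors; each product is a table read."""
    return sum(lut_mul(lut, x, w, bits) for x, w in zip(xs, ws))


MUL_LUT = build_mul_lut()

if __name__ == "__main__":
    print(lut_mul(MUL_LUT, 7, 9))
    print(lut_dot(MUL_LUT, [1, 2, 3], [4, 5, 6]))
```

With 4-bit operands the table has 256 entries; the table size grows as 2^(2·bits), which is one reason LUT-based arithmetic is usually discussed in the context of low-precision workloads such as quantized neural networks.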


      • Published in

        ICCAD '22: Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design
        October 2022
        1467 pages
        ISBN:9781450392174
        DOI:10.1145/3508352

        Copyright © 2022 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 457 of 1,762 submissions, 26%
