Integration

Volume 88, January 2023, Pages 185-195

Plug N’ PIM: An integration strategy for Processing-in-Memory accelerators

https://doi.org/10.1016/j.vlsi.2022.09.016

Highlights

  • Processing-in-Memory devices must integrate with current host architectures.

  • The integration can be achieved without host modifications.

  • There is room for software optimization.

  • Processing-in-Memory devices benefit from host features.

Abstract

Processing-in-Memory (PIM) devices have reemerged as a promising way to mitigate the memory-wall and the cost of transferring massive amounts of data from main memory to the host processors. Novel memory technologies and the advent of 3D-stacked integration have provided means to compute data in memory, either by exploiting the inherent analog capabilities of the memories or by tightly coupling logic and memory. However, making effective use of a PIM device usually demands significant and costly modifications to the host processor to support instruction offloading, cache coherence, virtual memory management, and communication between different PIM instances. This paper tackles these challenges by presenting a set of solutions to couple host and PIM with no modifications on the host side. Moreover, we highlight limitations of modern host processors that may prevent the full performance of PIM devices from being extracted. This work presents Plug N’ PIM, a set of strategies and procedures to seamlessly couple host general-purpose processors and PIM devices. We show that, with our techniques, one can exploit the benefits of a PIM device with seamless integration between host and PIM, bypassing possible limitations on the host side.

Introduction

Complex cache memory hierarchies have mitigated the technological and performance gap between memory and logic. However, for applications that exhibit low temporal data locality or streaming behavior, cache memories can be rendered ineffective, incurring overall inefficiency [1], [2].

Over the last decades, PIM, Near-Data Accelerator (NDA), and Computing-In-Memory (CIM) devices have been proposed to mitigate the memory-wall and these technological limitations by exploiting the internally available memory bandwidth, thus avoiding unnecessary data movement through narrow buses. Supported by novel memory technologies [3], researchers have explored the inherent capability of such memories to compute data, mainly focusing on domain-specific applications [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. Likewise, industry and academia have presented a myriad of PIM designs supported by 3D-stacked integration [15], [16].

Many of these designs increase performance and energy efficiency for different classes of applications, such as biologically meaningful Neural Networks (NNs) [9], [17], Machine Learning and Convolutional Neural Networks (CNNs) [18], [19], graph traversal [13], [20], database and MapReduce workloads [21], [22], and logical operations [14], [23], [24], among others. Two main strategies are commonly adopted: minimal hardware [20], [21], [25], [26], [27], [28], [29], [30], [31] and full-core processors [18], [32], [33], [34], [35], [36]. Although the latter model can be seen as a multi-core/multi-processor design supported by well-known programming models, cache coherence protocols, and virtual memory mechanisms, the adoption of full-core processors is precluded by the violation of critical constraints (such as maximum power and available area) and by the loss of overall energy efficiency [37], [38]. On the other hand, minimal hardware implementations can meet the required constraints and offer high performance and energy efficiency, at the cost of new designs and significant changes to the software stack [20], [21], [25]. The most prominent PIM designs have placed significant effort on the computational aspect while neglecting code offloading, cache coherence, and virtual memory support [37], severely limiting their use in realistic environments. Moreover, these designs commonly reserve or lock the memory device, fully or partially, for the accelerator, which requires a costly mechanism or a software strategy to allow data sharing between host and PIM. Even when addressed, the solutions to these challenges demand extensive and costly modifications on the host processor side, given the required increase in design time and fabrication cost [39].

Memory technologies such as Resistive RAM (ReRAM), or even traditional and 3D-stacked Dynamic Random Access Memory (DRAM), can have an enormous range of applicability with PIM. Therefore, it becomes essential to couple PIM and host without incurring overheads in area, power, and energy, and especially without requiring modifications to well-established hosts.

This work is an extension of [40], where the Plug N’ Play mechanism seamlessly allows the adoption of PIM devices. This extension aims to exploit the resources of the host architecture to increase performance and optimize the integration with the PIM device.

Our goal is to provide high performance without disturbing the software stack or requiring hardware modifications on the host, by taking advantage of native instructions present in most modern General-Purpose Processors (GPPs) (e.g., x86, ARM). Hence, at compile time, Plug N’ PIM provides:

  • Fully-Compliant Main Memory Accelerator - Our techniques allow PIM to be placed within the system’s main memory without disturbing the memory hierarchy.

  • Code Offloading - a per-instruction code offloading approach, automatically triggered by the compiler. This strategy adopts native non-temporal stores to emit instructions to each memory-mapped PIM unit (a minimal sketch is given after this list).

  • Cache Coherence - a technique that identifies the memory addresses accessed by the PIM and triggers flush instructions for exactly those addresses, keeping cache coherence between host and PIM devices (also illustrated in the sketch after this list).

  • Virtual Memory - a design that allows host and PIM to share memory without Translation Look-aside Buffer (TLB) replication or limitations of any kind.

  • Non-Blocking Mechanism - thanks to the non-invasive approach, the host continues to execute normally and can access the main memory module concurrently with the PIM.

  • Host-PIM Interface Limits - we quantify the performance limits of the host-PIM coupling in a Plug N’ Play environment.
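To make the offloading and coherence mechanisms above concrete, the sketch below uses x86 intrinsics to emit one PIM instruction with a non-temporal store and to flush exactly the cache lines covering an operand the PIM will read. It is a minimal illustration, not the paper's implementation: pim_cmd_region, the 16-byte command format, and the 64-byte cache-line size are assumptions.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, 16-byte-aligned, memory-mapped region where one PIM unit
 * receives its instructions (not taken from the paper). */
extern uint8_t *pim_cmd_region;

/* Emit one 128-bit PIM instruction with a non-temporal (streaming) store,
 * bypassing the host caches on its way to the memory module. */
static inline void pim_emit(const void *insn128, size_t slot)
{
    __m128i cmd = _mm_loadu_si128((const __m128i *)insn128);
    _mm_stream_si128((__m128i *)(pim_cmd_region + 16 * slot), cmd);
    _mm_sfence(); /* make the streamed store globally visible */
}

/* Flush exactly the cache lines covering an operand the PIM will read,
 * so main memory holds the host's latest copy (per-address coherence). */
static inline void pim_flush_range(const void *addr, size_t bytes)
{
    const uint8_t *p = (const uint8_t *)addr;
    for (size_t off = 0; off < bytes; off += 64)
        _mm_clflush(p + off);
    _mm_sfence();
}

ARM processors expose analogous primitives (e.g., non-temporal STNP stores and DC CVAC cache cleaning), so the same scheme should translate to the other GPPs the authors mention.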

We show that, by using the proposed strategy, without any overhead on the host side, we can take advantage of the PIM device to improve the system’s overall performance. We evaluate our approach with the Yolo CNN. We show that speedups of up to 5.2× can be achieved when the host emits instructions to the PIM in a non-optimal way, and up to 13× when our optimized code offloading approach is adopted. We also evaluate the impact of different coherence levels, from no coherence to coherence only where needed. Moreover, the proposed techniques are suitable for any form of PIM accelerator that assumes the mentioned coupling style.

This work is organized as follows: Section 2 gives a brief introduction to our base terminology and the types of solutions presented for the memory-wall problem. Section 3 presents an overview of different strategies to solve the main challenges for PIM adoption. In Section 4, we present how our solution allows for Plug N’ Play integration of PIM units on current systems. In Section 5, we describe our experiments and evaluate the impact of Plug N’ PIM and its proposed optimizations. Finally, Section 6 concludes our discussions.

Section snippets

Background on PIM types

There have been several solutions presented in the literature for the memory-wall problem. Despite the similar objective, PIM, NDA, and CIM devices follow fundamentally different architectural approaches [41], [42]:

Full-Core Implementation: These Near-Data Accelerators propose to bring entire computational units to the memory chip. While taking advantage of more common programming models and data coherence mechanisms, these solutions face substantial power and area constraints in the

PIM integration RoadBlocks

PIM focuses on performance and energy efficiency for applications that cannot be efficiently accelerated by traditional hardware [37]. Although embracing PIM devices brings notable advantages to modern computer systems, their adoption is challenging. To allow PIM use, designers commonly rely on modifications to established hardware, such as modifying the general-purpose processor’s pipeline [21], [46], [47], or adding new instructions to the processor’s Instruction Set

Providing Plug N’ Play for PIM

This work’s primary focus is to provide a non-invasive environment that allows the adoption of PIM devices in a Plug N’ Play fashion. The presented techniques consider designs that implement simple FUs [19] or provide memory technology resources for logical operations [11]. To that aim, this work addresses code offloading, cache coherence, efficient communication, and support for virtual memory, providing the tightest coupling between host and PIM accelerators without any
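Purely as an illustration of how host and PIM can share memory through the host's existing virtual memory machinery, the sketch below maps a region exposed by an assumed /dev/pim driver into the host address space; the device node, offset, and 1 MiB size are hypothetical and not taken from the paper.

#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Assumed driver-exposed node for the PIM-capable memory region. */
    int fd = open("/dev/pim", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map 1 MiB of PIM-visible memory; the host MMU/TLB keeps doing all
     * address translation, so the PIM side needs no TLB replication. */
    size_t len = 1 << 20;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    ((volatile char *)buf)[0] = 42; /* host write lands in pages the PIM also sees */

    munmap(buf, len);
    close(fd);
    return 0;
}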

Optimizing Plug N’ Play bottlenecks

Section 4 proposes the basis for a non-invasive solution to adopt PIM in a Plug N’ Play fashion by using native instructions available on the host. In this section, we evaluate the bottlenecks inherent to the host that may harm the performance of the presented techniques, and we show how to mitigate them.
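As one example of such a host-side bottleneck (an assumption on our part, not a result quoted from the paper), fencing after every streamed command serializes the offload path; the sketch below amortizes a single sfence over a batch of non-temporal stores.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: pim_cmd_region and the 16-byte command slots are
 * assumptions. Batching the streamed stores and issuing one sfence per
 * batch amortizes the ordering cost of emitting many PIM commands. */
void pim_emit_batch(uint8_t *pim_cmd_region, const __m128i *cmds, size_t n)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si128((__m128i *)(pim_cmd_region + 16 * i), cmds[i]);
    _mm_sfence(); /* one ordering point for the whole batch */
}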

Conclusions and future work

In this work, we presented Plug N’ PIM, a set of strategies to provide seamless integration between general-purpose hosts and PIM devices. Our solution leverages native host instructions in different architectures (e.g., x86, ARM, RISC-V) to allow harmonious code offloading, cache coherence, and virtual memory support. We show that, despite some performance degradation due to the code offloading and cache coherence mechanisms, these bottlenecks depend on the host’s system (architecture, cache,

CRediT authorship contribution statement

Paulo C. Santos: Conceptualization, Methodology, Software, Writing – original draft. Bruno E. Forlin: Conceptualization, Methodology, Software, Writing – original draft. Marco A.Z. Alves: Resources, Writing – review & editing. Luigi Carro: Supervision, Funding acquisition, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study was financed in part by the FAPERGS, Brazil, CAPES, Brazil-Finance Code 001, CNPq, Brazil, Serrapilheira Institute, Brazil (grant number Serra-1709-16621).

References (64)

  • Santos P.C. et al.

    A technologically agnostic framework for cyber–physical and iot processing-in-memory-based systems simulation

    Microprocess. Microsyst.

    (2019)
  • Singh G. et al.

    Near-memory computing: Past, present, and future

    Microprocess. Microsyst.

    (2019)
  • Santos P.C. et al.

    Exploring cache size and core count tradeoffs in systems with reduced memory access latency

  • A. Shahab, M. Zhu, A. Margaritov, B. Grot, Farewell my shared llc! a case for private die-stacked dram caches for...
  • Chen Y.

    Reram: History, status, and future

    IEEE Trans. Electron Dev.

    (2020)
  • A. Drebes, L. Chelini, O. Zinenko, A. Cohen, H. Corporaal, T. Grosser, K. Vadivel, N. Vasilache, Tc-cim: Empowering...
  • L. Xie, H. Cai, J. Yang, Real: Logic and arithmetic operations embedded in rram for general-purpose computing, in: 2019...
  • Jain S. et al.

    Computing in memory with spin-transfer torque magnetic ram

    IEEE Trans. Very Large Scale Integr. (VLSI) Syst.

    (2018)
  • S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, R. Das, Compute caches, in: 2017 IEEE International...
  • P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, Prime: A novel processing-in-memory architecture for...
  • C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaaauw, R. Das, Neural cache: Bit-serial...
  • S. Li, D. Niu, K.T. Malladi, H. Zheng, B. Brennan, Y. Xie, Drisa: A dram-based reconfigurable in-situ accelerator, in:...
  • A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J.P. Strachan, M. Hu, R.S. Williams, V. Srikumar, Isaac: A...
  • L. Song, X. Qian, H. Li, Y. Chen, Pipelayer: A pipelined reram-based accelerator for deep learning, in: 2017 IEEE...
  • L. Song, Y. Zhuo, X. Qian, H. Li, Y. Chen, Graphr: Accelerating graph processing using reram, in: 2018 IEEE...
  • X. Xin, Y. Zhang, J. Yang, Elp2im: Efficient and low power bitwise operation processing in dram, in: 2020 IEEE...
  • D.U. Lee, Kim, et al., 25.2 A 1.2 v 8gb 8-channel 128gb/s high-bandwidth memory (hbm) stacked dram with effective...
  • Hybrid Memory Cube Consortium

    Hybrid memory cube specification rev 2.0.

    (2013)
  • Oliveira G.F. et al.

    Nim: An hmc-based machine for neuron computation

  • J. Liu, H. Zhao, M.A. Ogleari, D. Li, J. Zhao, Processing-in-memory for energy-efficient neural network training: A...
  • Nai L. et al.

    Graphpim: Enabling instruction-level pim offloading in graph computing frameworks

  • Santos P.C. et al.

    Operand size reconfiguration for big data processing in memory

  • S.H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, F. Li, Ndc:...
  • Gao F. et al.

    Computedram: In-memory compute using off-the-shelf drams

  • V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M.A. Kozuch, O. Mutlu, P.B. Gibbons, T.C. Mowry,...
  • V.T. Lee, A. Mazumdar, C.C. del Mundo, A. Alaghi, L. Ceze, M. Oskin, Application codesign of near-data processing for...
  • Ahn J. et al.

    Pim-enabled instructions: A low-overhead locality-aware processing-in-memory architecture

    SIGARCH Comput. Archit. News

    (2015)
  • D.S. Cali, G.S. Kalsi, C. Bingöl, L. Subramanian, J.S. Kim, R. Ausavarungnirun, M. Alser, J. Gomez-Luna, A. Boroumand,...
  • A. Farmahini-Farahani, J.H. Ahn, K. Morrow, N.S. Kim, Nda: Near-dram acceleration architecture leveraging commodity...
  • M. Gao, C. Kozyrakis, Hrl: Efficient and flexible reconfigurable logic for near-data processing, in: 2016 IEEE...
  • Gao M. et al.

    Tetris: Scalable and efficient neural network acceleration with 3d memory

    (2017)
  • D. Kim, J. Kung, S. Chai, S. Yalamanchili, S. Mukhopadhyay, Neurocube: A programmable digital neuromorphic architecture...

1 Paulo C. Santos and Bruno E. Forlin are co-primary authors.
