Plug N’ PIM: An integration strategy for Processing-in-Memory accelerators
Introduction
Complex cache memory hierarchies have mitigated the technological and performance gap between memory and logic. However, for applications that present low temporal data locality or streaming behavior, caches can be rendered ineffective, incurring overall inefficiency [1], [2].
In recent decades, PIM, Near-Data Accelerator (NDA), and Computing-In-Memory (CIM) devices have been presented to mitigate the memory wall and its technological limitations by exploiting the internally available memory bandwidth, thus avoiding unnecessary data movement through narrow buses. Supported by novel memory technologies [3], researchers have explored the inherent capabilities of such memories to compute on data, mainly focusing on domain-specific applications [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. Likewise, industry and academia have presented a myriad of PIM designs supported by 3D-stacked integration [15], [16].
Many of these designs increase performance and energy efficiency for different classes of applications, such as biologically meaningful Neural Networks (NNs) [9], [17], Machine Learning and Convolutional Neural Networks (CNNs) [18], [19], graph traversal [13], [20], databases and MapReduce [21], [22], and logical operations [14], [23], [24], among others. To this end, two main strategies are commonly presented: the adoption of minimal hardware [20], [21], [25], [26], [27], [28], [29], [30], [31] and the implementation of full-core processors [18], [32], [33], [34], [35], [36]. Although the latter model can be seen as a multi-core/multi-processor design supported by well-known programming models, cache coherence protocols, and virtual memory mechanisms, the adoption of full-core processors is precluded by the violation of critical constraints (such as maximum power and available area) and by the loss of overall energy efficiency [37], [38]. On the other hand, minimal hardware implementations can meet the required constraints and offer high performance and energy efficiency, at the cost of new designs and significant changes to the software stack [20], [21], [25]. The most prominent PIM designs have placed significant effort on the computational aspect while neglecting code offloading, cache coherence, and virtual memory capabilities [37], severely limiting their use in realistic environments. Moreover, these designs commonly reserve or lock the memory device, fully or partially, for the accelerator, which requires a costly mechanism or a software strategy to allow data sharing between host and PIM. Even when addressed, the solutions to these challenges demand extensive and costly modifications on the host processor side, given the required increase in design time and fabrication effort [39].
Memory technologies such as Resistive RAM (ReRAM), and even traditional and 3D-stacked Dynamic Random Access Memory (DRAM), give PIM an enormous range of applicability. Therefore, it becomes essential to couple PIM and host without incurring overheads in area, power, or energy, and especially without requiring modifications to well-established hosts.
This work extends [40], where the Plug N’ Play mechanism seamlessly allows the adoption of PIM devices. The extension exploits resources already available in the host architecture to increase performance and optimize the integration with the PIM device.
Our goal is to provide high performance without disturbing the software stack or requiring hardware modifications at the host, by taking advantage of native instructions present in most modern General-Purpose Processors (GPPs) (e.g., x86, ARM). Hence, at compile time, Plug N’ PIM allows:
- Fully-Compliant Main Memory Accelerator - Our techniques allow the PIM to be placed within the system’s main memory without disturbing the memory hierarchy.
- Code Offloading - A per-instruction code offloading approach, automatically triggered by the compiler. This strategy adopts native non-temporal stores to emit instructions to each memory-mapped PIM unit (see the sketch after this list).
- Cache Coherence - A technique that identifies PIM accesses to memory and triggers flush instructions for the exact addresses involved, keeping caches coherent between host and PIM devices (also illustrated in the sketch below).
- Virtual Memory - A design that allows host and PIM to share memory without Translation Look-aside Buffer (TLB) replication or limitations of any kind.
- Non-Blocking Mechanism - Due to the non-invasive approach, the host continues to execute normally and can access the main memory module concurrently with the PIM.
- Host-PIM Interface Limits - We size up the performance limits of the host-PIM coupling in a Plug N’ Play environment.
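To make the offloading and coherence items above concrete, the following is a minimal host-side sketch. It assumes an x86-64 host with SSE2; the memory-mapped command-window address and the helper names are our own illustrative placeholders, not the paper’s concrete interface.

```c
/*
 * Minimal host-side sketch of the offloading and coherence ideas above.
 * PIM_CMD_BASE and the notion of a memory-mapped command window are
 * hypothetical illustrations, not the paper's concrete interface.
 * Assumes an x86-64 host with SSE2 (non-temporal stores, CLFLUSH).
 */
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_stream_si64, _mm_clflush, _mm_sfence */

/* Hypothetical memory-mapped window where a PIM unit receives commands. */
#define PIM_CMD_BASE ((long long *)0x200000000ULL)

/* Code offloading: a non-temporal store bypasses the cache hierarchy, so an
 * encoded PIM command reaches the memory-mapped unit through an ordinary
 * store -- no new host instruction is required. */
void pim_emit(long long word) {
    _mm_stream_si64(PIM_CMD_BASE, word);
}

/* Cache coherence: before the PIM reads a buffer the host has written,
 * flush exactly the cache lines that cover it, nothing more. */
void pim_flush_range(const void *addr, size_t len) {
    const char *line = (const char *)((uintptr_t)addr & ~(uintptr_t)63);
    const char *end  = (const char *)addr + len;
    for (; line < end; line += 64)
        _mm_clflush(line);
    _mm_sfence();  /* make flushes and streamed stores globally visible */
}
```

An ARM host could play the same role with its non-temporal store (STNP) and cache-maintenance (DC CVAC) instructions, which is what makes the approach host-agnostic.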
We show that by using the proposed strategy, without any overhead on the host side, we can take advantage of the PIM device to improve the system’s overall performance. We evaluate our approach with the YOLO CNN, showing a speedup of up to 5.2× when the host emits instructions to the PIM in a non-optimal way, and up to 13× when our optimized code offloading approach is adopted. We also evaluate the impact of different coherence levels, from no coherence to coherence only where needed. Moreover, the proposed techniques are suitable for any PIM accelerator that assumes the mentioned coupling style.
This work is organized as follows: Section 2 gives a brief introduction to our base terminology and the types of solutions presented for the memory-wall problem. Section 3 presents an overview of different strategies to solve the main challenges for PIM adoption. In Section 4, we present how our solution allows for Plug N’ Play integration of PIM units in current systems. In Section 5, we describe our experiments and evaluate the impact of Plug N’ PIM and its proposed optimizations. Finally, Section 6 concludes our discussions.
Background on PIM types
Several solutions have been presented in the literature for the memory-wall problem. Despite their similar objective, PIM, NDA, and CIM devices follow fundamentally different architectural approaches [41], [42]:
Full-Core Implementation: These Near-Data Accelerators propose to bring entire computational units to the memory chip. While taking advantage of more common programming models and data coherence mechanisms, these solutions face substantial power and area constraints in the memory device.
PIM integration roadblocks
PIM focuses on performance and energy efficiency for applications that cannot be efficiently accelerated by traditional hardware [37]. Although embracing PIM devices brings notable advantages to modern computer systems, their adoption remains challenging. To enable PIM use, designers commonly rely on modifications to established hardware, such as changes to the general-purpose processor’s pipeline [21], [46], [47], or new instructions added to the processor’s Instruction Set Architecture (ISA).
Providing Plug N’ Play for PIM
This work’s primary focus is to provide a non-invasive environment that allows the adoption of PIM devices in a Plug N’ Play fashion. The presented techniques target designs that implement simple FUs [19] or provide memory technology resources for logical operations [11]. To that end, this work addresses code offloading, cache coherence, efficient communication, and support for virtual memory, providing the tightest coupling between host and PIM accelerators without any modification to the host.
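As a concrete illustration of this coupling, consider how a compiler could rewrite a simple loop into PIM command emissions. This is a sketch in the spirit of the approach, not the authors’ actual compiler output: pim_emit() and pim_flush_range() are the hypothetical helpers from the earlier sketch, and PIM_OP_VADD and PIM_VLEN are invented placeholders for a PIM opcode and operand width.

```c
#include <stddef.h>
#include <stdint.h>

extern void pim_emit(long long word);                      /* earlier sketch */
extern void pim_flush_range(const void *addr, size_t len); /* earlier sketch */

#define PIM_OP_VADD 0x01LL   /* hypothetical vector-add opcode */
#define PIM_VLEN    64       /* hypothetical operand width: 64 floats */

/* Original loop the compiler recognizes as PIM-friendly: */
void vadd_host(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* What the compiler could emit instead: flush the inputs once so the PIM
 * reads up-to-date data, then stream one multi-word command per chunk.
 * How the PIM resolves the operand addresses is up to the design's
 * virtual-memory scheme, which this sketch deliberately leaves abstract. */
void vadd_pim(float *c, const float *a, const float *b, size_t n) {
    pim_flush_range(a, n * sizeof *a);
    pim_flush_range(b, n * sizeof *b);
    for (size_t i = 0; i < n; i += PIM_VLEN) {   /* tail handling omitted */
        pim_emit(PIM_OP_VADD);
        pim_emit((long long)(uintptr_t)&c[i]);
        pim_emit((long long)(uintptr_t)&a[i]);
        pim_emit((long long)(uintptr_t)&b[i]);
    }
}
```

Note that the rewritten loop contains only ordinary stores and flushes, which is what keeps the host pipeline, ISA, and software stack untouched.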
Optimizing Plug N’ Play bottlenecks
Section 4 proposes the basis for a non-invasive solution to adopt PIM in a Plug N’ Play fashion by using native instructions available on the host. In this section, we evaluate the host-side bottlenecks that may harm the performance of the presented techniques, and we show how to mitigate them.
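One example of such a host-side bottleneck, offered as a hedged sketch rather than the paper’s exact optimization: on x86, non-temporal stores drain through 64-byte write-combining (WC) buffers, and a partially filled buffer is evicted as several smaller, costlier writes. Emitting commands in cache-line-sized batches keeps each buffer full before it drains.

```c
/* Minimal sketch, assuming the hypothetical PIM command window from the
 * earlier sketches and an x86-64 host with SSE2. Eight consecutive 64-bit
 * streaming stores fill one 64-byte write-combining buffer, so the line
 * leaves the core in a single burst; one fence per batch then suffices. */
#include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence */

#define PIM_CMD_BASE ((long long *)0x200000000ULL)  /* hypothetical window */

void pim_emit_batch8(const long long cmd[8]) {
    for (int i = 0; i < 8; i++)
        _mm_stream_si64(&PIM_CMD_BASE[i], cmd[i]);  /* fills one WC buffer */
    _mm_sfence();  /* drain the full 64-byte line to the PIM in one burst */
}
```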
Conclusions and future work
In this work, we presented Plug N’ PIM, a set of strategies to provide seamless integration between general-purpose hosts and PIM devices. Our solution leverages native host instructions in different architectures (e.g., x86, ARM, RISC-V) to allow harmonious code offloading, cache coherence, and virtual memory support. We show that despite some performance degradation due to the code offloading and cache coherence mechanisms, these bottlenecks depend on the host system’s characteristics, such as its architecture and cache hierarchy.
CRediT authorship contribution statement
Paulo C. Santos: Conceptualization, Methodology, Software, Writing – original draft. Bruno E. Forlin: Conceptualization, Methodology, Software, Writing – original draft. Marco A.Z. Alves: Resources, Writing – review & editing. Luigi Carro: Supervision, Funding acquisition, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This study was financed in part by FAPERGS, Brazil; CAPES, Brazil (Finance Code 001); CNPq, Brazil; and the Serrapilheira Institute, Brazil (grant number Serra-1709-16621).
References (64)
- et al., A technologically agnostic framework for cyber–physical and IoT processing-in-memory-based systems simulation, Microprocess. Microsyst. (2019).
- et al., Near-memory computing: Past, present, and future, Microprocess. Microsyst. (2019).
- et al., Exploring cache size and core count tradeoffs in systems with reduced memory access latency.
- A. Shahab, M. Zhu, A. Margaritov, B. Grot, Farewell my shared LLC! A case for private die-stacked DRAM caches for...
- ReRAM: History, status, and future, IEEE Trans. Electron Dev. (2020).
- A. Drebes, L. Chelini, O. Zinenko, A. Cohen, H. Corporaal, T. Grosser, K. Vadivel, N. Vasilache, TC-CIM: Empowering...
- L. Xie, H. Cai, J. Yang, ReAL: Logic and arithmetic operations embedded in RRAM for general-purpose computing, in: 2019...
- et al., Computing in memory with spin-transfer torque magnetic RAM, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. (2018).
- S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, R. Das, Compute caches, in: 2017 IEEE International...
- P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: A novel processing-in-memory architecture for...
- Hybrid Memory Cube specification rev. 2.0.
- NIM: An HMC-based machine for neuron computation.
- GraphPIM: Enabling instruction-level PIM offloading in graph computing frameworks.
- Operand size reconfiguration for big data processing in memory.
- ComputeDRAM: In-memory compute using off-the-shelf DRAMs.
- PIM-enabled instructions: A low-overhead locality-aware processing-in-memory architecture, SIGARCH Comput. Archit. News.
- TETRIS: Scalable and efficient neural network acceleration with 3D memory.
1. Paulo C. Santos and Bruno E. Forlin are co-primary authors.