Exploring the Processing-in-Memory design space

https://doi.org/10.1016/j.sysarc.2016.08.001

Abstract

With the emergence of 3D-DRAM, Processing-in-Memory has once more become of great interest to the research community and industry. Here we present our observations on a subset of the PIM design space. We show how the architectural choices for PIM core frequency and cache sizes affect the overall power consumption and energy efficiency. We include a detailed power consumption breakdown for an ARM-like core as a PIM core. We show the maximum possible number of PIM cores we can place in the logic layer with respect to a predefined power budget. We also catalog other sources of power consumption in a system with PIM, such as 3D-DRAM link power, and discuss possible power-reduction techniques. We describe the shortcomings of using ARM-like cores for PIM and discuss other alternatives for the PIM cores. Finally, we explore the optimal design choices for the number of cores as a function of performance, utilization, and energy efficiency.

Introduction

Over the last decade, we have witnessed the evolution of Big Data processing. Existing commodity systems, which are widely used by the Big Data processing community, are becoming less energy efficient and fail to scale in terms of power consumption and area [20]; the same study shows this also holds for scale-out workloads in general. Therefore, using hardware accelerators to aid Big Data processing is becoming more and more prominent. With the evolution of emerging DRAM technologies, in particular 3D-DRAM, Processing-in-Memory (PIM) has again become of great interest to the research community as well as industry [14], [15]. When it comes to Big Data processing, systems with PIM-enabled 3D-DRAM could prove more energy efficient and powerful than traditional commodity systems. Recent studies [7], [13], [18] have shown the potential use of PIM in 3D-DRAM chips. However, in order to prove the efficiency and usability of PIM, a much larger design space needs to be explored. This includes both software- and hardware-related design choices as well as tackling the challenges which arise from such a complex heterogeneous system. From a software perspective, challenges such as programmability, scalability, programming interfaces, and usability need to be explored. Major hardware challenges include PIM core microarchitecture, interconnection networks, and interfaces. Here we present our observations for a subset of architectural choices for the PIM cores, e.g. core architecture, frequency, and cache sizes, with the goal of maximizing energy efficiency. Our aim is to explore a part of this large design space and investigate the trade-offs between certain design choices. We focus on an ARM-like energy-efficient core as a PIM core and evaluate design choices for caches, core frequency, and number of cores for a set of Big Data analysis benchmarks based on MapReduce, as well as scientific OpenMP benchmarks. Our findings and observations include:

  • How cache size and core frequency affect the performance of a single PIM core and total power consumption

  • How these parameters and metrics translate to overall energy efficiency

  • Power decomposition for different system components

  • Potential number of cores we can place in the logic layer with respect to a power budget

  • Possible design choices for number of cores as a function of frequency, utilization, and energy efficiency

  • Bandwidth consumption of the benchmarks and the impact on the decision about the number and type of PIM cores

  • Discussion on alternative choices for the PIM cores

Section snippets

3D-DRAM

3D-DRAM memory provides high memory bandwidth, which reduces average memory-access latency, and lower power consumption than traditional DRAM. A prototype of such 3D-DRAM is already available from Micron [21]. A group of different vendors, the Hybrid Memory Cube Consortium (HMCC) [9], is working on expanding 3D-DRAM capabilities. The current prototype 3D-DRAM, known as the Hybrid Memory Cube (HMC), has a capacity of 4–8 GB and can provide a maximum memory bandwidth of 480 GB/s [9]. 3D-DRAM memory is typically

PIM integrated 3D-DRAM

In present-day data center systems we need to process large amounts of data as fast as possible. The main bottleneck to higher-speed processing is the gap between processor and memory speed, commonly known as the memory wall. Here we discuss the two most important issues behind this problem: latency and bandwidth. Energy efficiency is another crucial requirement for today's data centers. 3D-DRAM memory cubes provide higher bandwidth and lower power consumption. PIM

Design space exploration

A general model of a PIM-augmented architecture using 3D-DRAM has been proposed by Zhang et al. [15], and a similar model has been used in recent studies [7], [13] as well. We use the same model for our studies. The model consists of a host processor connected to one or more 3D-DRAM modules, each with several PIM cores residing in its logic layer. The host processor views all the 3D-DRAM modules as one physical address space shared between the host processor and the PIM cores.

Methodology

We used the gem5 simulator [19] to capture the performance statistics needed for our power and energy-efficiency evaluation. We used the “minor” CPU, an in-order, single-issue CPU model with support for the ARM ISA. We are aware that this model is not as detailed as gem5's out-of-order model, but it is the only available in-order model with ARM ISA support. We used a simple DRAM model with a fixed latency of 40 ns [13] to match the core latency of the 3D-DRAM. We ran four different microbenchmarks, written in the C programming
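
As an illustration of the fixed-latency memory model, the C sketch below treats every access as completing a constant 40 ns after issue. This is only a hedged approximation of the configuration described above, not the gem5 model itself, and the issue times are hypothetical.

    /* Sketch of a fixed-latency DRAM model: every access completes 40 ns
     * after issue, matching the 3D-DRAM core latency used above [13].
     * Illustrative only; it ignores banking, queuing, and row-buffer
     * effects that a real DRAM model would capture. */
    #include <stdio.h>
    #include <stdint.h>

    #define DRAM_LATENCY_NS 40u

    /* Return the completion time of a request issued at issue_ns. */
    static uint64_t mem_complete_ns(uint64_t issue_ns) {
        return issue_ns + DRAM_LATENCY_NS;
    }

    int main(void) {
        for (uint64_t t = 0; t <= 150; t += 50)   /* hypothetical issue times */
            printf("request issued at %3llu ns completes at %3llu ns\n",
                   (unsigned long long)t,
                   (unsigned long long)mem_complete_ns(t));
        return 0;
    }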

PIM cores frequency and cache sizes

We use the collected statistics from gem5 to evaluate what would be good architectural choices for cache sizes and core frequencies. In order to do that, we look at the overall energy efficiency for different cache-size/frequency pairs. The goal is to find an optimal point where we get the most out of the PIM cores with the lowest possible power consumption. For that, we take the total execution time obtained from gem5 and the power consumption of the core obtained from McPAT [25]. We include both
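
As a concrete sketch of this evaluation, the C program below combines execution time with core power into energy (E = P · t) and energy-delay product for a few cache-size/frequency pairs. The times and power values are hypothetical placeholders, not the numbers gem5 and McPAT produced for the paper.

    /* Energy-efficiency sketch: combine execution time (from gem5) with
     * core power (from McPAT) per cache-size/frequency configuration.
     * All numeric values below are hypothetical placeholders. */
    #include <stdio.h>

    struct config {
        int    l1_kb;    /* L1 cache size (KB)                  */
        int    freq_mhz; /* core frequency (MHz)                */
        double time_s;   /* total execution time (hypothetical) */
        double power_w;  /* core + cache power (hypothetical)   */
    };

    int main(void) {
        struct config cfg[] = {
            {16,  800, 1.40, 0.35},
            {64,  800, 1.25, 0.48},
            {16, 1000, 1.15, 0.52},
            {64, 1000, 1.05, 0.70},
        };
        for (int i = 0; i < 4; i++) {
            double energy = cfg[i].power_w * cfg[i].time_s; /* E = P * t (J) */
            double edp    = energy * cfg[i].time_s;         /* energy * delay */
            printf("%2d KB @ %4d MHz: E = %.3f J, EDP = %.3f J*s\n",
                   cfg[i].l1_kb, cfg[i].freq_mhz, energy, edp);
        }
        return 0;
    }

A lower energy-delay product indicates a better balance between performance and power, which is the optimization target described above.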

Power breakdown

We obtain the total PIM core power consumption from McPAT [25]. We scale the supply voltage to support various frequencies by using the voltage-frequency pairs as in [27]. We separate the power consumption into four components: static core power, dynamic core power, static cache power, and dynamic cache power. Power consumption depends on both frequency and supply voltage; since dynamic power scales with V²·f and the supply voltage must rise with frequency, total power grows superlinearly with frequency. Fig. 4a and 4b show the breakdown of different power components
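
This scaling can be sketched with the standard first-order CMOS power model, where dynamic power follows C·V²·f and static power grows roughly linearly with supply voltage. The operating points and coefficients below are hypothetical, standing in for the voltage-frequency pairs from [27] and the McPAT-derived values.

    /* Power-breakdown sketch using the first-order CMOS model:
     *   P_dynamic ~ C_eff * V^2 * f      P_static ~ V * I_leak
     * Operating points and coefficients are hypothetical placeholders. */
    #include <stdio.h>

    int main(void) {
        const double freq_ghz[] = {0.6, 0.8, 1.0, 1.2};     /* hypothetical f */
        const double volt_v[]   = {0.80, 0.85, 0.95, 1.05}; /* hypothetical V */
        const double ceff_nf    = 1.0;   /* effective capacitance (nF)        */
        const double ileak_a    = 0.10;  /* leakage current (A), hypothetical */

        for (int i = 0; i < 4; i++) {
            /* nF * V^2 * GHz conveniently yields watts (1e-9 * 1e9 = 1). */
            double p_dyn  = ceff_nf * volt_v[i] * volt_v[i] * freq_ghz[i];
            double p_stat = ileak_a * volt_v[i];
            printf("%.1f GHz @ %.2f V: dynamic %.3f W, static %.3f W\n",
                   freq_ghz[i], volt_v[i], p_dyn, p_stat);
        }
        return 0;
    }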

Number of PIM cores

The maximum number of PIM cores that can be placed in the logic layer of a 3D-DRAM will depend on the individual PIM core power consumption as well as the power limit of the logic layer. Researchers have proposed a conservative power budget of 10 W for the logic layer [18]. Fig. 5 shows the maximum number of cores, within that power budget, for different setups. At 800 MHz with a 16 KB L1 cache, we can place at most 26 cores in the logic layer, while at 1000 MHz and 64 KB we can place a maximum of 18
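
The calculation itself is a floor division of the budget by per-core power, as the sketch below shows; the per-core power values are hypothetical stand-ins chosen only to reproduce the core counts quoted above.

    /* Core-count sketch: how many PIM cores fit under the 10 W logic-layer
     * budget [18]. Per-core power values are hypothetical stand-ins for
     * the McPAT results. */
    #include <stdio.h>

    int main(void) {
        const double budget_w     = 10.0;  /* logic-layer power budget [18] */
        const double per_core_w[] = {0.38, 0.55};          /* hypothetical  */
        const char  *label[]      = {"800 MHz, 16 KB L1",
                                     "1000 MHz, 64 KB L1"};

        for (int i = 0; i < 2; i++) {
            int max_cores = (int)(budget_w / per_core_w[i]); /* floor */
            printf("%s: at most %d cores\n", label[i], max_cores);
        }
        return 0;
    }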

Bandwidth and link power

Fig. 7 shows the actual bandwidth consumption of wordcount (7a) and backprop (7b) when running on 16 PIM cores. Neither benchmark exceeds 20 GB/s of bandwidth consumption, and the same holds for the other workloads. Four PIM stacks, each with 16 PIM cores, would consume no more than 80 GB/s, which is low relative to the total available bandwidth. However, the PIM stacks would still be more energy efficient than running the code on the host processor. This is due to the fact that the host
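
The headroom argument can be checked with simple arithmetic using the figures from the text: at most 20 GB/s of demand per 16-core stack against the 480 GB/s an HMC can deliver [9].

    /* Bandwidth-headroom sketch using figures from the text: each 16-core
     * PIM stack consumes at most ~20 GB/s of the 480 GB/s an HMC provides. */
    #include <stdio.h>

    int main(void) {
        const double per_stack_demand_gbs = 20.0;   /* observed peak demand   */
        const double per_stack_avail_gbs  = 480.0;  /* peak HMC bandwidth [9] */
        const int    num_stacks           = 4;

        printf("per-stack: %.0f of %.0f GB/s used (%.1f%%)\n",
               per_stack_demand_gbs, per_stack_avail_gbs,
               100.0 * per_stack_demand_gbs / per_stack_avail_gbs);
        printf("four stacks: %.0f GB/s aggregate demand\n",
               per_stack_demand_gbs * num_stacks);
        return 0;
    }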

PIM core alternative

One of the design points for a PIM system is deciding on the type of processing core for the PIM. Loh et al. [2] presented several possibilities for the PIM cores, including ASICs and fixed-function PIMs. Our study, as well as some previous studies, explored the case where we would employ energy-efficient, low-power ARM cores as PIM cores. An advantage of having ARM cores as PIMs is their high energy efficiency. Designed for low power, ARM cores can still provide a significant

Conclusion

In this paper, we presented our observations on a subset of architectural choices for PIM cores. As a use case, we have used map() phases of several MapReduce workloads and OpenMP scientific benchmarks. Our study shows that a PIM core running at 800 MHz clock frequency has the best energy efficiency. Smaller caches are more energy efficient for MapReduce workloads while more intensive scientific workloads benefit from larger caches. We have shown the power consumption components and calculated

References (29)

  • M. Rezaei et al.

    Intelligent memory manager: Reducing cache pollution due to memory management functions

    J. Syst. Archit.

    (2006)
  • D.P. Zhang et al.

    A new perspective on processing-in-memory architecture design

  • G. Loh et al.

    A Processing in Memory Taxonomy and a Case for Studying Fixed-Function PIM

  • D.W. Chang et al.

    Reevaluating the latency claims of 3D stacked memories

  • A. Gara

    Energy efficiency challenges for exascale computing

  • S.W. Keckler et al.

    GPUs and the future of parallel computing

    IEEE Micro

    (2011)
  • M. Islam et al.

    Improving node-level mapreduce performance using processing-in-memory technologies

    In: Workshop on Unconventional High Performance Computing

    (2014)
  • B. Black et al.

    Die stacking (3D) microarchitecture

    IEEE Micro

    (2006)
  • Hybrid Memory Cube Consortium,...
  • J. Draper et al.

    The architecture of the DIVA processing in-memory chip

  • JEDEC,...
  • D. Patterson et al.

    A case for intelligent RAM

  • S.H. Pugsley et al.

    NDC: analyzing the impact of 3D-stacked memory+logic devices on mapreduce workloads

  • J. Torrellas

    FlexRAM: toward an advanced intelligent memory system: a retrospective paper

    Marko Scrbak is a Ph. D. student and research assistant at University of North Texas. His research interests revolve around computer architecture, specifically Processing-in-Memory (PIM) systems and memory system design and optimizations. He joined the Computer Systems Research Laboratory (CSRL) at UNT in 2012, after receiving his Bachelor's degree from University of Zagreb, Croatia. He received his MS degree in Computer Science from University of North Texas in 2015.

    Mahzabeen Islam is a Ph.D. student and research assistant at University of North Texas. Her research interest focuses on different aspects of computer memory systems, ranging from memory systems optimization and processing-in-memory to emerging memory technologies. She received her MS degree in Computer Science and Engineering from University of North Texas in 2015.

    Dr. Krishna Kavi is currently a Professor of Computer Science and Engineering and the Director of the NSF Industry/University Cooperative Research Center for Net-Centric Software and Systems at the University of North Texas. During 2001–2009, he served as the Chair of the department. He also held an Endowed Chair Professorship in Computer Engineering at the University of Alabama in Huntsville, and served on the faculty of the University Texas at Arlington. He was a Scientific Program Manager at US National Science Foundation during 1993- 1995. He served on several editorial boards and program committees. His research is primarily on Computer Systems Architecture including multi-threaded and multicore processors, cache memories and hardware assisted memory managers. He also conducted research in the area of formal methods, parallel processing, and real-time systems. He published nearly 200 technical papers in these areas. He received more than US $6 M in research grants. He graduated 15 PhDs and more than 35 MS students. He received his PhD from Southern Methodist University in Dallas Texas and a BS in EE from the Indian Institute of Science in Bangalore, India.

    Michael Ignatowski is an AMD Fellow since 2010. Prior to joining AMD, Mike was a Research Senior Technical Staff member at the IBM Watson Research Center. His worked on IBM Power and blade systems, systems and data center power efficiency management. At AMD he is focusing his research on advanced memory technologies and high performance computing. Mike received his BS in Physics from Michigan State University and a MS in Computer Engineering the University of Michigan.

    Dr. Nuwan Jayasena is a Principal Member of the Technical Staff at AMD Research. Previously he was with NVIDIA and Stream Processors Inc. His current research focus is on advanced memory systems including processing in memory and multi-level memories. He also contributed to AMD Fusion Systems Architecture and Heterogeneous Systems Architecture. He received his MS and PhD in Electrical Engineering from Stanford University and a BS in Computer Engineering from the University of Southern California.
