# ALP: Alleviating CPU-Memory Data Movement Overheads in Memory-Centric Systems

Nika Mansouri Ghiasi, Nandita Vijaykumar, Geraldo F. Oliveira, Lois Orosa, Ivan Fernandez, Mohammad Sadrosadati, Konstantinos Kanellopoulos, Nastaran Hajinazar, Juan Gómez Luna, Onur Mutlu

Abstract—Recent advances in memory technology have enabled near-data processing (NDP) to tackle main memory bottlenecks in modern systems. Prior works partition applications into segments (e.g., instructions, loops, functions) and execute memory-bound segments of the applications on NDP computation units, while mapping the cache-friendly application segments to host CPU cores that access a deeper cache hierarchy. Partitioning applications between NDP and host cores causes inter-segment data movement overhead, which is the overhead from moving data generated from one segment and used in the consecutive segments. This overhead can be large if the segments map to cores in different parts of the system (i.e., host and NDP). Prior works take two approaches to the inter-segment data movement overhead when partitioning applications between NDP and host cores. The first class of works maps segments to NDP or host cores based on the properties of each segment, neglecting the performance impact of the inter-segment data movement. Such partitioning techniques suffer from inter-segment data movement overhead. The second class of works maps segments to host or NDP cores based on the overall memory bandwidth savings of each segment (which depends on the memory bandwidth savings within each segment and the inter-segment data movement overhead). These works do not offload a segment to its best-fitting core if doing so incurs high inter-segment data movement overhead and, therefore, miss some of the potential NDP performance benefits.
We show that mapping each segment (here, a basic block) to its best-fitting core based on the properties of each segment can provide substantial performance benefits.

To this end, we introduce ALP, a new programmer-transparent technique to leverage the performance benefits of NDP by *alleviating* the performance impact of inter-segment data movement between host and memory and enabling efficient partitioning of applications between host and NDP cores. ALP alleviates the inter-segment data movement overhead by *proactively and accurately* transferring the required data between the segments mapped on host and NDP cores. This is based on the key observation that the instructions that generate the inter-segment data stay the same across different executions of a program on different input sets. ALP uses a compiler pass to identify these instructions and uses specialized hardware support to transfer data between the host and NDP cores at runtime. Using both the compiler and runtime information, ALP efficiently maps application segments to either host or NDP cores considering 1) the properties of each segment, 2) the inter-segment data movement overhead between different segments, and 3) whether this inter-segment data movement overhead can be alleviated by timely proactive data transfers. ALP performs on average 54.3% and 45.4% better compared to executing the application only on the host CPU or only the NDP cores, respectively.

Index Terms—Near-data processing, inter-segment data movement, application partitioning.

# 1 INTRODUCTION

Near-data processing (NDP) improves overall system performance by alleviating main memory bottlenecks [1]–[67]. While the cores in modern systems are provided with deep and large cache hierarchies, NDP computation units lack this advantage due to their limited area and thermal budget [9], [68], [69]. Accordingly, prior works partition applications into segments (e.g., instructions, loops, functions) and execute memory-bound segments of the applications on NDP computation units, and map the cache-friendly application segments to host CPU cores that access a deeper cache hierarchy.

If not done properly, partitioning an application's code into NDP-friendly and CPU-friendly segments can result in a large volume of inter-segment data movement (i.e., data generated from one segment and used in other segments). When the segments map to computation units on both the host and NDP systems, the data movement between the segments, in turn, translates to data movement overhead between the host and the NDP units and offsets part of the performance benefits of NDP. Prior works take two approaches to inter-segment data movement when partitioning applications between the host and NDP computation units. The first class of works maps segments to the host or the NDP computation units based on the characteristics of each segment by considering the memory access behavior of each segment individually [20], [70], [71]. Such works offload the memory-bound application segments to the NDP computation units, and keep the more cache-friendly segments in the host CPU cores. Since these approaches consider the memory access behavior of each segment in isolation from the other segments, they suffer from inter-segment data movement overhead between the host cores and NDP computation units. The second class of works maps application segments to the host or NDP computation units based on the overall memory bandwidth savings of each segment, which depends on the memory bandwidth savings within each segment and the inter-segment data movement overhead with other segments [21], [72], [73]. These works do not offload a segment to its best-fitting core if doing so incurs high inter-segment data movement overhead. Therefore, these works miss some of the potential NDP performance benefits.

<sup>•</sup> Nika Mansouri Ghiasi, Geraldo F. Oliveira, Lois Orosa, Mohammad Sadrosadati, Konstantinos Kanellopoulos, Nastaran Hajinazar, Juan Gómez Luna, and Onur Mutlu are with the Department of Information Technology and Electrical Engineering (D-ITET), ETH Zurich, 8092 Zürich, Switzerland.

<sup>•</sup> Nandita Vijaykumar is with the Department of Computer Science, University of Toronto, Toronto, ON M5S 2B1, Canada.

<sup>•</sup> Ivan Fernandez is with the Department of Computer Architecture, University of Malaga, 29016 Málaga, Spain.

To our knowledge, no prior work alleviates the cost of inter-segment data movement. We show that while mapping each segment to its best-fitting computation unit<sup>1</sup> on the host or the NDP side provides significant benefits (on average 26.8% and up to 44.1% better than execution only on the host or NDP computation units), the inter-segment data movement overhead significantly reduces this potential and can even lead to a slowdown compared to running the application only on the host computation units (on average 9.5% and up to 56.3% slower).

**Our goal** in this work is to alleviate the impact of inter-segment data movement to enable efficient partitioning of applications between NDP and host computation units. To this end, we propose **ALP**, a programmer-transparent hardware-software cooperative mechanism that <u>Al</u>leviates data movement between different segments when <u>P</u>artitioning applications between NDP and host computation units. The key idea of ALP is to alleviate the inter-segment data movement overhead by *proactively and accurately* transferring the required data between the segments that are mapped to the host and NDP computation units. This is based on the key observation that the instructions that generate the inter-segment data remain the same across different executions of a program on different input sets [74].

Quantifying and alleviating the overhead of inter-segment data movement while partitioning applications between the NDP and host units is challenging since it requires 1) identifying the inter-segment data that would cause performance overhead during partitioning, 2) transferring the inter-segment data to the unit that will execute the next code segment before the next segment starts, and 3) mapping segments between the NDP and host CPU units based on both the segment's internal characteristics and the resulting inter-segment data movement (considering the timeliness of proactive inter-segment data transfers). Jointly considering these factors is difficult since they are impacted by the complex interaction between many features of the application, the input data, and the underlying architecture.

ALP leverages compile-time and runtime information and operates in three steps. We assume a system with cores on the host side connected to a 3D-stacked memory with cores on its logic layer (i.e., NDP cores). NDP cores access memory with higher bandwidth and lower latency compared to the host CPU cores. In the first step, the compiler identifies the segments for which the data movement between them could reduce performance if they map to different (NDP and host) cores. ALP marks these segments as *tightly-connected segments*.

In the second step, using compile-time profiling, ALP finds the instructions that generate the inter-segment data in the tightly-connected segments [74]. Then, ALP identifies clusters of tightly-connected segments in which the inter-segment data can be proactively transferred from the producer segment to the consumer segment in a timely manner. To do so, ALP identifies the cases where the time for transferring the data written by these instructions can be fully overlapped with, and thus hidden by, the execution of other instructions in the segments. This way, during the application's runtime, ALP's hardware can proactively transfer the inter-segment data to the core that consumes it, while hiding the data movement overhead of this transfer. Using proactive data transfers, ALP enables starting the execution of the next segments of the application as soon as their inter-segment data arrives. For example, instead of making the NDP computation units wait until the host writes back all the inter-segment data located in its large caches, ALP enables the NDP computation units to start execution as soon as the first parts of the inter-segment data arrive.

In the third step, during runtime, ALP collects information regarding input data size, cache sizes, cache miss rates, and Instructions per Cycle (IPC) of the segments. ALP adds this information to the compile-time information about the tightly-connected segments collected in the first two steps. By collectively considering these factors, ALP efficiently incorporates the information regarding the inter-segment data movement overhead in making partitioning decisions. ALP can be tuned to enrich various partitioning techniques by alleviating their inter-segment data movement overhead.

We evaluate ALP across workloads from various domains (e.g., graph processing, graphics, machine learning, bioinformatics, and high-performance computing). For workloads whose data movement overhead can be alleviated via proactive data transfers, ALP achieves almost all the potential performance benefits of mapping each segment of the application to its best-fitting core, assuming no inter-segment data movement overhead. ALP performs on average 54.3% better than execution only on the host CPU cores and 45.4% better than execution only on the NDP cores. For workloads whose data cannot be proactively transferred, ALP avoids any performance overhead by executing all segments on either the NDP or the host cores. ALP incurs a modest area overhead of 1.25KB and significantly reduces energy consumption.

We make the following contributions in this work:

- We identify and characterize a critical aspect of efficiently leveraging NDP: the impact of inter-segment data movement overhead between the NDP and CPU cores when the application is partitioned between them. We show that data movement overheads can significantly diminish the potential performance benefits of NDP.
- We propose ALP, a programmer-transparent mechanism that alleviates the performance impact of data movement when partitioning applications between the NDP and host CPU cores. ALP identifies the application segments that would incur high inter-segment data movement overhead during partitioning, and alleviates this overhead by proactively and accurately transferring data between the segments.
- ALP orchestrates the compile-time and runtime information about the inter-segment data movement overhead. ALP factors in the characteristics of each segment, the estimated inter-segment data movement overhead, and the timeliness of proactive data transfers during partitioning.

<sup>1.</sup> As our NDP architecture, we consider a 3D-stacked memory with cores in its logic layer (called NDP cores), accessing memory with higher bandwidth and lower latency compared to the host CPU cores.

# 2 BACKGROUND AND MOTIVATION

## 2.1 Baseline Architecture

The baseline system we assume in this work consists of a host CPU and a 3D-stacked memory module that supports processing data on the computation units in the logic layer. The logic layer and memory layers are connected using through-silicon vias (TSVs) [75], which provide lower latency and significantly higher bandwidth than a traditional off-chip interconnection between main memory and the host CPU cores [3], [4], [6]–[8], [10], [12], [13], [15], [20], [22], [24], [48], [76]. The host CPU and the NDP logic layer employ similar out-of-order (OoO) cores with different cache hierarchies. The host cores use a conventional three-level cache hierarchy, while the NDP cores use a single-level private cache. In this section, we model the same computation units in both host and NDP systems to decouple the effect of computation capabilities from memory hierarchy and data movement. We show the effects of different core types in Section 6. Section 5 shows the details of our system organization and evaluation methodology.

## 2.2 The Effect of Inter-Segment Data Movement

In this section, we show the performance impact of inter-segment data movement that results from partitioning code between NDP and CPU cores. Applications can have different characteristics across different segments (e.g., basic blocks, loops, functions) [20], [71]. Segments that suffer from the main memory bottleneck benefit from NDP execution, while more cache-friendly segments benefit from host CPU cores that access a deeper cache hierarchy. Therefore, executing the whole application on the host or NDP cores without partitioning misses the opportunity to map each segment to the core on which it performs best. Partitioning applications between NDP and host cores causes inter-segment data movement overhead (i.e., overhead from moving data generated from one segment and used in the consecutive segments). This overhead can be large if the segments map to cores in different systems (i.e., host and NDP).

Prior works take two approaches to the inter-segment data movement overhead when partitioning applications between NDP and host cores. The first class of works [20], [70], [71] maps segments to NDP or host cores based on the architectural suitability of each segment. Such partitioning techniques suffer from inter-segment data movement overhead. The second class of works [21], [72], [73] maps segments to host or NDP cores based on the overall memory bandwidth savings of each segment (which depends on the memory bandwidth savings within each segment and the inter-segment data movement overhead with other segments). These works conservatively do not offload segments that would benefit from NDP cores but incur high inter-segment data movement overhead, and thus miss some of the potential NDP performance benefits.

Through an idealized study, we show the potential benefit of NDP and how inter-segment data movement overhead reduces this benefit. To do so, we map each segment of the application to host or NDP systems based on the architectural suitability of each segment. We consider each basic block<sup>2</sup> as a segment and measure the execution time of each segment on an NDP core and on a CPU core to find the best-fitting system (host or NDP) for each segment. Based on this oracle information, we map each segment to the core on which it performs best and compute the execution time of the programs *with* and *without* considering the performance impact of the inter-segment data movement overhead between the blocks.

Figure 1 shows the speedup of 1) executing all segments of the application on the host CPU cores (CPU), 2) executing all segments of the application on the NDP cores (NDP), 3) partitioning the application based on the architectural suitability of each segment with *zero* data movement cost (No\_DM), and 4) partitioning the application based on the architectural suitability of each segment with the cost of data movement included (Including\_DM). The speedup values are normalized to the host CPU core's performance. We make two observations based on this figure. First, No\_DM performs on average 26.8% (maximum 44.1%) better than the best average performance of only NDP or only CPU execution. Second, once the cost of data movement is included (Including\_DM), performance drops to, on average, 9.5% (worst case 56.3%) worse than CPU.



Fig. 1: Performance effect of inter-segment data movement overhead.

## 2.3 Goal

Based on these observations, we conclude that even though the potential benefits of partitioning applications between NDP and CPU cores in the absence of data movement overhead are very high, we see significant performance loss when we consider the effect of data movement in making offloading decisions. We emphasize that our baseline system in this study is already equipped with prefetching (Table 1). However, as we see in Figure 1, prefetching is not effective in alleviating the data movement overhead because the access patterns of the data moved between the CPU and NDP cores due to code partitioning are typically irregular and non-repetitive. **Our goal** is to alleviate the impact of inter-segment data movement to enable efficient partitioning of applications between NDP and CPU cores.

<sup>2.</sup> We choose this granularity because we find individual instructions to be too fine-grained for our NDP cores. This study can be performed at other granularities too.

Prior works on partitioning for heterogeneous core architectures [77]–[79] do not study partitioning in the context of NDP and do not consider the asymmetry in the memory hierarchy. This leaves significant challenges to address in the NDP context. First, the performance and energy overhead of communication between NDP and CPU is very high due to off-chip communication. Since the goal of NDP is to reduce the overhead of data movement, this extra communication can offset the potential benefits of NDP. This calls for a timely and proactive technique for addressing data movement issues in NDP. Second, prior works that propose techniques to alleviate data movement cost in a heterogeneous architecture [74] assume a fixed and known partitioning between the segments. In our scenario, we do not know statically where each segment maps. Third, software or compiler-assisted prefetching techniques [80], [81] execute alongside the code running on a different core and, therefore, do not transfer the data proactively. As a result, they cannot be *timely* enough for NDP scenarios. Due to these factors, the problem space of partitioning is very complex in this case because we need to consider (1) the advantage of NDP/CPU execution, (2) the significantly more critical data movement cost, and (3) the potential for proactive data transfer.

# 3 ALP

This section describes the three steps of ALP. In the first step (Section 3.1), during compile time, ALP detects the segments of the applications that can have high inter-segment data movement overhead. In the second step (Section 3.2), during compile time, ALP marks the instructions that generate the inter-segment data. In the third step (Section 3.3), during runtime, ALP incorporates the information about input data and the underlying architecture with the information collected during compile time and 1) performs proactive data transfer and 2) partitions applications between the host and NDP cores.

## 3.1 Identifying Tightly-Connected Segments

The goal of this section is to identify the segments for which the cost of data movement between them might offset the benefit of partitioning them. We refer to these segments as tightly-connected segments. After detecting the tightly-connected segments, the next steps of ALP try to reduce the overhead of inter-segment data movement between these segments. Listing 1 shows an example of two tightly-connected segments, assuming each loop is one segment. In the first loop, the application accesses several input arrays with random indices and generates out, which is the inter-segment data between these two loops. out is then re-used by the next nested loop for n\_reuse iterations. The random index rand\_idx1 in loop 1 is used to model random accesses to the data. Loop 1 accesses many data structures with random access patterns and maps better to NDP cores. However, if n\_reuse is high enough and out is larger than what would fit in the smaller NDP caches, loop 2 maps better to the host CPU cores with larger caches. With this mapping, the cost of transferring out to the host CPU offsets some of the partitioning benefits.

Listing 1: Synthetic workload for data transfer.
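A minimal Python sketch of a workload with the structure described above. Only `out`, `n_reuse`, and `rand_idx1` come from the text; the function names, the three inputs `in1`, `in2`, `in3`, the use of one shared random index per iteration, and the addition used to combine the inputs are assumptions made for illustration.

```python
import random


def loop1(in1, in2, in3, n, seed=0):
    """Segment 1: gather from several inputs at random indices and produce `out`."""
    rng = random.Random(seed)
    out = [0] * n
    for i in range(n):
        # rand_idx1 models irregular, cache-unfriendly accesses (NDP-friendly).
        rand_idx1 = rng.randrange(len(in1))
        out[i] = in1[rand_idx1] + in2[rand_idx1] + in3[rand_idx1]
    return out


def loop2(out, n_reuse):
    """Segment 2: re-read `out` n_reuse times (cache-friendly if `out` fits in cache)."""
    total = 0
    for _ in range(n_reuse):
        for value in out:
            total += value
    return total


if __name__ == "__main__":
    ones = [1] * 1024
    out = loop1(ones, ones, ones, n=256)   # `out` is the inter-segment data
    print(loop2(out, n_reuse=8))
```

In this sketch, `out` is the only data that flows from the first segment to the second, which is what makes the two loops candidates for being tightly connected.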

The first step of ALP leverages the compiler's assistance to detect if two segments are tightly connected by calculating their connectivity. The connectivity between the two segments depends on the ratio of the inter-segment data over all the data that both segments consume and produce. Thus, the connectivity between two segments can be modeled as follows:

$$connectivity = \max\left(\frac{inter\_segment\_data}{reg\_in1 + reg\_out1}, \frac{inter\_segment\_data}{reg\_in2 + reg\_out2}\right) \tag{1}$$

where *reg\_in1* and *reg\_out1* are the number of live registers moving in and out of the first segment, respectively, and *reg\_in2* and *reg\_out2* are the number of live registers moving in and out of the second segment. *inter\_segment\_data* is the number of overlapping registers in the *reg\_out1* and *reg\_in2* sets, which refers to the live registers that pass from one segment to the other. The liveness analysis of the compiler provides information regarding the live registers. If connectivity exceeds an architecture-dependent threshold, the mechanism marks the two segments as tightly-connected. This threshold depends on multiple architectural features that determine whether the overhead of inter-segment data movement outweighs the benefits of partitioning the segments between NDP and host cores. These architectural features can affect the inter-segment data movement overhead and the execution time of segments on NDP or host cores. Such features are 1) the latency and bandwidth of the off-chip link between main memory and the host system, 2) the internal latency and bandwidth of main memory, 3) the latency, bandwidth, and size of the NDP and host caches, and 4) NDP and host processor core features, such as their frequency and issue width.
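As an illustration of Equation (1), the following Python sketch computes connectivity from sets of live registers. Representing the live-in and live-out registers as Python sets, the function names, and the example threshold value are assumptions for illustration; in ALP the threshold is architecture-dependent and obtained by profiling.

```python
def connectivity(reg_in1, reg_out1, reg_in2, reg_out2):
    """Equation (1): each argument is a set of live register names."""
    # inter_segment_data: registers live out of segment 1 and live into segment 2.
    inter_segment_data = len(set(reg_out1) & set(reg_in2))
    total1 = len(reg_in1) + len(reg_out1)
    total2 = len(reg_in2) + len(reg_out2)
    ratio1 = inter_segment_data / total1 if total1 else 0.0
    ratio2 = inter_segment_data / total2 if total2 else 0.0
    return max(ratio1, ratio2)


def is_tightly_connected(reg_in1, reg_out1, reg_in2, reg_out2, threshold):
    """`threshold` stands in for the architecture-dependent, profiled value."""
    return connectivity(reg_in1, reg_out1, reg_in2, reg_out2) > threshold


if __name__ == "__main__":
    seg1_in, seg1_out = {"r1", "r2"}, {"r3", "r4"}
    seg2_in, seg2_out = {"r3", "r4", "r5"}, {"r6", "r7", "r8"}
    # Two registers (r3, r4) flow between the segments: max(2/4, 2/6).
    print(connectivity(seg1_in, seg1_out, seg2_in, seg2_out))
```

Taking the maximum of the two ratios means a pair is flagged as soon as the shared data dominates the traffic of either segment.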

This threshold is determined by a one-time offline profiling since, for a given system, this architecture-dependent threshold does not change. This threshold is calibrated by profiling a wide range of application segments with inter-segment data on a given system. We choose this threshold conservatively to mark the tightly-connected segments. ALP's runtime phase further considers application-dependent and runtime information to decide whether two tightly-connected segments can be partitioned between NDP and host cores.

We calculate connectivity between application segments iteratively to find the segments of the code that have high data movement between each other. For example, in the control flow graph in Figure 3a, if segments A, B, and C form a cluster of tightly-connected segments, they might also form a larger cluster of tightly-connected segments with D. To model the data movement, we calculate the size of the inter-segment data between the aggregated cluster (composed of segments A, B, and C) and segment D. Figure 3b shows an example of how ALP handles control flow divergence using an if-else statement. We analyze both sides of the branch and mark the segments with a large amount of inter-segment data movement on each side. In case the connectivity between the segments on *either* side is high, we mark the source and destination nodes of the control flow (A and D) and all intermediate segments as tightly-connected segments.

Fig. 3: Example of control flow divergence.

At the end of this stage, all the segments of the program are clustered with their tightly-connected segments. The data movement between the segments within a cluster might significantly reduce performance if the segments map to different NDP and CPU cores. Section 4 explains the implementation details of how ALP passes this clustering information to its subsequent phases.
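The iterative clustering can be sketched as a greedy pass over segments in control-flow order. This is a simplified illustration that handles only a straight-line sequence of segments (not the branch case of Figure 3b); the tuple representation, the way an aggregated cluster's live-in and live-out sets are merged, and the threshold value are assumptions, not ALP's actual compiler pass.

```python
def connectivity(in1, out1, in2, out2):
    """Equation (1) over sets of live registers."""
    inter = len(out1 & in2)
    total1, total2 = len(in1) + len(out1), len(in2) + len(out2)
    return max(inter / total1 if total1 else 0.0,
               inter / total2 if total2 else 0.0)


def cluster_segments(segments, threshold):
    """segments: list of (name, live_in, live_out) in control-flow order."""
    clusters = []
    names, c_in, c_out = None, set(), set()
    for name, live_in, live_out in segments:
        if names is not None and connectivity(c_in, c_out, live_in, live_out) > threshold:
            # Grow the aggregated cluster (assumed merge rule): values produced
            # inside the cluster are no longer counted as live-in.
            names.append(name)
            c_in |= (live_in - c_out)
            c_out |= live_out
        else:
            if names is not None:
                clusters.append(names)
            names, c_in, c_out = [name], set(live_in), set(live_out)
    if names is not None:
        clusters.append(names)
    return clusters


if __name__ == "__main__":
    segs = [("A", {"a"}, {"x"}), ("B", {"x"}, {"y"}), ("C", {"y"}, {"z"}),
            ("D", {"x", "y", "z"}, {"w"}), ("E", {"q"}, {"r"})]
    print(cluster_segments(segs, threshold=0.3))
```

Here D joins the cluster formed by A, B, and C because it consumes values produced by all three, while E shares no data with the cluster and starts a new one.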

## 3.2 Data Movement Alleviation

In this section, we explain how ALP alleviates the inter-segment data movement overhead between the tightly-connected segments.

### 3.2.1 Basic Data Transfer

To illustrate our approach to alleviating the inter-segment data movement overhead, we show the execution timeline of the synthetic workload in Listing 1 in three different cases in Figure 2. As mentioned in Section 3.1, assuming no inter-segment data movement overhead, loop 1 would ideally map best to the NDP cores and loop 2 would map best to the host CPU cores. In case (a), both segments execute on the host CPU core. By the time loop 2 starts executing, out is present in the host caches and further accesses to it from loop 2 hit in the cache. Despite the efficient use of host CPU caches for accessing out, loop 1 suffers from the memory bottleneck when running on the host CPU cores. In case (b), the first segment executes on the NDP core, whereas the second segment executes on the CPU core. In this case, loop 1's memory bottleneck is alleviated via NDP execution. However, in loop 2, all accesses to *out* miss in the CPU caches and incur significant data movement overhead from main memory. In case (c), we show ALP's approach to reducing the inter-segment data movement overhead, which is enabled by proactively transferring the inter-segment data to the next segment as soon as it is produced. Therefore, when loop 2 executes on the host CPU cores, it finds its needed data in the host caches.

Fig. 4: Execution timeline with transferring inter-segment data and concurrent execution of the segments.

Reducing the performance overhead of inter-segment data movement using this proactive data transfer approach is possible if the time for transferring the data can be mostly overlapped with other operations. This means there should be enough other instructions between the instruction that *generates* the inter-segment data and the instruction that *consumes* the data. For example, in Figure 2, transferring out[0] is overlapped with accessing in1[1], in2[1], and in3[1].
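The overlap condition can be illustrated with a simple timing model in Python. This is a sketch under strong assumptions: each cache line has a known production cycle and first-consumption cycle, the transfer latency is a single fixed number of cycles, and link bandwidth contention is ignored. The function names and cycle values are hypothetical.

```python
def exposed_transfer_cycles(produce_cycle, consume_cycle, transfer_latency):
    """Cycles the consumer would stall waiting for one proactively sent cache line."""
    arrival_cycle = produce_cycle + transfer_latency
    return max(0, arrival_cycle - consume_cycle)


def transfer_fully_hidden(lines, transfer_latency):
    """lines: iterable of (produce_cycle, consume_cycle) pairs, one per cache line."""
    return all(exposed_transfer_cycles(p, c, transfer_latency) == 0
               for p, c in lines)


if __name__ == "__main__":
    lines = [(0, 300), (100, 400)]
    print(transfer_fully_hidden(lines, transfer_latency=200))  # enough slack per line
    print(transfer_fully_hidden(lines, transfer_latency=350))  # transfer arrives late
```

A transfer is hidden when the gap between generation and consumption is at least as long as the transfer itself, which is why more intervening instructions make proactive transfers more effective.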

### 3.2.2 Data Transfer with Concurrent Execution

In some workloads, the producer and consumer segments access the inter-segment data with the same access pattern. In such cases, after the proactive data transfer of each cache line of inter-segment data, the next segment of the application can start execution on that data. This way, the segment that generates the inter-segment data elements can keep working on one core and the segment that consumes this data can run on another core, increasing the concurrency. Figure 4 shows an example execution timeline for this case. To detect this case, the compiler checks if (1) the first segment generates (writes) the data that the second segment reads and (2) the two segments access this data with the same access pattern.
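The two conditions above can be approximated on address traces, as in the Python sketch below. This is a trace-based stand-in for the compiler's check, not ALP's static analysis: it treats "same access pattern" as "the shared cache lines are first touched in the same order" and assumes a 64-byte cache line. All names are hypothetical.

```python
def first_touch_order(addresses, line_size=64):
    """Order in which distinct cache lines are first touched."""
    seen, order = set(), []
    for address in addresses:
        line = address // line_size
        if line not in seen:
            seen.add(line)
            order.append(line)
    return order


def can_run_concurrently(producer_writes, consumer_reads, line_size=64):
    """True if the consumer reads producer-written lines in the producer's order."""
    written = first_touch_order(producer_writes, line_size)
    read = first_touch_order(consumer_reads, line_size)
    shared = set(written) & set(read)
    if not shared:
        return False  # condition (1) fails: no inter-segment data
    written_shared = [line for line in written if line in shared]
    read_shared = [line for line in read if line in shared]
    return written_shared == read_shared  # condition (2): same order


if __name__ == "__main__":
    writes = [0, 8, 64, 72, 128]
    print(can_run_concurrently(writes, [0, 64, 128]))   # same streaming order
    print(can_run_concurrently(writes, [128, 64, 0]))   # reversed order
```

When the orders match, each transferred cache line can be consumed as soon as it arrives, which is what allows the two segments to overlap in time.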

### 3.2.3 Detecting Inter-Segment Data

In this section, we explain how ALP detects the inter-segment data that needs to be proactively transferred between two tightly-connected segments.



Fig. 2: Timeline of a workload on (a) CPU, (b) NDP, and (c) with proactive data transfer. The indices are not vector indices: number *i* inside the brackets refers to the *i*th cache line accessed during execution.

In most programs, the instructions that generate the inter-segment data are the same across different executions of the program for different input datasets [74]. Therefore, we leverage the Data Marshalling technique [74] to identify the instructions that generate the inter-segment data by profiling the application, as shown in Algorithm 1. This algorithm analyzes each pair of tightly-connected segments identified in the first step of ALP (Section 3.1). For every memory access in the current segment, the algorithm checks if the instruction that last wrote to this address is from the previous segment. In that case, the last writer instruction to this address from the previous segment is marked as a generator instruction (Lines 2 to 4). For each write in the current segment, the algorithm records the memory address and the Program Counter (PC) in the current\_last\_writer\_list. This way, we can check if they are the generator instructions for the next segment (Lines 5 to 7). At the start of each new segment, we empty the previous\_last\_writer\_list, make the finished segment's current\_last\_writer\_list the new previous\_last\_writer\_list, and start an empty current\_last\_writer\_list (Lines 9 to 13).

|    | Algorithm 1: Detecting generator instructions |
|----|-----------------------------------------------|
|    | <b>Result</b>: generator_instruction_list |
|    | <b>Input</b>: Address of the memory accesses, Instruction PCs, previous_last_writer_list |
| 1  | <b>for</b> Every memory access in the current segment <b>do</b> |
| 2  | &emsp;<b>if</b> (accessed cache-line is first read in current segment) <b>and</b> (the address is in the previous_last_writer_list) <b>then</b> |
| 3  | &emsp;&emsp;Add the PC of the last writer instruction in previous_last_writer_list to generator_instruction_list |
| 4  | &emsp;<b>end</b> |
| 5  | &emsp;<b>if</b> Memory access is store <b>then</b> |
| 6  | &emsp;&emsp;Add the address and the PC to current_last_writer_list |
| 7  | &emsp;<b>end</b> |
| 8  | <b>end</b> |
| 9  | <b>for</b> Every new segment start <b>do</b> |
| 10 | &emsp;Empty previous_last_writer_list |
| 11 | &emsp;Make current_last_writer_list to be previous_last_writer_list |
| 12 | &emsp;Make an empty list for current_last_writer_list |
| 13 | <b>end</b> |
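As an illustration of the profiling pass in Algorithm 1, the following is a minimal Python sketch. The `Access` trace record, the `find_generator_instructions` name, and the 64 B cache-line constant are assumptions made for this example and are not part of ALP's tooling. The sketch also skips addresses that the current segment has already overwritten, which is our reading of the requirement that the last writer belongs to the previous segment.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Set

CACHE_LINE_BYTES = 64  # assumed cache-line size for the "first read" check


class Access(NamedTuple):
    """One profiled memory access (hypothetical trace record)."""
    segment_id: int   # segment that issued the access
    pc: int           # program counter of the instruction
    address: int      # accessed memory address
    is_store: bool    # True for a write, False for a read


def find_generator_instructions(trace: Iterable[Access]) -> Set[int]:
    """Return the PCs whose stores are consumed by the following segment."""
    generators: Set[int] = set()
    previous_last_writer: Dict[int, int] = {}  # address -> PC (previous segment)
    current_last_writer: Dict[int, int] = {}   # address -> PC (current segment)
    lines_read: Set[int] = set()               # cache lines already read in this segment
    current_segment: Optional[int] = None

    for acc in trace:
        if acc.segment_id != current_segment:
            # New segment: the current list becomes the previous list.
            previous_last_writer = current_last_writer
            current_last_writer = {}
            lines_read = set()
            current_segment = acc.segment_id

        if acc.is_store:
            # Remember the last writer of this address in the current segment.
            current_last_writer[acc.address] = acc.pc
            continue

        line = acc.address // CACHE_LINE_BYTES
        first_read = line not in lines_read
        lines_read.add(line)
        # Assumption: a value overwritten in the current segment is not
        # inter-segment data, so its previous writer is not a generator.
        if (first_read
                and acc.address in previous_last_writer
                and acc.address not in current_last_writer):
            generators.add(previous_last_writer[acc.address])

    return generators
```

Because the result is a set of PCs rather than addresses, it can stay valid across input datasets, which is the property that the profiling step relies on.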

The compiler performs this profiling and marks the generator instructions with a new instruction added to the Instruction Set Architecture (ISA). When the program executes these instructions, if the two segments map to different cores, ALP proactively transfers the data from the producer core's cache to the cache of the core executing the next segment. Section 4.2 provides more details about the ISA and hardware support for this step.

ALP can work with any compiler because it relies on basic compiler features, such as liveness analysis, that are available in off-the-shelf compilers. The data movement analysis is done before register allocation, in the IR stage, with the code in static single assignment form. The baseline context-sensitive interprocedural analysis is required to model the data movements across the program. We analyze the whole graph of the application. ALP can adopt other techniques for optimizing interprocedural analysis. We do not expect our proposed mechanisms to significantly increase the compile time because they build on top of already existing compilation steps, like liveness analysis. Any additional increase in the compilation time will also be amortized over many runs for compiled languages.

## 3.3 Incorporating Runtime Information

The goal of this step is to 1) collect the architecture-dependent and runtime information and 2) together with the compile-time information (collected in the first two steps of ALP), assess and incorporate the impact of inter-segment data movement overhead during partitioning.

### 3.3.1 Offloading Metric

In this section, we explain the metrics that guide ALP to map segments to the host CPU or NDP cores. ALP can adopt other offloading metrics that can better suit different NDP architectures.

When an offload unit (i.e., a segment or a cluster of segments that need to run together on the same core) starts execution on a host CPU core, we measure these metrics over a small epoch of execution. If the ratio of L1 cache misses to LLC misses is close to one, we offload the execution to an NDP core. The reason is that in these scenarios, the large LLC does not serve significantly more memory requests than the small L1 cache. Since the NDP cores also have a small L1 cache, but a higher-bandwidth connection to main memory, these segments can potentially take more advantage of NDP execution. After the NDP offload happens, the NDP core also measures the IPC over an epoch of execution. If the IPC of NDP execution is lower than what was measured on the host CPU core before offloading, the execution migrates back to the CPU core. Section 4.3 describes the implementation details of the Runtime Table that keeps track of the runtime information of different offload units over different epochs of their execution.
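The two decisions described above can be sketched as follows. This is a minimal illustration, not ALP's hardware logic: the `EpochStats` fields, the function names, and the 0.8 threshold for "close to one" are assumptions chosen for this example. The sketch compares LLC misses to L1 misses (the reciprocal of the ratio in the text) so that the value stays within [0, 1] and a zero LLC miss count does not cause a division by zero.

```python
from dataclasses import dataclass


@dataclass
class EpochStats:
    """Counters sampled over one epoch (hypothetical field names)."""
    l1_misses: int
    llc_misses: int
    instructions: int
    cycles: int

    @property
    def ipc(self) -> float:
        # Instructions per cycle over the epoch; 0.0 if no cycles were recorded.
        return self.instructions / self.cycles if self.cycles else 0.0


def should_offload_to_ndp(host: EpochStats, threshold: float = 0.8) -> bool:
    """True when most L1 misses also miss in the LLC, so the LLC filters little."""
    if host.l1_misses == 0:
        return False  # no L1 misses: the offload unit is cache-friendly
    llc_to_l1 = host.llc_misses / host.l1_misses
    return llc_to_l1 >= threshold  # threshold is an assumed tuning knob


def should_migrate_back(host: EpochStats, ndp: EpochStats) -> bool:
    """True when the NDP epoch achieved a lower IPC than the earlier host epoch."""
    return ndp.ipc < host.ipc
```

For example, an offload unit with 1000 L1 misses and 950 LLC misses in an epoch would be offloaded under the assumed threshold, and it would return to the host if its NDP IPC dropped below the IPC recorded on the host.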

### 3.3.2 Offloading Segments with Potential for Data Movement Alleviation

This phase of ALP incorporates the information about the input data and the architecture features for the cluster of segments with the potential for data movement alleviation. As discussed in Section 3.2.1, the data movement alleviation technique through proactive data transfer is effective if the time for transferring the inter-segment data can be mostly overlapped with other operations. This means there should be more instructions between the instruction that *generates* the inter-segment data and the instruction that *consumes* the data. We refer to the segments with these features as segments with potential for data movement alleviation.

Based on the size of the inter-segment data, the NDP and host cache sizes, and the characteristics of each segment, ALP maps the segments to the host CPU or NDP cores in two scenarios. First, if the inter-segment data is too large to fit in the destination cache, the basic data transfer scheme (Section 3.2.2) will not improve performance, because beyond some point the newly arriving parts of the inter-segment data evict older parts. However, if we can transfer data with concurrent execution (Section 3.2.2), the next segment uses the data as it arrives. Second, if the size of the inter-segment data is small (compared to the destination's cache size), we assume its cost of data movement will not affect the total execution time unless the segments execute more than once. Therefore, ALP can profile the tightly-connected segments over a few iterations and map them to the host CPU or NDP cores based on the offloading metrics defined in Section 3.3.1.

Algorithm 2 shows how the runtime phase of ALP considers these different scenarios and maps these segments to NDP or CPU cores. Lines 1 to 21 in this algorithm handle the case where the inter-segment data in *each iteration* is smaller than the size of the CPU cache. Lines 22 to 30 handle segments with large inter-segment data.

|    | Algorithm 2: Offloading Segments with the Potential for Data Movement Alleviation |
|----|-----------------------------------------------------------------------------------|
|    | <b>Result</b>: The mapping of each segment within the cluster to NDP or CPU |
|    | <b>Input</b>: Begin and end PC address of the cluster; the size of the inter-segment data |
| 1  | <b>if</b> size of inter-segment data < CPU cache size <b>then</b> |
| 2  | &emsp;profile the segments within the cluster on initial iterations; |
| 3  | &emsp;<b>if</b> The producer segment shows high memory intensity <b>then</b> |
| 4  | &emsp;&emsp;<b>if</b> The consumer segment shows low memory intensity <b>then</b> |
| 5  | &emsp;&emsp;&emsp;MAP(producer, NDP); |
| 6  | &emsp;&emsp;&emsp;MAP(consumer, CPU); |
| 7  | &emsp;&emsp;&emsp;Transfer the inter-segment data; |
| 8  | &emsp;&emsp;<b>else</b> |
| 9  | &emsp;&emsp;&emsp;MAP(producer, NDP); |
| 10 | &emsp;&emsp;&emsp;MAP(consumer, NDP); |
| 11 | &emsp;&emsp;<b>end</b> |
| 12 | &emsp;<b>else</b> |
| 13 | &emsp;&emsp;<b>if</b> The consumer segment shows low memory intensity <b>then</b> |
| 14 | &emsp;&emsp;&emsp;MAP(producer, CPU); |
| 15 | &emsp;&emsp;&emsp;MAP(consumer, CPU); |
| 16 | &emsp;&emsp;<b>else</b> |
| 17 | &emsp;&emsp;&emsp;MAP(producer, CPU); |
| 18 | &emsp;&emsp;&emsp;MAP(consumer, NDP); |
| 19 | &emsp;&emsp;&emsp;Transfer the inter-segment data; |
| 20 | &emsp;&emsp;<b>end</b> |
| 21 | &emsp;<b>end</b> |
| 22 | <b>else</b> |
| 23 | &emsp;Profile the producer segment in CPU; |
| 24 | &emsp;<b>if</b> The producer segment shows high memory intensity and transfer with concurrent execution mode <b>then</b> |
| 25 | &emsp;&emsp;MAP(producer, NDP); |
| 26 | &emsp;&emsp;proactively transfer the inter-segment data to CPU; |
| 27 | &emsp;<b>else</b> |
| 28 | &emsp;&emsp;MAP(producer, CPU); |
| 29 | &emsp;<b>end</b> |
| 30 | <b>end</b> |
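The decision structure of Algorithm 2 can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the profiling results are passed in as booleans rather than measured, and the `Core`, `Mapping`, and `map_cluster` names are hypothetical. In the small-data case, each segment is mapped according to its own memory intensity and a transfer is needed only when the two cores differ, which reproduces the four branches of Lines 3 to 21. In the large-data case the algorithm only maps the producer, so the sketch leaves the consumer undecided unless the data is explicitly sent to the CPU.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Core(Enum):
    CPU = "CPU"
    NDP = "NDP"


class Mapping(NamedTuple):
    producer: Core
    consumer: Optional[Core]  # None when the branch does not map the consumer
    transfer: bool            # True when inter-segment data is moved proactively


def map_cluster(inter_segment_bytes: int,
                cpu_cache_bytes: int,
                producer_mem_intensive: bool,
                consumer_mem_intensive: bool,
                concurrent_transfer: bool) -> Mapping:
    """Map a producer/consumer cluster following the structure of Algorithm 2."""
    if inter_segment_bytes < cpu_cache_bytes:
        # Lines 1-21: memory-intensive segments go to NDP, the rest to the CPU.
        producer = Core.NDP if producer_mem_intensive else Core.CPU
        consumer = Core.NDP if consumer_mem_intensive else Core.CPU
        return Mapping(producer, consumer, transfer=(producer is not consumer))
    # Lines 22-30: large inter-segment data.
    if producer_mem_intensive and concurrent_transfer:
        # Line 26 sends the data to the CPU, which implies a CPU-side consumer.
        return Mapping(Core.NDP, Core.CPU, transfer=True)
    return Mapping(Core.CPU, None, transfer=False)
```

A call such as `map_cluster(1 << 20, 8 << 20, True, False, False)` corresponds to Lines 5 to 7: the producer runs on NDP, the consumer on the CPU, and the inter-segment data is transferred.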

### 3.3.3 Offloading Segments without the Potential for Data Movement Alleviation

This section explains how ALP maps the segments within the same cluster of tightly-connected segments to the NDP or CPU cores in case the segments do *not* take advantage of the proactive data transfer technique (as described in Section 3.2.1).

Based on the size of the inter-segment data, the NDP and host cache sizes, and the characteristics of each segment, ALP maps the segments to the host CPU or NDP cores in two scenarios. First, if the data moved from one segment to another is very large, the cache of the NDP or host CPU cores cannot capture the reuse of this data between the segments. Therefore, executing the tightly-connected segments of the cluster on the same core does **not** lead to higher performance. Second, if the data movement can be captured in the host caches, ALP makes the offloading decision for these segments as a single unit. We call these segments *inseparable segments*.

Mapping inseparable segments is challenging because their optimal mapping depends on their own characteristics and the characteristics of their connected segments. We make the key observation that the important inseparable segments occur *jointly and repeatedly*. The reason is that if these segments do not occur frequently, they either (1) take a small amount of execution time, and therefore do not contribute to the overall performance, or (2) take a large amount of time and pass a large amount of data to each other, which cannot be captured in caches for further reuse. Therefore, by leveraging the repeated nature of the inseparable segments, ALP's runtime mechanism profiles their **aggregated behavior** over an epoch to make the offloading decision for both segments in their following iterations.

# 4 IMPLEMENTATION

In this section, we provide details on ALP's implementation. Figure 5 shows how ALP's hardware units interface with the host CPU pipeline. We add an Offload Management Unit that resides on the host chip and is responsible for handling the offload to the NDP cores. The Monitor Units in the NDP and CPU cores collect the necessary runtime information and populate the Offload Table. The Offload Table also holds the information that the compiler has gathered about the segments of the code. Based on our analysis in Section 6.5, this table can be accessed within one clock cycle. The static and dynamic information in the Offload Table is the basis on which the Offload Management Unit maps the offload units (segments or a cluster of segments) to different cores. The hardware components of ALP (i.e., Offload Management Unit and Monitor Units) are not in the critical path of any of the pipeline stages in the processor. They reside next to the cores and, in each epoch, receive information about the execution of the application segment on the host CPU cores and the NDP cores. Therefore, ALP does not change the depth or the frequency of the processor cores' pipeline. This section explains the functionality of these units in detail.



Fig. 5: Overview of ALP's hardware units.

## 4.1 Identifying Clusters

In this section, we describe how the first step of ALP communicates the compile-time information about the segments (Section 3.1) for the use of the next step of ALP.

**Code Annotation and ISA Support.** The compiler passes the information about the segments to the runtime mechanism through two new instructions that we introduce to the Instruction Set Architecture (ISA): CLSTR.BEGIN and CLSTR.END. These instructions respectively identify the beginning and the end of a cluster of tightly-connected segments and include an identifier for each cluster. These instructions enable detecting when the candidate offload units start and end, without any programmer involvement.

**Interface to the processor.** The compile-time information collected in the first step populates the *Offload Table* located in the Offload Management Unit (Figure 5) at the beginning of the program's execution, which is then accessed during runtime (during Phase 3). This table holds information about the segments, such as their ID, type (producer, consumer, or inseparable segments), and the relative inter-segment data ratios (i.e., the ratio of inter-segment live registers over the total live registers in the tightly-connected segments). During the execution, after decoding the CLSTR.BEGIN instruction, ALP searches the table with the Cluster ID to retrieve the data about the cluster.
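To make the table lookup concrete, the following is a small software model of the compile-time part of the Offload Table. It is only a sketch: the class and method names (`OffloadTable`, `populate`, `on_clstr_begin`) and the field layout are assumptions for this example, and the actual structure is a hardware table whose format the text does not fully specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class SegmentType(Enum):
    PRODUCER = "producer"
    CONSUMER = "consumer"
    INSEPARABLE = "inseparable"


@dataclass(frozen=True)
class OffloadEntry:
    """Compile-time row of the table (hypothetical layout)."""
    cluster_id: int
    segment_type: SegmentType
    inter_segment_ratio: float  # inter-segment live registers / total live registers


def inter_segment_ratio(inter_segment_live_regs: int, total_live_regs: int) -> float:
    """Relative inter-segment data ratio as defined in the text."""
    return inter_segment_live_regs / total_live_regs if total_live_regs else 0.0


class OffloadTable:
    def __init__(self) -> None:
        self._rows: Dict[int, List[OffloadEntry]] = {}

    def populate(self, entries: List[OffloadEntry]) -> None:
        """Load compiler-generated rows at the start of program execution."""
        for entry in entries:
            self._rows.setdefault(entry.cluster_id, []).append(entry)

    def on_clstr_begin(self, cluster_id: int) -> List[OffloadEntry]:
        """Model the lookup performed after CLSTR.BEGIN is decoded."""
        return self._rows.get(cluster_id, [])
```

Keying the rows by Cluster ID and keeping one row per segment mirrors the description that producer and consumer segments occupy separate rows so that they can be profiled separately.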

## 4.2 Data Movement Overhead Alleviation

This section explains how ALP performs the proactive data transfer between different segments of the application as described in Section 3.2.

**Code Annotation and ISA Support.** After identifying the last writer of the inter-segment data, the compiler marks those instructions by adding a prefix called TRANSFER to them. This prefix indicates that the instruction's resulting data (inter-segment data) needs to be transferred to the next segment as soon as it is generated. Therefore, when this instruction executes, the Offload Management Unit knows that it has to transfer the inter-segment data to the next segment in case the segments within the cluster of tightly-integrated segments are mapped to different cores. Using these instructions, ALP detects and transfers the inter-segment data in a timely manner, regardless of the access pattern of the inter-segment data across different input sets and executions. The information collected during this step at compile time also populates the Offload Table's entry to indicate the segment types (producer of the inter-segment data or its consumer).

**Hardware Support.** If the segments within a cluster of tightly-integrated segments are mapped to different cores, the runtime phase of ALP handles the data transfer through a modified MESI cache coherency protocol. If the data needs to transfer from the host CPU cores to the NDP cores, the runtime system issues this transfer request, invalidates the inter-segment cache-lines in the host caches, transfers the data to the NDP cache, and sets their state to Modified. If the data transfers from the NDP cores to the host CPU cores, the cache-line is invalidated in the NDP cache and transferred to the host CPU caches in the Modified state.
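The state changes involved in a proactive transfer can be illustrated with a toy model. This is a sketch only: the `Cache` class and `proactive_transfer` function are hypothetical names, data payloads and timing are not modeled, and the real mechanism is a modification of the hardware MESI protocol rather than software.

```python
from enum import Enum
from typing import Dict


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Cache:
    """Toy cache that tracks only the MESI state of each cache line."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.lines: Dict[int, State] = {}

    def state(self, line: int) -> State:
        # Lines that were never installed are treated as Invalid.
        return self.lines.get(line, State.INVALID)


def proactive_transfer(src: Cache, dst: Cache, line: int) -> bool:
    """Move one inter-segment cache line from the producer's cache to the consumer's.

    The line is invalidated at the source and installed as Modified at the
    destination, in either direction (host to NDP or NDP to host).
    """
    if src.state(line) is State.INVALID:
        return False  # nothing to transfer
    src.lines[line] = State.INVALID
    dst.lines[line] = State.MODIFIED
    return True
```

The same function covers both directions described above, since the only difference between them is which cache plays the role of source and destination.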

## 4.3 Runtime System

This section explains how the third phase of ALP makes the offloading decisions based on the static and dynamic information collected through its different steps. Figure 5 shows the *Offload Table*, which holds the information about different offload units, and the table's interface to CPU and NDP cores. First, the compiler-generated data populates the table with cluster information. The table also holds information about the cluster type to detect whether the clusters have the potential for data transfer, and in case they do, it places the producer and consumer segments in separate rows to profile them separately **①**. During the execution of the program, the Monitor Unit stores the information about the IPC and L1 and LLC cache miss rates of each offload unit in the Offload Table **②**. It also detects the absolute size of the inter-segment data based on the actual data sizes and the ratio of the inter-segment data determined during compile-time. After an epoch of execution, if the memory intensity of an offload unit is high, the Monitor Unit records the Instructions per Cycle (IPC) of its execution so far in the Offload Table **③**, and sends the offload unit to the NDP cores. It also includes the required input live registers of the offload unit and its starting PC in the Offload Package **④**. The Offload Management Unit informs the host CPU cores to stall **⑤** until it receives an acknowledgment that the NDP execution has finished **⑥**. Meanwhile, in case the producer and consumer blocks map to different cores, the Offload Management Unit issues the transfer signal to the respective coherency mechanism to move the inter-segment data from the producer to the consumer proactively **⑦**. After each epoch of execution, the NDP cores communicate the status of NDP execution by sending the IPC of the offload unit to the Offload Table. In case the IPC of the offload unit decreases from what was measured during its execution on the CPU, the offload unit transfers back to the host CPU for its remaining iterations **⑧**.

## 4.4 Design Considerations

### 4.4.1 Cache Coherence

In this work, we model a fine-grained coherency protocol between the NDP and host CPU cores. The host LLC is only inclusive of the host-side caches. Therefore back invalidation from LLC only affects the CPU caches. When a cacheline is evicted from LLC, in case it is also present in an NDP cache, its status will change to Exclusive in the NDP cache. If there is any coherency-related data movement between NDP and CPU caches, we model its cost by adding the cost of memory stack's off-chip link to the baseline coherency overhead modeled by our simulator.

Further optimization over the fine-grain coherency is possible by adopting more advanced coherency mechanisms such as [20]. The problem we address in this work is rather about the inter-segment data sharing between the segments, which exists even in the context of a single thread and is different from the coherency issue. Advanced coherency mechanisms do improve the performance of NDP execution, and their performance impact can be considered during ALP's runtime.

### 4.4.2 Virtual Memory

Translating virtual addresses to physical addresses when executing an application on the NDP side is challenging. If the NDP cores rely on the existing host-side translation mechanism, the NDP cores need to send a translation request to the host via low-bandwidth off-chip buses for every memory access. This overhead can further increase if the translation requires page table walks that incur further data movement overhead between host and main memory. If the NDP cores naively duplicate the TLB and page table walker, they can incur significant overhead due to 1) maintaining coherency with the host-side translation mechanism, and 2) additional area overhead.

Similar to previous works [82], in this work, we assume Direct Segments [83] as our virtual memory model and interface the memory as a primary region. Each direct segment maps a large range of contiguous virtual memory addresses to contiguous physical addresses using base, limit and offset information. If a virtual address is between the base and limit, its corresponding physical address is simply translated as that virtual address plus the offset. To support direct segment translation, the NDP cores require a small direct segment hardware (including registers to store base, limit, and offset values, and an adder) and need to receive base, limit, and offset values from the host at the beginning of NDP execution of each offloaded application segment. ALP can orthogonally adopt more advanced NDP-specific translation techniques [84], [85] for further performance benefits, which are beyond the scope of this work.
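The base, limit, and offset translation described above amounts to one range check and one addition, as the following sketch shows. The `DirectSegment` class is a hypothetical software model for illustration, and treating the limit as exclusive is an assumption, since the text only says "between the base and limit".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DirectSegment:
    """Register values the NDP core receives from the host before an offload."""
    base: int    # first virtual address covered by the segment
    limit: int   # end of the covered virtual range (assumed exclusive)
    offset: int  # physical address minus virtual address

    def translate(self, vaddr: int) -> Optional[int]:
        """Return the physical address, or None if vaddr is outside the segment."""
        if self.base <= vaddr < self.limit:
            return vaddr + self.offset  # a single adder suffices in hardware
        return None  # would need another translation path, not modeled here
```

For instance, with `base=0x1000`, `limit=0x5000`, and `offset=0x10000`, the virtual address `0x2000` maps to the physical address `0x12000`, while `0x5000` falls outside the segment.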

## 4.5 Multiple NDP Stacks

ALP extends to a system with multiple NDP stacks assuming a baseline mapping *between the stacks*. ALP addresses the problem of data sharing *between segments in the NDP and host CPU cores*. After offloading a segment to NDP cores, ALP's runtime structure (Section 4.3) monitors its execution on NDP cores. If the NDP stacks have high data movement between each other and that leads to lower performance than host execution, ALP can identify this and roll back the execution to the CPU.

# 5 EVALUATION METHODOLOGY

## 5.1 Experimental Setup

We develop a system-level simulator that accurately models the host CPU and NDP cores with the data movement between them. Our simulator uses ZSim [86] (https://github.com/s5z/zsim) to model the host and NDP cores and Ramulator [87] (https://github.com/CMU-SAFARI/ramulator) to model the 3D-stacked DRAM [19], [70]. We modify ZSim and Ramulator so that the host CPU cores have lower bandwidth connections to DRAM, while the NDP cores have higher bandwidth connections to DRAM. We use Pin [88] to obtain information about the registers of the segments for phase 1 of ALP. To model the operation of phase 2, we develop a profiling tool based on Algorithm 1. We model the hardware structures and the runtime analysis of phase 3 in ZSim. We model a crossbar in our memory model in Ramulator to model the communication of data between different NDP cores in different vaults on the logic layer of the 3D-stacked system. Our proposals can also be adapted to other host + NDP architectures with asymmetric memory hierarchy properties.

Table 1 describes the system configuration we use to evaluate our proposal. Our system consists of x86 Out-of-Order (OoO) cores for both host and NDP sides.<sup>3</sup> The host and NDP cores have private L1 caches, but the host cores also leverage L2 and LLC which are only inclusive for the host. The NDP and host CPU cores run at the same 2.4 GHz frequency, with the goal of decoupling the effects of data movement and the memory hierarchy from the processing capabilities. We demonstrate our analysis on a single core CPU and single core NDP to isolate other effects (e.g., the interactions between the threads and sharing resources) from the inter-segment data movement overhead. We also show the advantages of ALP with more cores (configuration with 8 NDP cores and 8 host CPU cores, and configuration with 32 NDP cores and 8 host CPU cores).

TABLE 1: Baseline, HMC and NDP configurations.

| Component | Configuration |
|-----------|---------------|
| <b>OoO Execution Cores</b> | 2.4 GHz, 32 nm; 4-wide out-of-order; 128-entry ROB; 32-entry LQ and 32-entry SQ; Branch predictor: Two-level GAs; 4,096-entry BTB; 1 branch per fetch |
| <b>L1 Data + Inst. Cache</b> | 32 KB, 8-way, 4-cycle; 64 B line; LRU policy; 5/33 pJ per hit/miss [89] |
| <b>L2 Cache (only CPU)</b> | Private 256 KB, 8-way, 7-cycle; 64 B line; LRU policy; Prefetcher: Stream prefetcher with 16 entries; 6/93 pJ per hit/miss [89] |
| <b>LLC Cache (only CPU)</b> | Shared 8 MB, 16-way, 27-cycle; 64 B line; LRU policy; Inclusive for CPU; MESI protocol; 945/1904 pJ per hit/miss [89] |
| <b>3D-stacked DRAM</b> | 4 GB, 32 vaults, 8 DRAM banks/vault; DRAM: CAS, RP, RCD, RAS and CWD latency (9-9-9-24-7 cycles); 2 pJ/bit SerDes links [90]; 2 pJ/bit internal, 8 pJ/bit logic layer [71] |

## 5.2 Evaluated Applications

Table 2 shows the list of the workloads we use in this work and their respective input sizes. We select a broad range of applications from popular benchmark suites and various domains. To demonstrate different partitioning scenarios, the selected applications include a wide range of memory-intensive and compute-intensive workloads. The *common feature* of these workloads is that they have some segments that ideally would take advantage of execution on the NDP cores and some other segments that would take advantage of execution on the host CPU cores. Therefore, for each of these applications, the *potential* performance benefit from partitioning the application between the NDP and host CPU cores is high.

TABLE 2: Evaluated workloads and input sets.

| Application            | Benchmark Suite | Domain                     | Input Parameters           |
|------------------------|-----------------|----------------------------|----------------------------|
| KCore-Decomposition    | Ligra [91]      | Graph Processing           | rMat 1M [92]               |
| Radii                  | Ligra [91]      | Graph Processing           | rGnutella [93]             |
| RayTrace               | Parsec [94]     | Graphics                   | simlarge                   |
| Backpropagation        | Rodinia [95]    | Machine Learning           | 524288 elements            |
| Breadth-First Search   | Rodinia [95]    | Graph Processing           | graph1MW                   |
| Breadth-First Search   | Ligra [91]      | Graph Processing           | rMat 1M [92]               |
| Needleman-Wunsch       | Rodinia [95]    | Bioinformatics             | dimension 4096 penalty 10  |
| Particle Filter        | Rodinia [95]    | Statistics                 | x 128 y 128 z 10 np 400000 |
| Ocean (contiguous)     | Splash-2 [96]   | High-Performance Computing | 514×514 grid               |
| Ocean (non-contiguous) | Splash-2 [96]   | High-Performance Computing | 514×514 grid               |

To further analyze the workloads, we profile the applications to analyze their memory-bound behavior using Intel VTune [97] on an Intel Xeon E3-1240 processor with 4 cores. Table 3 shows the most memory-bound function of each workload and the fraction of the whole application's execution time that it takes. The memory-bound measure refers to the ratio of cycles spent *waiting for memory accesses* over the total execution time. We observe that all our selected workloads have functions that take a notable amount of execution time and are memory-bound. We conclude that these applications would ideally benefit from partitioning between the host and NDP cores.

<sup>3.</sup> When using different processors or Instruction Set Architectures (ISA), the compiler can also generate appropriate code based on our techniques.

TABLE 3: Workload Characteristics.

| Workload                       | Function            | Time (%) | Mem-bound (%) | Mem. accesses |
|--------------------------------|---------------------|----------|---------------|---------------|
| Kcore-decomposition            | edgeMapDense        | 52.7     | 53.82         | 2.6 GB        |
| Radii                          | edgeMapDense        | 80.78    | 52.41         | 136 MB        |
| RayTrace                       | [VTune format]      | 62.35    | 6.52          | 70.13 MB      |
| Backpropagation                | bpnn_adjust_weights | 61.82    | 86.50         | 3.9 GB        |
| Breadth-First Search (Rodinia) | BFSGraph            | 5.45     | 22.54         | 1.39 GB       |
| Breadth-First Search (Ligra)   | edgeMapDense        | 30.86    | 34.08         | 2.5 GB        |
| Needleman-Wunsch               | nw_optimized        | 42.54    | 39.66         | 81 MB         |
| Particle Filter                | [VTune format]      | 3.99     | 2.70          | 3.09 MB       |
| Ocean (contiguous)             | slave2              | 24.41    | 22.98         | 2.62 GB       |
| Ocean (non-contiguous)         | slave2              | 16.00    | 21.96         | 1.23 GB       |

# 6 EVALUATION

In this section, we show the performance and energy benefits of ALP for various workloads. Throughout this section, **No\_DM** refers to partitioning the application based on the architectural suitability of each segment with *zero* data movement cost, and **DM\_Included** refers to partitioning the application based on the architectural suitability of each segment with the cost of data movement included.

## 6.1 Performance

In this section, we analyze the performance benefit of ALP, compared to only host CPU, only NDP, DM\_Included and No\_DM execution for the workloads with the potential for data movement alleviation. As described in Section 3.2.1, reducing the performance overhead of inter-segment data movement using the proactive data transfer approach can be possible if the time for transferring the data can be mostly overlapped and hidden with other operations. This means there should be more instructions between the instruction that *generates* the inter-segment data and the instruction that *consumes* the data. We show the performance benefits of ALP for workloads without the potential for proactive data transfer in Section 6.4. Synth\_WL is the synthetic workload in Listing 1 to demonstrate the proactive data transfer technique.

Figure 6 shows the performance benefit of ALP with 8 NDP cores and 8 host CPU cores. We observe that ALP performs 54.3% better than host-only and 45.4% better than NDP-only execution. The memory-bound application segments with high memory bandwidth requirements are able to issue more concurrent memory accesses in the NDP configuration with a larger number of cores and larger available main memory bandwidth. The effectiveness of NDP execution improves in systems with a larger number of cores because they take better advantage of the high main memory bandwidth available to the NDP cores. However, the application segments that benefit from larger caches perform better on the host cores. We conclude that ALP enables efficient partitioning of application segments between the host and NDP cores through efficient inter-segment data movement alleviation.

Fig. 6: Performance benefits of ALP with a total of 16 cores.

Figure 7 shows the performance benefits of ALP for a single-core scenario, which isolates other effects such as thread communication and gives a deeper understanding of how ALP alleviates inter-segment data movement overhead. Based on this figure, we make two observations. First, ALP achieves almost all of the potential performance benefits of partitioning, performing on average 18.9% better than execution only on a host CPU core and 19.7% better than execution only on an NDP core. For most of these workloads, ALP achieves all of the potential benefits of partitioning because it can move the inter-segment data in a timely manner. For some other workloads (ocean\_cp and ocean\_ncp), ALP outperforms the No\_DM configuration because the consumer segments can start execution concurrently as soon as their required inter-segment data arrives (as described in Section 3.2.2). For other workloads (Backprop, KCdec), this technique improves performance but does not reach the maximum possible performance because alleviating the data movement cost is only possible for some clusters. Second, ALP outperforms the DM\_Included case for all workloads, even though in the DM\_Included case each segment maps to the core on which it individually performs best. The reason is that, when partitioning, ALP considers the effect of the inter-segment data movement between the segments and alleviates its performance overhead. We conclude that, using both compiler and runtime information, ALP efficiently maps code segments to either host or NDP cores considering 1) the architectural suitability of each segment, 2) the inter-segment data movement overhead of each segment, and 3) whether this inter-segment data movement overhead can be alleviated proactively and in a timely manner.
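The interaction between architectural suitability and data movement overhead can be sketched for a single producer-consumer pair. This is an illustrative cost model under stated assumptions, not ALP's actual decision logic: each segment has a hypothetical cycle cost on each core type, and a placement that splits the pair across host and NDP additionally pays `exposed_transfer`, the part of the transfer latency that is not hidden. The function `best_placement` and all numbers are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

CORES = ("host", "ndp")


def best_placement(
    producer: Dict[str, int],
    consumer: Dict[str, int],
    exposed_transfer: int,
) -> Tuple[Tuple[str, str], int]:
    """Pick the (producer core, consumer core) pair with the lowest total cost.

    producer / consumer: hypothetical cycle cost of each segment per core type.
    exposed_transfer: cycles of inter-segment transfer that are not hidden.
    """
    best = None
    for p_core, c_core in product(CORES, repeat=2):
        cost = producer[p_core] + consumer[c_core]
        if p_core != c_core:
            # Splitting the pair across host and NDP exposes the transfer cost.
            cost += exposed_transfer
        if best is None or cost < best[1]:
            best = ((p_core, c_core), cost)
    return best


producer = {"host": 100, "ndp": 180}   # cache-friendly segment
consumer = {"host": 300, "ndp": 200}   # memory-bound segment

# Transfer fully exposed: keeping both segments on one side is cheaper.
print(best_placement(producer, consumer, exposed_transfer=150))
# Transfer fully hidden: each segment can run on its best-fitting core.
print(best_placement(producer, consumer, exposed_transfer=0))
```

With these made-up numbers, hiding the transfer changes the preferred mapping from a single-side placement to the split placement in which each segment runs on the core that suits it best.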



Fig. 7: Performance benefits of ALP with data transfers.

### 6.2 The Effect of Core Counts and Types

To study the effect of having different numbers and types of cores in the host and NDP configurations, we study an NDP configuration with 32 in-order cores and a host configuration with 8 OoO cores. Figure 8 shows ALP's performance benefits in such a system. We see that NDP benefits from the higher core count in this case. The runtime system of ALP (as described in Section 3.3), using steps 1, 2, and 7, collects the IPC of the segments on both CPU and NDP cores. This way, it detects the higher performance benefits of NDP and offloads segments accordingly. In this case, ALP performs $2.24\times$ faster than execution only on the host CPU cores, and even 22% faster than NDP-only execution. The reason is that, although the NDP configuration has a much larger number of cores in this case, some application segments still take advantage of the larger cache hierarchy in the host. We conclude that ALP adapts to different system configurations with various numbers and types of cores by incorporating the architecture, input, and runtime information during the third phase.



Fig. 8: Performance benefits of ALP with 8 CPU cores and 32 in-order NDP cores.

### 6.3 Energy

In this section, we show the energy benefits of ALP. The energy consumption of executing an application segment on an NDP core is the sum of the energy spent on the cores, L1 NDP caches, and DRAM. The energy consumption of executing an application segment on a host CPU core is the sum of the energy spent on the cores, L1, L2, and LLC CPU caches, off-chip links, and DRAM. ALP's energy is the sum of the energy spent on the cores, L1, L2, and LLC CPU caches (for segments that access these caches), L1 NDP caches (for segments that access this cache), off-chip links (for data movement between NDP and CPU cores and for CPU memory accesses), and DRAM. The energy-per-access values for each of these elements are listed in Table 1.
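The energy accounting described above can be sketched as a sum over per-component event counts. This is a minimal illustration, assuming energy is modeled as event count times energy per event; the per-event values below are hypothetical placeholders and are not the values of Table 1, and the names `segment_energy` and `alp_energy` are introduced only for this example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical energy per event in picojoules (NOT the values of Table 1).
ENERGY_PJ: Dict[str, int] = {
    "core": 10,
    "cpu_l1": 15,
    "cpu_l2": 50,
    "cpu_llc": 200,
    "ndp_l1": 15,
    "offchip_link": 500,
    "dram": 1000,
}

# Components that contribute to energy for each placement, following the text.
COMPONENTS = {
    "host": ("core", "cpu_l1", "cpu_l2", "cpu_llc", "offchip_link", "dram"),
    "ndp": ("core", "ndp_l1", "dram"),
}


def segment_energy(events: Dict[str, int], placement: str) -> int:
    """Energy (pJ) of one segment executed on a host or an NDP core."""
    return sum(events.get(c, 0) * ENERGY_PJ[c] for c in COMPONENTS[placement])


def alp_energy(segments: Iterable[Tuple[str, Dict[str, int], int]]) -> int:
    """Energy (pJ) of a partitioned run.

    Each entry is (placement, event counts, inter-segment link transfers).
    Inter-segment transfers between NDP and CPU cores use the off-chip links.
    """
    total = 0
    for placement, events, link_transfers in segments:
        total += segment_energy(events, placement)
        total += link_transfers * ENERGY_PJ["offchip_link"]
    return total


host_events = {"core": 100, "cpu_l1": 40, "cpu_l2": 10, "cpu_llc": 5,
               "offchip_link": 2, "dram": 2}
ndp_events = {"core": 100, "ndp_l1": 40, "dram": 8}

print(segment_energy(host_events, "host"))
print(segment_energy(ndp_events, "ndp"))
print(alp_energy([("host", host_events, 0), ("ndp", ndp_events, 3)]))
```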

Figure 9 shows the energy consumption of NDP-only execution, host-CPU-only execution, and ALP. We observe that ALP provides significant energy improvement over both NDP and CPU executions ($4.5\times$ and $2.12\times$, respectively). The reason is that segments that map to a host CPU core take advantage of the large LLC and reduce the number of accesses that go to memory. They also capture the reuse between segments that have high data movement between each other, avoiding extra off-chip data communication. Segments that map to an NDP core are those with random memory accesses, which would have led to high LLC miss rates. By executing on NDP cores, these segments do not pay the extra cost of accessing the LLC and subsequently bringing data from DRAM to the LLC via the off-chip links.

### 6.4 Segments without Data Movement Alleviation

Fig. 9: Energy consumption of ALP.

In this section, we present ALP's performance benefit for the inseparable segments described in Section 3.3.3. These are the segments without TRANSFER instructions because the data transfer between them could not be overlapped with other instructions. Figure 10 shows the performance benefit of ALP for inseparable segments. We make two key observations. First, ALP performs on average 32.8% better than mapping segments based on their individual characteristics (DM\_Included). ALP avoids mapping these segments to different cores by considering the effect of inter-segment data movement. ALP profiles the aggregated behavior of the segments over the epochs of execution and maps the inseparable segments to the core that is collectively the most profitable candidate for them. This way, ALP avoids the performance loss that would have resulted from neglecting the inter-segment data movement overhead between these segments. Second, although Particle Finder is heavily compute-bound, we observe that it takes advantage of NDP execution. The reason is that the working set of this application is small enough to fit even in the small NDP caches. Therefore, ALP's runtime system (Section 3.3) detects that the host-side LLC is not more efficient than L1, and offloads the application segments to the NDP cores accordingly. In this case, the workload performs better on the NDP cores because it does not spend extra time on unnecessary L2 and LLC accesses in the host.
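The collective mapping of inseparable segments can be sketched as follows. This is a simplified illustration, assuming the per-epoch profile is available as cycle counts per segment and core type (ALP's runtime collects IPC; cycle counts are used here only to keep the arithmetic simple). The function `map_inseparable_cluster` and the profile data are hypothetical.

```python
from typing import Dict, List

CORES = ("host", "ndp")


def map_inseparable_cluster(profile: Dict[str, Dict[str, List[int]]]) -> str:
    """Choose one core type for a whole cluster of inseparable segments.

    profile[segment][core] holds hypothetical per-epoch cycle counts of that
    segment on that core type. The cluster is mapped as a unit to the core
    type with the lowest aggregated cost, so no inter-segment transfer occurs.
    """
    totals = {core: 0 for core in CORES}
    for per_core in profile.values():
        for core in CORES:
            totals[core] += sum(per_core[core])
    return min(CORES, key=lambda core: totals[core])


profile = {
    # Individually, seg_a prefers the host and seg_b prefers NDP.
    "seg_a": {"host": [100, 110], "ndp": [150, 160]},
    "seg_b": {"host": [300, 310], "ndp": [200, 190]},
}
print(map_inseparable_cluster(profile))
```

In this made-up profile the two segments disagree individually, but their aggregated cost is lower on NDP, so the whole cluster is mapped there rather than being split.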



Fig. 10: Performance benefit of ALP for inseparable segments.

### 6.5 Area Overhead

In this section, we determine the area overhead of ALP by calculating the size of the Offload Table (*Table\_Size*) as follows:

$$Table\_Size = Row\_Count \times (Ratio + ID + Block\_Type + L1LLC\_Ratio + IPC \times 2 + Decision) \tag{2}$$

where the number of table rows (*Row\_Count*) is determined by the number of distinct offload units. In this work, we use 50 rows, which is significantly more than the maximum number of offload units ALP extracts for the applications we studied. With 4 bits for representing the ratio of the inter-segment data (*Ratio*), 6 bits for the block ID (*ID*), 2 bits for the block type, i.e., producer, consumer, or inseparable segment (*Block\_Type*), 4 bits for the ratio of cache misses (*L1LLC\_Ratio*), 4 bits for representing IPC at 16-level granularity (*IPC*), and one decision bit (*Decision*), the table size becomes 1.25 KB, which is significantly less than the L1 cache size. Based on our CACTI [98] simulations, a table of this size can be accessed within one clock cycle. If other applications have a large number of offload units, the rows of the Offload Table can be replaced using an LRU policy.
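The table-size arithmetic and the LRU fallback can be sketched together. The field widths and row count are those listed above; treating the "IPC × 2" term as one IPC field per core type is an assumption, and the `OffloadTable` class is a hypothetical software model rather than the hardware design. With these widths, each row takes 25 bits, so 50 rows take 1250 bits, which lines up with the quoted 1.25 figure if that figure is read in kilobits (again an assumption about the unit).

```python
from collections import OrderedDict

# Field widths in bits, as listed in the text.
FIELD_BITS = {
    "ratio": 4,
    "id": 6,
    "block_type": 2,
    "l1llc_ratio": 4,
    "ipc_host": 4,  # assumption: "IPC x 2" means one IPC field per core type
    "ipc_ndp": 4,
    "decision": 1,
}
ROW_COUNT = 50


def table_size_bits(row_count: int = ROW_COUNT) -> int:
    """Total Offload Table size in bits: rows times bits per row."""
    return row_count * sum(FIELD_BITS.values())


class OffloadTable:
    """Hypothetical software model of a fixed-capacity table with LRU replacement."""

    def __init__(self, capacity: int = ROW_COUNT) -> None:
        self.capacity = capacity
        self.rows = OrderedDict()  # least recently used entry comes first

    def lookup(self, unit_id: int):
        row = self.rows.get(unit_id)
        if row is not None:
            self.rows.move_to_end(unit_id)  # mark as most recently used
        return row

    def insert(self, unit_id: int, row: dict) -> None:
        if unit_id in self.rows:
            self.rows.move_to_end(unit_id)
        elif len(self.rows) >= self.capacity:
            self.rows.popitem(last=False)  # evict the least recently used row
        self.rows[unit_id] = row


print(table_size_bits())  # bits for 50 rows

table = OffloadTable(capacity=2)
table.insert(1, {"decision": 1})
table.insert(2, {"decision": 0})
table.lookup(1)                    # unit 1 becomes most recently used
table.insert(3, {"decision": 1})   # evicts unit 2
print(list(table.rows))
```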

# 7 RELATED WORK

To our knowledge, ALP is the first programmer-transparent mechanism to alleviate the inter-segment data movement overhead between the host and NDP cores by proactively transferring data between application segments. In this section, we discuss prior works that are related to different aspects of our work.

### 7.1 Offloading Applications to NDP Computation Units

Prior works take two approaches to inter-segment data movement when partitioning applications between the host and NDP computation units. The first class of works maps segments to host or NDP based on the characteristics of each segment by considering the memory access behavior of each segment individually [20], [70], [71]. Such works offload the memory-bound application segments to the NDP computation units, and keep the more cache-friendly segments on the host CPU cores. For example, CoNDA [20] assumes a programmer-annotated partitioning between NDP and CPU, and its goal is to enable efficient coherence between the partitions. DAMOV [70] provides new insights into different data movement bottlenecks and uses these insights to determine whether NDP or other data movement mitigation techniques are suitable for different applications. These works do not focus on finding an efficient approach for partitioning the application to alleviate the overall data movement overhead. Since these approaches consider the memory bottlenecks of each segment individually and in isolation from the other segments, they suffer from inter-segment data movement overhead between the host cores and NDP execution units.

The second class of works maps application segments to the host or NDP computation units based on the overall memory bandwidth saving of each segment, which depends on the memory bandwidth saving within each segment and the inter-segment data movement overhead with other segments [21], [72], [73], [99]–[101]. If partitioning segments leads to high inter-segment data movement overhead, these works do not map such segments to the different cores that would be most beneficial for each segment. Therefore, as shown in Section 2.2, these works miss some of the potential benefits of partitioning. In contrast, ALP proposes techniques for alleviating inter-segment data movement overhead to enable efficient application partitioning between NDP and host cores. ALP can be tuned for adoption in different NDP proposals that assume different execution units in the logic layer.

### 7.2 Co-Locating Computation and Data

Prior work has studied placing data close to computation in different contexts. Hardware prefetchers [102], [103] do not properly detect inter-segment data with irregular access patterns. Complex prefetchers [104], [105] need a long time to train over a large set of data; such training cannot be timely for small inter-segment data or when execution moves quickly between CPU and NDP cores. Software prefetchers [80], [81] execute next to the code in the next segment and therefore do not transfer the data proactively. Works on thread migration acceleration [106] and caching techniques [107] are orthogonal to this work and can further improve ALP's performance. Prior works on co-locating computation and data do not study inter-segment data movement in the context of systems with host and NDP computation units and do not consider the asymmetry in the memory hierarchy. These factors leave significant challenges to address in the context of NDP. In particular, the performance and energy overheads of communication between NDP and host computation units are very high due to off-chip communication. Since NDP's goal is to reduce the overhead of data movement, this extra communication due to data sharing with the host can reduce the potential benefits of NDP. This work addresses these challenges in the context of NDP using compiler and hardware support.

Data Marshaling [74] mitigates inter-core data misses in Staged-Execution models using proactive data transfers (marshaling). While ALP also uses proactive data transfers as part of its second step, in ALP it is not known statically where each segment maps. This factor, along with the expensive off-chip communication in the NDP scenario, imposes more challenges for efficient data transfer in the context of NDP and host systems. Data Marshaling also does not address the problem of efficiently partitioning applications, and assumes that the stages of the applications are already known.

Livia [108] proposes a new system architecture and programming model that co-locates tasks and data throughout the memory hierarchy with the goal of reducing data movement. In ALP, we solve a different problem. We show that if two segments would ideally map to different NDP/CPU cores, the cost of data movement between them could offset the benefits of partitioning. Therefore, by alleviating the data movement between them, we enable them to map to their ideal cores. ALP can be integrated into Livia's system to further improve its performance by alleviating inter-segment data movement between different parts of the application mapped to different computation units throughout the memory hierarchy. Tang et al. [109] propose a compiler algorithm that maps computations to different NDP cores to reduce the distance-to-data on the on-chip network. ALP's approach can further improve the performance of this proposal by applying proactive data transfer.

# 8 CONCLUSION

We identify and characterize an important aspect of NDP: the inter-segment data movement overhead between NDP and CPU cores when code is partitioned between them. We demonstrate that the inter-segment data movement overhead can significantly diminish the potential performance benefits from NDP. To fully leverage NDP, we introduce a programmer-transparent hardware-software cooperative mechanism, ALP, that (1) considers and alleviates the performance impact of data movement and (2) efficiently partitions applications between NDP and CPU, factoring in both architectural suitability and estimated data movement overhead. Our analyses on a wide range of workloads show that ALP can achieve almost all the benefits of partitioning in workloads with the potential for proactive data transfer.

# REFERENCES

- [1] Christina Giannoula, Ivan Fernandez, Juan Gómez Luna, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-in-Memory Architectures. SIGMETRICS, 2022.
- [2] Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F Oliveira, and Onur Mutlu. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System. In *IEEE Access*, 2022.
- [3] Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. Simultaneous Multi-Layer Access: Improving 3D-stacked Memory Bandwidth at Low Cost. TACO, 2016.
- [4] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing. In ISCA, 2015.
- [5] Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, and Hyesoon Kim. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks. In *HPCA*, 2017.
- [6] Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks. In ASPLOS, 2018.
- [7] Amirali Boroumand, Saugata Ghose, Brandon Lucia, Kevin Hsieh, Krishna Malladi, Hongzhong Zheng, and Onur Mutlu. LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory. CAL, 2017.
- [8] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L Greathouse, Lifan Xu, and Michael Ignatowski. TOP-PIM: Throughput-Oriented Programmable Processing in Memory. In *HPDC*, 2014.
- [9] Mingyu Gao and Christos Kozyrakis. HRL: Efficient and flexible reconfigurable logic for near-data processing. In *HPCA*, 2016.
- [10] Jeremie S Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu. GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-memory Technologies. BMC Genomics, 2018.
- [11] Mario Drumond, Alexandros Daglis, Nooshin Mirzadeh, Dmitrii Ustiugov, Javier Picorel, Babak Falsafi, Boris Grot, and Dionisios Pnevmatikatos. The Mondrian Data Engine. In ISCA, 2017.
- [12] P. C. Santos, G. F. Oliveira, D. G. Tomé, M. A. Z. Alves, E. C. Almeida, and L. Carro. Operand Size Reconfiguration for Big Data Processing in Memory. In DATE, 2017.
- [13] Geraldo F Oliveira, Paulo C Santos, Marco AZ Alves, and Luigi Carro. NIM: An HMC-Based Machine for Neuron Computation. In ARC, 2017.
- [14] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. PIM-enabled Instructions: A Low-overhead, Locality-aware Processing-in-memory Architecture. In *ISCA*, 2015.
- [15] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory. In ASPLOS, 2017.
- [16] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. In ISCA, 2016.
- [17] Peng Gu, Shuangchen Li, Dylan Stow, Russell Barnes, Liu Liu, Yuan Xie, and Eren Kursun. Leveraging 3D Technologies for Hardware Security: Opportunities and Challenges. In *GLSVLSI*, 2016.
- [18] Dong Uk Lee, Kyung Whan Kim, Kwan Weon Kim, Hongjung Kim, Ju Young Kim, Young Jun Park, Jae Hwan Kim, Dae Suk Kim, Heat Bit Park, Jin Wook Shin, et al. A 1.2V 8Gb 8-Channel 128GB/s High-Bandwidth Memory (HBM) Stacked DRAM with Effective Microbump I/O Test Methods Using 29nm Process and TSV. In ISSCC, 2014.
- [19] Hybrid Memory Cube Consortium. Hybrid Memory Cube Specification Rev. 2.0, 2013. http://www.hybridmemorycube.org/.
- [20] Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Rachata Ausavarungnirun, Kevin Hsieh, Nastaran Hajinazar, Krishna T Malladi, Hongzhong Zheng, and Onur Mutlu. CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators. In *ISCA*, 2019.

- [21] Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W Keckler. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. In ISCA, 2016.
- [22] Damla Senol Cali, Gurpreet S Kalsi, Zülal Bingöl, Can Firtina, Lavanya Subramanian, Jeremie S Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, et al. GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis. In *MICRO*, 2020.
- [23] Q. Zhu, T. Graf, H. E. Sumbul, L. Pileggi, and F. Franchetti. Accelerating Sparse Matrix-Matrix Multiplication with 3D-Stacked Logic-in-Memory Hardware. In *HPEC*, 2013.
- [24] S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, et al. NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In *ISPASS*, 2014.
- [25] Amin Farmahini-Farahani, Jung Ho Ahn, Katherine Morrow, and Nam Sung Kim. NDA: Near-DRAM Acceleration Architecture Leveraging Commodity DRAM Devices and Standard Memory Modules. In *HPCA*, 2015.
- [26] Gabriel H Loh, Nuwan Jayasena, M Oskin, Mark Nutter, David Roberts, Mitesh Meswani, Dong Ping Zhang, and Mike Ignatowski. A Processing in Memory Taxonomy and a Case for Studying Fixed-Function PIM. In *WoNDP*, 2013.
- [27] Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K Mishra, Mahmut T Kandemir, Onur Mutlu, and Chita R Das. Scheduling Techniques for GPU Architectures With Processing-in-memory Capabilities. In *PACT*, 2016.
- [28] Berkin Akin, Franz Franchetti, and James C Hoe. Data Reorganization in Memory Using 3D-Stacked DRAM. In ISCA, 2015.
- [29] Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K Chang, Amirali Boroumand, Saugata Ghose, and Onur Mutlu. Accelerating Pointer chasing in 3D-stacked Memory: Challenges, Mechanisms, Evaluation. In *ICCD*, 2016.
- [30] Oreoluwatomiwa O Babarinsa and Stratos Idreos. JAFAR: Near-Data Processing for Databases. In SIGMOD, 2015.
- [31] Joo Hwan Lee, Jaewoong Sim, and Hyesoon Kim. BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models. In PACT, 2015.
- [32] Fabrice Devaux. The True Processing In Memory Accelerator. In Hot Chips, 2019.
- [33] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory. In ISCA, 2016.
- [34] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In ISCA, 2016.
- [35] Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A Kozuch, Onur Mutlu, Phillip B Gibbons, and Todd C Mowry. Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology. In *MICRO*, 2017.
- [36] Vivek Seshadri and Onur Mutlu. In-DRAM Bulk Bitwise Execution Engine. arXiv:1905.09822 [cs.AR], 2019.
- [37] Shuangchen Li, Dimin Niu, Krishna T Malladi, Hongzhong Zheng, Bob Brennan, and Yuan Xie. Drisa: A DRAM-Based Reconfigurable In-Situ Accelerator. In *MICRO*, 2017.
- [38] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B Gibbons, Michael A Kozuch, et al. RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization. In *MICRO*, 2013.
- [39] Vivek Seshadri and Onur Mutlu. The Processing Using Memory Paradigm: In-DRAM Bulk Copy, Initialization, Bitwise AND and OR. arXiv:1610.09603 [cs.AR], 2016.
- [40] Quan Deng, Lei Jiang, Youtao Zhang, Minxuan Zhang, and Jun Yang. DrAcc: A DRAM Based Accelerator for Accurate CNN Inference. In DAC, 2018.
- [41] Xin Xin, Youtao Zhang, and Jun Yang. ELP2IM: Efficient and Low Power Bitwise Operation Processing in DRAM. In *HPCA*, 2020.
- [42] Linghao Song, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. GraphR: Accelerating Graph Processing Using ReRAM. In HPCA, 2018.

- [43] Linghao Song, Xuehai Qian, Hai Li, and Yiran Chen. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. In HPCA, 2017.
- [44] Fei Gao, Georgios Tziantzioulis, and David Wentzlaff. ComputeDRAM: In-Memory Compute Using Off-the-Shelf DRAMs. In MICRO, 2019.
- [45] Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, and Reetuparna Das. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. In *ISCA*, 2018.
- [46] Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das. Compute Caches. In HPCA, 2017.
- [47] Daichi Fujiki, Scott Mahlke, and Reetuparna Das. Duality Cache for Data Parallel Acceleration. In *ISCA*, 2019.
- [48] Ivan Fernandez, Ricardo Quislant, Eladio Gutiérrez, Oscar Plata, Christina Giannoula, Mohammed Alser, Juan Gómez-Luna, and Onur Mutlu. NATSA: A Near-Data Processing Accelerator for Time Series Analysis. In *ICCD*, 2020.
- [49] Nastaran Hajinazar, Geraldo F Oliveira, Sven Gregorio, João Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gómez-Luna, and Onur Mutlu. SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM. In ASPLOS, 2021.
- [50] Christina Giannoula, Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Ivan Fernandez, Juan Gómez-Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, and Onur Mutlu. SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures. In HPCA, 2021.
- [51] Amirali Boroumand, Saugata Ghose, Geraldo F Oliveira, and Onur Mutlu. Polynesia: Enabling High-Performance and Energy-Efficient Hybrid Transactional/Analytical Databases with Hardware/Software Co-Design. *ICDE*, 2022.
- [52] Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu. Mitigating Edge Machine Learning Inference Bottlenecks: An Empirical Study on Accelerating Google Edge Models. arXiv:2103.00768 [cs.AR], 2021.
- [53] Amirali Boroumand. Practical Mechanisms for Reducing Processor-Memory Data Movement in Modern Workloads. PhD thesis, Carnegie Mellon University, 2020.
- [54] Geraldo F Oliveira, Paulo C Santos, Marco AZ Alves, and Luigi Carro. A Generic Processing in Memory Cycle Accurate Simulator Under Hybrid Memory Cube Architecture. In SAMOS, 2017.
- [55] Jeremie S Kim, Minesh Patel, Hasan Hassan, Lois Orosa, and Onur Mutlu. D-RaNGe: Using Commodity DRAM Devices to Generate True Random Numbers With Low Latency and High Throughput. In *HPCA*, 2019.
- [56] Jeremie S Kim, Minesh Patel, Hasan Hassan, and Onur Mutlu. The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability Tradeoff in Modern Commodity DRAM Devices. In *HPCA*, 2018.
- [57] Maciej Besta, Raghavendra Kanakagiri, Grzegorz Kwasniewski, Rachata Ausavarungnirun, Jakub Beránek, Konstantinos Kanellopoulos, Kacper Janda, Zur Vonarburg-Shmaria, Lukas Gianinazzi, Ioana Stefan, et al. SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems. arXiv:2104.07582 [cs.AR], 2021.
- [58] João Dinis Ferreira, Gabriel Falcao, Juan Gómez-Luna, Mohammed Alser, Lois Orosa, Mohammad Sadrosadati, Jeremie S Kim, Geraldo F Oliveira, Taha Shahroodi, Anant Nori, et al. pLUTo: In-DRAM Lookup Tables to Enable Massively Parallel General-Purpose Computation. arXiv:2104.07699 [cs.AR], 2021.
- [59] Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A Kozuch, Onur Mutlu, Phillip B Gibbons, and Todd C Mowry. Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM. arXiv:1611.09988 [cs.AR], 2016.
- [60] Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Kevin Hsieh, Krishna T Malladi, Hongzhong Zheng, and Onur Mutlu. LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-memory. CAL, 2017.
- [61] Jeremie S Kim, Damla Senol, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu. GRIM-Filter: Fast Seed Filtering in Read Mapping using Emerging Memory Technologies. arXiv:1708.04329 [cs.AR], 2017.

- [62] Saugata Ghose, Kevin Hsieh, Amirali Boroumand, Rachata Ausavarungnirun, and Onur Mutlu. Enabling the Adoption of Processing-in-memory: Challenges, Mechanisms, Future Research Directions. arXiv, 2018.
- [63] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B Gibbons, Michael A Kozuch, et al. RowClone: Accelerating Data Movement and Initialization Using DRAM. arXiv:1805.03502 [cs.AR], 2018.
- [64] Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. Processing Data Where It Makes Sense: Enabling In-Memory Computation. *MicPro*, 2019.
- [65] Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, and Rachata Ausavarungnirun. Enabling Practical Processing in and Near Memory for Data-Intensive Computing. In DAC, 2019.
- [66] Saugata Ghose, Amirali Boroumand, Jeremie S Kim, Juan Gómez-Luna, and Onur Mutlu. A Workload and Programming Ease Driven Perspective of Processing-in-Memory. arXiv:1907.12947 [cs.AR], 2019.
- [67] Ataberk Olgun, Minesh Patel, Abdullah Giray Yağlıkçı, Haocong Luo, Jeremie S. Kim, F. Nisa Bostancı, Nandita Vijaykumar, Oğuz Ergin, and Onur Mutlu. QUAC-TRNG: High-Throughput True Random Number Generation Using Quadruple Row Activation in Commodity DRAMs. In ISCA, 2021.
- [68] Dimitrios Skarlatos, Renji Thomas, Aditya Agrawal, Shibin Qin, Robert Pilawa-Podgurski, Ulya R Karpuzcu, Radu Teodorescu, Nam Sung Kim, and Josep Torrellas. Snatch: Opportunistically Reassigning Power Allocation Between Processor and Memory in 3D Stacks. In *MICRO*, 2016.
- [69] Po-An Tsai, Changping Chen, and Daniel Sanchez. Adaptive Scheduling for Systems with Asymmetric Memory Hierarchies. In *MICRO*, 2018.
- [70] Geraldo F Oliveira, Juan Gómez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad Sadrosadati, and Onur Mutlu. DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. *IEEE Access*, 2021.
- [71] Mingyu Gao, Grant Ayers, and Christos Kozyrakis. Practical Near-Data Processing for In-Memory Analytics Frameworks. In *PACT*, 2015.
- [72] Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, and Hyesoon Kim. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks. In *HPCA*, 2017.
- [73] Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, and Kevin Hsieh. Toward Standardized Near-Data Processing with Unrestricted Data Placement for GPUs. In SC, 2017.
- [74] M. Aater Suleman, Onur Mutlu, José A. Joao, Khubaib, and Yale N. Patt. Data Marshaling for Multi-Core Architectures. *ISCA*, 2010.
- [75] John H Lau. Overview and Outlook of Through-Silicon via (TSV) and 3D Integrations. *Microelectronics International*, 2011.
- [76] Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. A Case for Near Memory Computation Inside the Smart Memory Cube. In EMS, 2016.
- [77] Todd M Austin and Gurindar S Sohi. Dynamic Dependency Analysis of Ordinary Programs. In *ISCA*, 1992.
- [78] Francis Tseng and Yale N Patt. Achieving Out-of-Order Performance with Almost In-Order Complexity. ISCA, 2008.
- [79] Mark C Jeffrey, Suvinay Subramanian, Cong Yan, Joel Emer, and Daniel Sanchez. A Scalable Architecture for Ordered Parallelism. In *MICRO*, 2015.
- [80] Md Kamruzzaman, Steven Swanson, and Dean M Tullsen. Inter-Core Prefetching for Multicore Processors Using Migrating Helper Threads. ASPLOS, 2011.
- [81] Sam Ainsworth and Timothy Jones. Software Prefetching for Indirect Memory Accesses: A Microarchitectural Perspective. TOCS, 2019.
- [82] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. A Scalable Processing-in-memory Accelerator for Parallel Graph Processing. In *ISCA*, 2015.
- [83] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D Hill, and Michael M Swift. Efficient Virtual Memory for Big Memory Servers. In *ISCA*, 2013.
- [84] Javier Picorel, Djordje Jevdjic, and Babak Falsafi. Near-Memory Address Translation. In PACT, 2017.

- [85] Erfan Azarkhish, Davide Rossi, Igor Loi, and Luca Benini. Design and Evaluation of a Processing-in-memory Architecture for the Smart Memory Cube. In ARCS, 2016.
- [86] Daniel Sanchez and Christos Kozyrakis. ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems. In ISCA, 2013.
- [87] Yoongu Kim, Weikun Yang, and Onur Mutlu. Ramulator: A Fast and Extensible DRAM Simulator. CAL, 2016.
- [88] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In *PLDI*, 2005.
- [89] Naveen Muralimanohar, Rajeev Balasubramonian, and Norm Jouppi. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. In *MICRO*, 2007.
- [90] Gwangsun Kim, John Kim, Jung Ho Ahn, and Jaeha Kim. Memory-Centric System Interconnect Design with Hybrid Memory Cubes. In *PACT*, 2013.
- [91] Julian Shun and Guy E Blelloch. Ligra: a Lightweight Graph Processing Framework for Shared Memory. In PPoPP, 2013.
- [92] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-MAT: A Recursive Model for Graph Mining. In SIAM ICDM, 2004.
- [93] Matei Ripeanu. Peer-to-Peer Architecture Case Study: Gnutella Network. In P2P, 2001.
- [94] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In *PACT*, 2008.
- [95] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In *IISWC*, 2009.
- [96] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In ISCA, 1995.
- [97] Intel. Intel VTune Amplifier 2019 User Guide, 2018. https://software.intel.com/en-us/vtune-amplifier-help.
- [98] Ke Chen, Sheng Li, Naveen Muralimanohar, Jung Ho Ahn, Jay B Brockman, and Norman P Jouppi. CACTI-3DD: Architecture-Level Modeling for 3D Die-Stacked DRAM Main Memory. In DATE, 2012.
- [99] Yizhou Wei, Minxuan Zhou, Sihang Liu, Korakit Seemakhupt, Tajana Rosing, and Samira Khan. PIMProf: An Automated Program Profiler for Processing-in-Memory Offloading Decisions. In DATE, 2022.
- [100] Hameeza Ahmed, Paulo C Santos, João PC Lima, Rafael F Moura, Marco AZ Alves, Antônio CS Beck, and Luigi Carro. A Compiler for Automatic Selection of Suitable Processing-in-memory Instructions. In DATE, 2019.
- [101] Jaejin Lee, Yan Solihin, and Josep Torrellas. Automatically Mapping Code on an Intelligent Memory Architecture. In HPCA, 2001.
- [102] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Spatial Memory Streaming. In ISCA, 2006.
- [103] J. Kim, S. H. Pugsley, P. V. Gratz, A. L. N. Reddy, C. Wilkerson, and Z. Chishti. Path Confidence Based Lookahead Prefetching. In *MICRO*, 2016.
- [104] X. Yu, C. J. Hughes, N. Satish, and S. Devadas. IMP: Indirect Memory Prefetcher. In MICRO, 2015.
- [105] Manjunath Shevgoor, Sahil Koladiya, Rajeev Balasubramonian, Chris Wilkerson, Seth H. Pugsley, and Zeshan Chishti. Efficiently Prefetching Complex Address Patterns. In *MICRO*, 2013.
- [106] Jeffery A. Brown, Leo Porter, and Dean M. Tullsen. Fast Thread Migration via Cache Working Set Prediction. In HPCA, 2011.
- [107] N. Beckmann, P. Tsai, and D. Sanchez. Scaling Distributed Cache Hierarchies through Computation and Data Co-Scheduling. In HPCA, 2015.
- [108] Elliot Lockerman, Alex Feldmann, Mohammad Bakhshalipour, Alexandru Stanescu, Shashwat Gupta, Daniel Sanchez, and Nathan Beckmann. Livia: Data-Centric Computing Throughout the Memory Hierarchy. In ASPLOS, 2020.
- [109] Xulong Tang, Orhan Kislal, Mahmut Kandemir, and Mustafa Karakoy. Data Movement Aware Computation Partitioning. In *MICRO*, 2017.



**Nika Mansouri Ghiasi** received the B.S. degree in Electrical Engineering from the University of Tehran, and the M.S. degree in Electrical Engineering from ETH Zürich. She is currently pursuing the Ph.D. degree at ETH Zürich, where she is advised by Onur Mutlu. Her research interests include emerging memory and processing technologies, near-data processing, storage systems, and bioinformatics.



**Nandita Vijaykumar** received the M.S. and Ph.D. degrees from Carnegie Mellon University, in 2019, where she was advised by Prof. Onur Mutlu and Prof. Phil Gibbons. She is currently an Assistant Professor with the Computer Science Department, University of Toronto, and the Department of Computer and Mathematical Sciences, University of Toronto Scarborough, and is affiliated with the Vector Institute. Before joining the University of Toronto, she was a research scientist with the Memory Architecture and Accelerator Laboratory, Intel Labs. In the past, she worked for AMD, Intel, Microsoft, and Nvidia. Her research interests include computer architecture, compilers, and systems with a focus on the interaction between programming models, systems, and architectures. Her current interests include the system-level and programming challenges of robotics and large-scale machine learning. For more information, please visit her website at http://www.cs.toronto.edu/nandita/.



**Geraldo F. Oliveira** received the B.S. degree in computer science from the Federal University of Viçosa, Viçosa, Brazil, in 2015, and the M.S. degree in computer science from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2017. He is currently pursuing the Ph.D. degree with ETH Zürich, Zürich, Switzerland, under the supervision of Onur Mutlu. His current research interests include system support for processing-in-memory and processing-using-memory architectures, data-centric accelerators for emerging applications, approximate computing, and emerging memory systems for consumer devices. He has several publications on these topics.



**Lois Orosa** is a senior researcher at the SAFARI Research Group at ETH Zürich, Switzerland. He received the B.S. and M.S. degrees in Telecommunication Engineering from the University of Vigo, Spain, and the Ph.D. degree from the University of Santiago de Compostela, Spain, and he held a postdoctoral position at the University of Campinas, Brazil. He was a visiting researcher at multiple companies (IBM, Recore Systems, Xilinx, and Huawei) and universities (UIUC and Universidade Nova de Lisboa). His current research interests are in computer architecture, hardware security, reliability, memory systems, and machine learning (ML) accelerators. For more information, please see his webpage at https://loisorosa.github.io/.



**Ivan Fernandez** received the B.S. degree in computer engineering and the M.S. degree in mechatronics engineering from the University of Malaga, in 2017 and 2018, respectively, where he is currently pursuing the Ph.D. degree. His current research interests include processing in memory, near-data processing, stacked memory architectures, high-performance computing, transprecision computing, and time series analysis.



**Onur Mutlu (Fellow, IEEE)** received the B.S. degrees in computer engineering and psychology from the University of Michigan, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Texas at Austin. He is a Professor at ETH Zürich and a Faculty Member with Carnegie Mellon University, where he was the Strecker Early Career Professor. He started the Computer Architecture Group at Microsoft Research and held various positions at Intel, AMD, VMware, and Google. His research interests include computer architecture, systems, hardware security, and bioinformatics. He is an ACM Fellow and an Elected Member of the Academy of Europe. He received the IEEE HPCA Test of Time Award, the IEEE CS Edward J. McCluskey Award, the ACM SIGARCH Maurice Wilkes Award, faculty partnership awards from various companies, and a healthy number of best paper recognitions. For more information, please see his webpage at https://people.inf.ethz.ch/omutlu/.



**Mohammad Sadrosadati** received the B.Sc., M.Sc., and Ph.D. degrees in computer engineering from Sharif University of Technology, Tehran, Iran, in 2012, 2014, and 2019, respectively. From April 2017 to April 2018, during his Ph.D. program, he spent one year as an Academic Guest at ETH Zürich, hosted by Prof. Onur Mutlu. He is currently a Postdoctoral Researcher with ETH Zürich, working under the supervision of Prof. Onur Mutlu. His research interests include heterogeneous computing, processing-in-memory, memory systems, and interconnection networks. For his achievements and impact on improving the energy efficiency of GPUs, he received the Khwarizmi Youth Award, one of the most prestigious awards, as the first laureate in 2020, an honor intended to encourage him to take even bigger steps in his research career.



**Konstantinos Kanellopoulos** received the B.S. and M.S. degrees in Computer Science from the National Technical University of Athens. He is currently pursuing the Ph.D. degree at ETH Zürich, where he is advised by Onur Mutlu. His research interests include hardware/software interfaces and hardware security.



**Juan Gómez Luna** received the B.S. and M.S. degrees in telecommunication engineering from the University of Seville, Spain, in 2001, and the Ph.D. degree in computer science from the University of Córdoba, Spain, in 2012. From 2005 to 2017, he was a Faculty Member of the University of Córdoba. He is currently a Senior Researcher and a Lecturer with the SAFARI Research Group, ETH Zürich. He is the lead author of PrIM (https://github.com/CMU-SAFARI/prim-benchmarks), the first publicly available benchmark suite for a real-world processing-in-memory architecture, and Chai (https://github.com/chai-benchmarks/chai), a benchmark suite for heterogeneous systems with CPU/GPU/FPGA. His research interests include processing-in-memory, memory systems, heterogeneous computing, and hardware and software acceleration of medical imaging and bioinformatics.



**Nastaran Hajinazar** is a Senior Researcher at ETH Zürich. She received the M.S. degree in computer hardware engineering from Sharif University of Technology, Iran, in 2011 and the Ph.D. degree in computer science from Simon Fraser University, British Columbia, Canada, in 2020. Her research incorporates several aspects of computer architecture, with a significant focus on designing efficient high-performance computing systems, memory architectures, and intelligent memory management techniques.