# Design and Evaluation of a Rack-Scale Disaggregated Memory Architecture For Data Centers

Amit Puri, John Jose, Tamarapalli Venkatesh Dept. of CSE, IIT Guwahati, Assam, India email: {amitpuri, john.jose, t.venkat}@iitg.ac.in

Abstract: Memory disaggregation is being considered as a strong alternative to traditional architectures for dealing with memory under-utilization in data centers. Disaggregated memory can adapt to the dynamically changing memory requirements of data center applications such as data analytics and big data, which require in-memory processing. However, such systems can face high remote memory access latency due to the interconnect speeds. In this paper, we explore a rack-scale disaggregated memory architecture and discuss the various design aspects. We design a trace-driven simulator that combines an event-based interconnect simulator and a cycle-accurate memory simulator to evaluate the performance of a disaggregated memory system at the rack scale. Our study shows that not only the interconnect but also the contention in the remote memory queues adds significantly to remote memory access latency. We introduce a memory allocation policy that reduces this latency compared to conventional policies. We conduct experiments using various benchmarks with diverse memory access patterns. Our study shows encouraging results for rack-scale memory disaggregation, with acceptable average memory access latency.

Index Terms: Data centers, Memory disaggregation, Remote Memory

## I. INTRODUCTION

With high-end server-class multi-processors like Xeon Phi and AMD's EPYC, the compute capability of servers has improved dramatically, with the ability to run multiple applications simultaneously. However, typical workloads in high-performance computing (HPC) facilities and cloud data centers, such as big data analytics and machine learning applications, fall short of server memory due to under-utilization and the memory capacity wall [1]. Due to improper use of on-board memory, memory gets stranded as small fragments within each server node, increasing the total cost of ownership [2]. Disaggregation of memory resources allows a modular approach to managing memory in a fine-grained manner, where memory does not need to be on the same board as the processor. It allows independent upgrade of memory and increases the data center hardware refresh cycle time [3].

In this paper, we study a rack-scale system with partial memory disaggregation, where each compute node has some local memory to fulfill its primary requirements, while most of the application memory requirements are fulfilled by remote memory. The remote memory is managed in the form of multiple remote memory pools within the same rack and is allocated to the compute nodes on demand.

Disaggregated memory introduces several design challenges. First, remote memory placed outside the board should be accessible to multiple compute nodes simultaneously, without significant congestion. Second, the remote memory address space should be exposed in a way that avoids system-level bottlenecks and incurs little overhead. Third, a centralized memory manager is required to handle remote memory allocation and to balance load across memory pools. Different types of remote memory access require different interconnect designs and protocols. Cache-based access requires the support of a memory binding fabric for faster access [4], [5], whereas accessing remote memory in larger chunks requires support for remote direct memory access (RDMA) [6]. We propose a two-level remote memory allocation mechanism, with one level at the compute node and the other at the global memory manager. Our study shows that different memory allocation methods can impact the performance of pool-based remote memory.

The contributions of this work are as follows:

- We explore a rack-scale design for memory disaggregation and discuss the design space.
- We identify the major factors impacting remote memory access latency and propose cost-effective memory allocation policies that also perform load-balancing to reduce tail latency.
- We evaluate the performance of the proposed memory allocation policy on diverse workloads to show the overall impact of memory disaggregation.

## II. RELATED WORK AND MOTIVATION

Earlier designs proposed memory disaggregation at rack scale for traditional server nodes [7]–[9]. Infiniswap [10] and FaRM [11] presented optimizations for virtually disaggregated systems that use RDMA access to remote memory and leverage free memory in other servers. Lim et al. [1], [12] present a general-purpose physically disaggregated memory design, where memory blades are connected to the compute nodes through PCIe buses. Scale-out NUMA [13] presented an on-chip hardware block that provides a low-latency interface between the processor and remote memory. Venice [4] and DEOI [5] also explored similar on-chip modules for remote memory access, with separate channels for fine-grained and paged access to remote memory. Recently, a consortium of hardware industry leaders released protocol standards for a similar on-chip memory-coherent interconnect, Gen-Z, which includes a switch and a pooled memory subsystem [14]. Kommareddy et al. proposed a shared memory approach for pooled memory that exposes a single remote address space to all the compute nodes [15]. On the other hand, a large class of data center applications can fulfill their CPU demands from the computing power available within a single system and require shared access to memory in only a few instances. In such a scenario, multiple compute nodes rarely require shared access to the remote memory. Retaining coherency inside a single domain instead avoids coherency traffic and significantly reduces overhead. Our work leverages the non-coherent use of pooled memory systems to allocate remote memory. Furthermore, disaggregated memory systems have so far been evaluated only at a small scale, either in a virtualized environment [2] or with a host OS [16], by adding a fixed network latency and a fixed division of the address space into local and remote.

## III. RACK SCALE DESIGN

Our approach to memory disaggregation considers only rack-scale remote memory access, because going beyond a rack would further increase the access latency through additional network delay.

### A. Pooled Memory Management

The compute nodes depend on remote memory for most of their memory allocation, so the design must serve a large volume of memory requests from these nodes in a time-bound manner. When a remote memory faces contention in its memory queues, it adds significant tail latency, which makes it necessary to have smaller memory pools and more communication points to achieve high overall memory bandwidth. There is no consensus yet on the number of memory pools within a rack, as disaggregated designs are still at an experimental stage. Our base design includes remote memory in the form of multiple remote pools. Our experiments in later sections establish a correlation between the compute nodes' workload demands and the number of memory pools.

### B. Remote Memory Organization

The remote memory can be made transparent to all compute nodes, so that the operating system at each node can directly allocate memory from any part of it. However, all the nodes must then have a consistent view, maintained through a global memory manager. This approach has scalability issues due to significant coherency traffic to the shared memory. Another approach is to provide mapped access to the remote memory, where any remote memory page belongs exclusively to a single node. The global memory manager can then reserve the remote memory in larger chunks (a few megabytes) without becoming a bottleneck. We chose distributed access to memory in our design.

### C. Interconnect Requirements

Even though fast network switches have substantially decreased network latency, a large part of the network overhead is due to the node's deep protocol stack, slow I/O buses, and protocol conversion while offloading requests off the chip. An on-chip network interface such as a remote memory access controller (RMAC), shown in Fig. 1(a), holds the key for future data centers that join remote memory resources for cache-based load/store access. RMAC is an addressable device that also takes care of the bookkeeping required for routing cache misses toward the appropriate memory pool. Such on-chip interconnects, which enable quick remote memory access, have been explored in the past [4], [5], [13], [17], [18] and are a good match for data center applications. As shown in Fig. 1(b), RMAC forwards the last-level cache (LLC) miss requests that belong to remote memory and implements a lightweight network protocol in hardware. On the other hand, coarse-grained page access can be implemented as a DMA-like channel over the same interface, working with a user-space or kernel-space daemon that monitors hot remote memory pages and occasionally brings them to local memory. RDMA interconnects such as RoCE [19] and InfiniBand [20], which allow one-sided access to remote memory, are already in use in present data centers [21].

Fig. 1. Remote memory access with RMAC.

### D. Global Memory Manager

A global memory manager manages all the remote memory within a single rack. Whenever an application runs short of local memory, it causes a page fault that requests the global memory manager to allocate a chunk from one of the memory pools; this chunk forms an extended local memory address space on the compute node. Linux allows system memory to be extended online using the memory hot-plug service, and the added memory is exposed to the OS page allocator once initialized. It is also essential to allocate the memory in smaller chunks to have more granular control over remote memory. If the allocation size is too small, the mapping table grows huge and incurs significant search latency. If it is too big, memory is under-utilized as in traditional servers, which makes remote memory reclamation challenging because it requires large amounts of data migration. The global memory manager can be hosted at the top-of-rack (ToR) switch and maintains the memory tables for allocated and free remote memory. Once it reserves a remote memory chunk, the manager sends the chunk details to the requesting node, which adds suitable local-to-remote mappings at RMAC for address translation.
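To make the chunk-size trade-off concrete, the following Python sketch models a chunk-granular local-to-remote mapping table of the kind the manager could install at RMAC. The names `ChunkMapping`, `RmacTable`, and `table_entries`, the flat extended address range that starts right after local memory, and the dictionary-based lookup are illustrative assumptions, not the implementation described in this paper. The 4MB chunk size, 256MB of local memory, and 6 pools of 32GB are taken from the experimental setup.

```python
from dataclasses import dataclass

MB = 1 << 20
GB = 1 << 30


@dataclass(frozen=True)
class ChunkMapping:
    pool_id: int        # remote pool that backs this chunk
    pool_offset: int    # byte offset of the chunk inside that pool


class RmacTable:
    """Hypothetical local-to-remote mapping table kept at a compute node."""

    def __init__(self, local_bytes, chunk_bytes):
        self.local_bytes = local_bytes
        self.chunk_bytes = chunk_bytes
        self.chunks = {}                 # chunk index -> ChunkMapping

    def add_chunk(self, mapping):
        """Append a chunk to the extended address space; return its base address."""
        index = len(self.chunks)
        self.chunks[index] = mapping
        return self.local_bytes + index * self.chunk_bytes

    def translate(self, phys_addr):
        """Return None for a local address, else (pool_id, address inside the pool)."""
        if phys_addr < self.local_bytes:
            return None
        index, offset = divmod(phys_addr - self.local_bytes, self.chunk_bytes)
        mapping = self.chunks[index]     # KeyError means the address is unmapped
        return mapping.pool_id, mapping.pool_offset + offset


def table_entries(total_remote_bytes, chunk_bytes):
    """Entries needed to map all remote memory at a given chunk size."""
    return total_remote_bytes // chunk_bytes


table = RmacTable(local_bytes=256 * MB, chunk_bytes=4 * MB)
base = table.add_chunk(ChunkMapping(pool_id=3, pool_offset=8 * MB))
print(table.translate(base + 4096))          # (3, 8392704)
print(table_entries(6 * 32 * GB, 4 * MB))    # 49152
print(table_entries(6 * 32 * GB, 4096))      # 50331648
```

Under these assumptions, mapping the whole rack at 4MB granularity needs about 49 thousand entries, whereas page-sized (4KB) chunks would need about 50 million, which illustrates why very small allocation sizes inflate the mapping table.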



Fig. 2. Random pool selection with alternate local-remote page allocation

## IV. RACK-SCALE MEMORY ALLOCATION

Memory allocation in disaggregated memory systems is two-fold. First, the page allocation policy on compute nodes must decide when to start using remote memory. A node can use *local memory first*, or it can use an *alternate local-remote* approach for allocating consecutive pages. With the first option, the node initially enjoys the benefits of fast memory but suffers a sudden slowdown once the local memory is exhausted. Many applications go through a start-up phase and do not benefit from this scheme. The second option gives better average memory access latency over an extended period but does not make full use of fast local memory. Second, the global memory manager needs a pool-selection policy for allocating a chunk of memory. Although network latency is the major hurdle in remote memory access, our study shows that the pool-selection policy significantly impacts the average memory access time on remote pools. In a pooled remote memory system, each memory pool connects to one of the switch links. Without load-balancing, the few memory pools that receive more requests will face tail latency due to congestion at switch buffers.
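The two node-level page allocation policies can be sketched as follows. This is a minimal illustration, assuming a strict one-to-one alternation between local and remote pages and a fallback to remote memory once local memory is full; the function name `place_pages` and the policy strings are hypothetical.

```python
def place_pages(num_pages, local_capacity, policy):
    """Return 'local'/'remote' placements for consecutive page faults.

    policy is 'local_first' or 'alternate'; local_capacity is in pages.
    """
    placements = []
    local_used = 0
    for i in range(num_pages):
        if policy == "local_first":
            want_local = True
        elif policy == "alternate":
            want_local = (i % 2 == 0)     # even faults local, odd faults remote
        else:
            raise ValueError(f"unknown policy: {policy}")
        if want_local and local_used < local_capacity:
            placements.append("local")
            local_used += 1
        else:
            placements.append("remote")   # local memory full, or remote's turn
    return placements


print(place_pages(6, 2, "local_first"))
# ['local', 'local', 'remote', 'remote', 'remote', 'remote']
print(place_pages(6, 2, "alternate"))
# ['local', 'remote', 'local', 'remote', 'remote', 'remote']
```

With the same local capacity, the alternate policy spreads the local pages over a longer stretch of the allocation sequence, which is the effect the text attributes to it.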

### A. Random Pool Selection

We initially analyze workload WL-Mix (explained in Section V) with random pool selection, in which the global manager randomly selects a pool for every chunk allocation request from a compute node OS (4MB allocation size), and pages are allocated alternately in local and remote memory. As shown in Fig. 2(a), the average memory access latency for each benchmark reaches the order of microseconds, far beyond the limit at which an application can execute normally. The only significant factor behind this high latency is contention in the remote memory queues, which is the consequence of blindly allocating a random remote pool and using it for the application's memory requirements. The same can be observed in Fig. 2(b), which shows only the average remote memory latency, excluding network delays. We probed all memory accesses and found that 2% of them (all remote), with a latency of 1000ns or more, caused the high average latency (shown in Fig. 2(c)). In Fig. 2(d), we show the variation in the number of memory accesses across pools for every 1.5 million cycles. The variation is calculated as the difference between the access counts of the pools with the maximum and minimum number of memory accesses during that period. The large variation shows the imbalance of memory accesses among the pools, which causes high tail latency.
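The per-period variation described above can be computed as in the following sketch. The input format (a list of `(cycle, pool_id)` pairs) and the function name `pool_access_spread` are assumptions made for illustration; only the 1.5-million-cycle period and the max-minus-min definition come from the text.

```python
from collections import Counter


def pool_access_spread(accesses, num_pools, window_cycles=1_500_000):
    """Return max-minus-min per-pool access counts for each period.

    accesses is an iterable of (cycle, pool_id) pairs. Periods in which
    no pool is accessed at all are skipped in this sketch.
    """
    windows = {}
    for cycle, pool_id in accesses:
        windows.setdefault(cycle // window_cycles, Counter())[pool_id] += 1
    spreads = []
    for w in sorted(windows):
        counts = [windows[w][p] for p in range(num_pools)]  # idle pools count as 0
        spreads.append(max(counts) - min(counts))
    return spreads


trace = [(10, 0), (20, 0), (30, 1),
         (1_600_000, 2), (1_700_000, 2), (1_800_000, 2)]
print(pool_access_spread(trace, num_pools=3))  # [2, 3]
```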

### B. Smart-Idle Pool Selection

This policy performs memory pool selection in two steps. The first step selects a small subset of memory pools from all the available pools based on recent memory access traffic. The rationale is that memory pools with less current traffic are the least likely to face contention soon and can be selected now for more memory allocation. The second step selects, from this subset, the pool with the least amount of already allocated memory. One reason is to balance the amount of allocated memory equally among pools. Another reason is that even if a memory pool is currently facing less traffic, it can still suddenly receive more memory requests to previously allocated memory if that pool has been allocated more memory in the past. The pool with the least allocated memory is therefore less likely to face such sudden accesses. To implement this, we use the global memory manager hosted at the rack switch, which keeps track of the total memory accesses to each remote pool. It only requires a small number of 32 or 64-byte counters. We use a window-based mechanism to determine the traffic by measuring an access factor (Af) of each pool at the start of every window. A window is the duration between the allocation of two consecutive remote memory chunks. The memory accesses of the four most recent windows are maintained to determine the activity of each remote pool, since they reflect the most recent status of the memory traffic. The access factor (Af) is calculated as described in (1), with its two terms defined in (2) and (3), where MemAccCount refers to the total number of memory accesses to a memory pool in a window. More weight is given to the memory access count in the most recent window than to older windows, to capture the most recent status. A lower value of Af indicates that a pool has faced less memory traffic recently and can be selected for the next chunk allocation.

$$Af_{win(n+1)} = M_{aC}(New) + \frac{M_{aC}(Old)}{3} \tag{1}$$

$$M_{aC}(New) = MemAccCount_{win(n)} \tag{2}$$

$$M_{aC}(Old) = \sum_{z=n-3}^{n-1} MemAccCount_{win(z)} \tag{3}$$

The smart-idle allocation chooses a less active pool while also distributing the memory chunks evenly across pools. Assuming the total number of memory pools to be *n*, the smart-idle policy first selects a subset of *m* pools with the lowest access factors, where $m = \lceil \log_2 n \rceil$ (for example, *m* = 3 for the 6 pools used in our experiments). It then chooses the pool with the lowest allocated memory from this subset.
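The two-step selection can be sketched in Python as follows. This is an illustrative reading of (1)-(3) and of the subset rule, not the authors' implementation: the class name `SmartIdleSelector`, the per-pool `deque` of four window counts, the zero-initialized history, and the tie-breaking by lowest pool index are all assumptions.

```python
import math
from collections import deque


class SmartIdleSelector:
    """Hypothetical manager-side state for smart-idle pool selection."""

    def __init__(self, num_pools, history=4):
        self.num_pools = num_pools
        self.current = [0] * num_pools     # accesses seen in the open window
        # Per pool: counts of the last four closed windows, oldest first.
        self.history = [deque([0] * history, maxlen=history)
                        for _ in range(num_pools)]
        self.allocated = [0] * num_pools   # bytes allocated so far per pool

    def record_access(self, pool_id):
        self.current[pool_id] += 1

    def access_factor(self, pool_id):
        # Eq. (1): newest window count plus one third of the three older counts.
        *old, new = self.history[pool_id]
        return new + sum(old) / 3

    def select_pool(self, chunk_bytes):
        # A window ends when a chunk is allocated, so close it first.
        for p in range(self.num_pools):
            self.history[p].append(self.current[p])
            self.current[p] = 0
        # Step 1: keep the m pools with the lowest access factor.
        m = max(1, math.ceil(math.log2(self.num_pools)))
        candidates = sorted(range(self.num_pools), key=self.access_factor)[:m]
        # Step 2: among them, pick the pool with the least allocated memory.
        chosen = min(candidates, key=lambda p: self.allocated[p])
        self.allocated[chosen] += chunk_bytes
        return chosen


selector = SmartIdleSelector(num_pools=6)
for pool_id, n in enumerate([50, 5, 40, 0, 30, 10]):
    for _ in range(n):
        selector.record_access(pool_id)
selector.allocated[3] = 8 << 20           # pool 3 is idle but already holds 8MB
print(selector.select_pool(4 << 20))      # 1
```

In this example pools 3, 1, and 5 have the lowest access factors, and pool 1 is chosen because it is the first of them with no memory allocated yet, even though pool 3 is currently the most idle.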

## V. EXPERIMENT METHODOLOGY AND RESULTS

We collect the main memory accesses of each application as traces and combine them to represent many nodes running inside a rack; the combined traces are then processed to simulate the interconnect and memory. Once the traces are ready, the task is to simulate each node's main memory accesses at local or remote memory based on the reference address. The front-end uses Intel's PIN [22] platform to perform binary instrumentation for the application analysis. Our tool is based on the Allcache pin-tool, which performs a functional simulation of the TLB and cache hierarchy. The base tool is modified to support multi-threaded trace collection and to give an approximate timing for each TLB/cache miss. Instrumentation is performed at instruction-level granularity, collecting LLC misses in the same way as in [23]. The cache misses from all the cores are merged into a combined trace that is sorted by time-stamp. The LLC misses eventually give the main memory accesses and preserve the multi-threaded nature of the application, where each record has the virtual address of the LLC miss, its time-stamp, thread-id, and read/write access type. The virtual addresses in the trace are translated by simulating a memory management unit that tracks pages in local and remote memory. For every page fault, it allocates a new page in local or remote memory. A global memory manager is also simulated to serve remote memory allocation requests from nodes.
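As an illustration of the trace format and the time-stamp merge described above, the sketch below merges per-thread miss streams into a single sorted trace. The record layout follows the four fields listed in the text; the type name `MissRecord`, the field names, and the use of `heapq.merge` are assumptions and do not reflect the actual tool.

```python
import heapq
from typing import NamedTuple


class MissRecord(NamedTuple):
    timestamp: int      # approximate cycle of the LLC miss
    thread_id: int
    vaddr: int          # virtual address of the miss
    is_write: bool      # read/write access type


def merge_traces(per_thread_traces):
    """Merge per-thread traces (each already time-sorted) into one sorted trace."""
    return list(heapq.merge(*per_thread_traces, key=lambda r: r.timestamp))


t0 = [MissRecord(10, 0, 0x1000, False), MissRecord(40, 0, 0x2000, True)]
t1 = [MissRecord(25, 1, 0x3000, False)]
merged = merge_traces([t0, t1])
print([r.timestamp for r in merged])  # [10, 25, 40]
```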

The interconnect is simulated as a set of discrete events, where each event corresponds to one CPU cycle. A queue-based mechanism simulates the latency of the NIC as well as the rack-level interconnect. We use finite-sized queues for the NIC and rack switch ports with a back-pressure congestion control policy, and add appropriate queuing delays to the waiting requests once the queues are full. Further, propagation delays and transmission delays are added to each packet according to wire length and link speed, respectively. Each remote memory access is sent in the form of network packets, for which packing and unpacking time is added appropriately. A switch arbiter selects the ready packets from virtual queues at the input ports, avoiding starvation and head-of-line blocking. Finally, we use DRAMSim2 [24] to simulate the main memory, deploying multiple instances: one for the local memory at each compute node and one for the remote memory in each memory pool. The sorted memory accesses from the multi-core front-end allow the memory simulation to represent multi-threaded execution.
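As a rough guide to the fixed per-packet costs, the following sketch computes the transmission delay of a packet from its size and the link speed, and adds the fixed delays listed in Table II. The assumed path (packet preparation, a NIC at each end, one switch traversal, and two link traversals) is a simplification chosen for illustration; propagation delay, queuing, and memory service time are deliberately ignored, and the function names are hypothetical.

```python
def transmission_ns(size_bytes, link_gbps):
    """Serialization time of a packet on a link, in nanoseconds."""
    return size_bytes * 8 / link_gbps      # bits / (Gbit/s) gives ns


def unloaded_one_way_ns(size_bytes, link_gbps=100, nic_ns=10,
                        switch_ns=20, prep_ns=25):
    """Fixed one-way cost on an idle network under the assumed path.

    Assumed path: packet prep -> sender NIC -> link -> switch -> link
    -> receiver NIC. Propagation and queuing delays are not modeled.
    """
    links = 2
    return (prep_ns + 2 * nic_ns + switch_ns
            + links * transmission_ns(size_bytes, link_gbps))


print(transmission_ns(64, 100))    # 5.12  (64B request at 100Gbps)
print(transmission_ns(128, 100))   # 10.24 (128B response at 100Gbps)
print(unloaded_one_way_ns(64))     # fixed cost of a request, in ns
```

Under these assumptions the fixed network cost of a request is well below 100ns, which is consistent with the observation that queuing and contention, rather than the wire itself, dominate the high latencies reported.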

We choose four multi-threaded benchmarks, shown in Table I. The benchmarks vary widely in the total number of memory accesses made during the simulation time and together represent the heterogeneous workload of data center servers. The workload mix *WL-Mix* is deployed for rack-scale experimentation with 64 compute nodes and 6 memory pools, where each benchmark is deployed on 16 nodes, each having 256MB of local memory. We intentionally kept the number of memory pools low, as our aim here is to test the memory pools' maximum-bandwidth limits under a high-workload scenario. Table II summarizes the system parameters used for the simulations. We perform the experiments with both *local-first* and *alternate local-remote* page allocation policies, each run with round-robin and smart-idle pool selection.

TABLE I Benchmarks

| Benchmark Name | Cache Miss-Rate | RAM Accesses (in Millions) | Footprint (in GB) | Label  |
|----------------|-----------------|----------------------------|-------------------|--------|
| lbm_s          | 12.49%          | 45.47                      | 2.7               | WL-Mix |
| fotonik3d_s    | 5.96%           | 11.92                      | 0.57              | WL-Mix |
| fft            | 4.16%           | 15.81                      | 1.06              | WL-Mix |
| fmm            | 3.42%           | 12.5                       | 3.20              | WL-Mix |

TABLE II Simulation Parameters

| Element             | Parameter                                    |
|---------------------|----------------------------------------------|
| CPU                 | 1.2GHz, 8-core                               |
| ITLB                | 128 Entries, 8-Way, 60-cycle                 |
| DTLB                | 64 Entries, 4-Way, 60-cycle                  |
| Cache Size          | 32KB(L1-I/D), 256KB(L2), 16MB(L3)            |
| Cache Associativity | 8-Way(L1), 4-Way(L2), 16-Way(L3)             |
| Cache Latency       | 4-Cycle(L1), 12-Cycle(L2), 41-Cycle(L3)      |
| Cache Type          | Write-Back/Write-Allocate, Round-Robin, 64B  |
| Memory              | 256MB Per node, 32GB per Pool                |
| Switch              | 128x 100Gbps, 132mb Buffer, 20ns delay       |
| RMAC (NIC)          | 100Gbps, 10ns Delay                          |
| Packet-Size         | 64B request, 128B response, 25ns Packet-Prep |

We first discuss local-first page allocation with round-robin pool selection in Fig. 3(a), which shows the cumulative average memory access latency at different simulation points; black marks indicate the time at which no local memory is left. Although the results show a substantial decrease in average access latency compared to random pool selection, it is still high, especially for *lbm* and *fotonik*, because these two benchmarks send the most memory accesses to remote memory. The average latency of *fft* and *fmm* is also too high to maintain sufficient application speed. On the other hand, as shown in Fig. 3(b), smart-idle improves the average latency by a significant margin compared to round-robin pool selection. None of the benchmarks faces serious contention at the remote memory queues, except to some extent after epoch 4, and even then smart-idle keeps the average latency down throughout. Even with local-first allocation, latency increases only gradually for all benchmarks, which is the result of proper load-balancing. Fig. 3(c) shows the cumulative average memory access latency at the memory pools (excluding network delays) for both round-robin and smart-idle pool selection, which is consistent with the above results. Due to a sudden burst of memory requests, round-robin could not balance these requests well across memory pools. With smart-idle allocation, however, chunks were allocated so that memory accesses were divided almost equally across the pools, which is why it gives a better result.

Fig. 3. Average memory access latency with Local-First Allocation

We next measure the performance with *alternate local-remote* allocation, shown in Fig. 4. As expected, there is no sudden burst of remote memory accesses, and we observe a gradual increase in latency for all benchmarks once the local memory is exhausted. Surprisingly, both round-robin and smart-idle perform relatively better than they did with *local-first*. Not having exclusive access to the fast local memory initially has little impact, and even though *lbm* and *fmm* have large memory footprints, they still achieve acceptable average memory latency. Overall, we see the same trend here, with smart-idle performing better than round-robin, although the latency difference is smaller this time. These results show that the overall average memory latency is lowest with *alternate local-remote* page allocation combined with smart-idle pool selection, shown in Fig. 4(b).

Fig. 4. Average memory access latency with Alternate Local-Remote Allocation

Fig. 5. Distribution of remote memory accesses based on access times and Latency breakdown for Local/Remote/Network time

Further analysis of the completion time of all remote memory accesses is shown in Fig. 5(a). This latency includes only the time a memory request spends at the remote memory before it completes. The differently colored bars represent the number of memory accesses completed in each latency category. While the latency with random pool selection was higher, round-robin pool selection brought it down. Even so, a large number of memory requests still take more than 500ns (bars in yellow, light blue, and green), which shows that the round-robin policy is not sufficient to balance the memory traffic equally across the pools. Smart-idle pool selection combined with alternate local-remote page allocation reduces the tail latency better than simple round-robin: the graph shows that very few accesses take more than 500ns to complete. We show the overall latency breakdown for all memory accesses in Fig. 5(b). Smart-idle suffers lower network delays than round-robin, as memory request packets are distributed equally across the multiple links connecting the memory pools. However, the average remote memory access time varies widely across policies for all the benchmarks.

## VI. CONCLUSION AND FUTURE WORK

In this paper, we explored rack-scale memory disaggregated systems, which provide more flexibility in memory utilization but come with some overheads. Remote memory access delay is another aspect we examined through our experiments. High contention in the memory queues of remote memory became the dominating factor in most cases when pool selection was done through conventional policies. The proposed smart-idle pool selection policy evenly distributes the memory traffic load among all the memory pools to counter the high tail latency, and provides a much more balanced combined average access latency to local and remote memory. Further, it will be interesting to see the improvement from optimizations such as remote prefetching or remote page migration to local memory. Disaggregated memory systems will require strategically chosen pages to be migrated from shared remote memory for each node. Considering such approaches to mask the remote memory latency is part of our future work.

## REFERENCES

- [1] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch, "Disaggregated memory for expansion and sharing in blade servers," in *Proceedings of the 36th Annual International Symposium on Computer Architecture*, ser. ISCA '09. New York, NY, USA: Association for Computing Machinery, 2009, p. 267–278. [Online]. Available: https://doi.org/10.1145/1555754.1555789
- [2] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch, "Heterogeneity and dynamicity of clouds at scale: Google trace analysis," in *Proceedings of the Third ACM Symposium on Cloud Computing*, ser. SoCC '12. New York, NY, USA: Association for Computing Machinery, 2012. [Online]. Available: https://doi.org/10.1145/2391229.2391236
- [3] C. Li, H. Franke, C. Parris, and V. Chang, "Disaggregated architecture for at scale computing," in *Proceedings of the 2nd International Workshop on Emerging Software as a Service and Analytics - ESaaSA*, (CLOSER 2015), INSTICC. SciTePress, 2015, pp. 45–52.
- [4] J. Dong, R. Hou, M. Huang, T. Jiang, B. Zhao, S. A. McKee, H. Wang, X. Cui, and L. Zhang, "Venice: Exploring server architectures for effective resource sharing," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016, pp. 507–518.
- [5] Y. Chang, K. Zhang, S. A. McKee, L. Zhang, M. Chen, L. Ren, and Z. Xu, "Extending on-chip interconnects for rack-level remote resource access," in 2016 IEEE 34th International Conference on Computer Design (ICCD), 2016, pp. 56–63.
- [6] N. Schelten, F. Steinert, A. Schulte, and B. Stabernack, "A high-throughput, resource-efficient implementation of the RoCEv2 remote DMA protocol for network-attached hardware accelerators," in 2020 International Conference on Field-Programmable Technology (ICFPT), 2020, pp. 241–249.
- [7] R. Hou, T. Jiang, L. Zhang, P. Qi, J. Dong, H. Wang, X. Gu, and S. Zhang, "Cost effective data center servers," in 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), 2013, pp. 179–187.
- [8] C.-C. Tu, C.-t. Lee, and T.-c. Chiueh, "Marlin: A memory-based rack area network," ser. ANCS '14. New York, NY, USA: Association for Computing Machinery, 2014, p. 125–136. [Online]. Available: https://doi.org/10.1145/2658260.2658262

- [9] H. Montaner, F. Silla, and J. Duato, "A practical way to extend shared memory support beyond a motherboard at low cost," ser. HPDC '10. New York, NY, USA: Association for Computing Machinery, 2010, p. 155–166. [Online]. Available: https://doi.org/10.1145/1851476.1851495
- [10] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin, "Efficient memory disaggregation with Infiniswap," in *Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation*, ser. NSDI'17. USA: USENIX Association, 2017, p. 649–667.
- [11] A. Dragojević, D. Narayanan, M. Castro, and O. Hodson, "FaRM: Fast remote memory," in 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). Seattle, WA: USENIX Association, Apr. 2014, pp. 401–414. [Online]. Available: https://www.usenix.org/conference/nsdi14/technical-sessions/dragojević
- [12] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch, "System-level implications of disaggregated memory," in *IEEE International Symposium on High-Performance Computer Architecture*, 2012, pp. 1–12.
- [13] S. Novakovic, A. Daglis, E. Bugnion, B. Falsafi, and B. Grot, "Scale-out NUMA," in *Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems*, ser. ASPLOS '14. New York, NY, USA: Association for Computing Machinery, 2014, p. 3–18. [Online]. Available: https://doi.org/10.1145/2541940.2541965
- [14] W. Kwon, C. Park, and M. Oh, "Gen-z memory pool system architecture," in 2020 International Conference on Information and Communication Technology Convergence (ICTC), 2020, pp. 1356–1360.
- [15] V. R. Kommareddy, A. Awad, C. Hughes, and S. D. Hammond, "Exploring allocation policies in disaggregated non-volatile memories," in *Proceedings of the Workshop on Memory Centric High Performance Computing*, ser. MCHPC'18. New York, NY, USA: Association for Computing Machinery, 2018, p. 58–66. [Online]. Available: https://doi.org/10.1145/3286475.3286480
- [16] D. Buragohain, A. Ghogare, T. Patel, M. Vutukuru, and P. Kulkarni, "Dime: A performance emulator for disaggregated memory architectures," in *Proceedings of the 8th Asia-Pacific Workshop on Systems*, ser. APSys '17. New York, NY, USA: Association for Computing Machinery, 2017. [Online]. Available: https://doi.org/10.1145/3124680.3124731
- [17] S. Hong, W.-O. Kwon, and M.-H. Oh, "Hardware implementation and analysis of gen-z protocol for memory-centric architecture," *IEEE Access*, vol. 8, pp. 127 244–127 253, 2020.
- [18] G. Liao and L. Bhuyan, "Performance measurement of an integrated nic architecture with 10gbe," in 2009 17th IEEE Symposium on High Performance Interconnects, 2009, pp. 52–59.
- [19] D. Cohen, T. Talpey, A. Kanevsky, U. Cummings, M. Krause, R. Recio, D. Crupnicoff, L. Dickman, and P. Grun, "Remote direct memory access over the converged enhanced ethernet fabric: Evaluating the options," in 2009 17th IEEE Symposium on High Performance Interconnects, 2009, pp. 123–130.
- [20] S. Liang, R. Noronha, and D. K. Panda, "Swapping to remote memory over infiniband: An approach using a high performance network block device," in 2005 IEEE International Conference on Cluster Computing, 2005, pp. 1–10.
- [21] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker, "Network requirements for resource disaggregation," ser. OSDI'16. USA: USENIX Association, 2016, p. 249–264.
- [22] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," *SIGPLAN Not.*, vol. 40, no. 6, p. 190–200, jun 2005. [Online]. Available: https://doi.org/10.1145/1064978.1065034
- [23] N. Alachiotis, A. Andronikakis, O. Papadakis, D. Theodoropoulos, D. Pnevmatikatos, and D. Syrivelis, *dReDBox: A Disaggregated Architectural Perspective for Data Centers*. Cham: Springer International Publishing, 2019, pp. 35–56. [Online]. Available: https://doi.org/10.1007/978-3-319-92792-3\_
- [24] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "Dramsim2: A cycle accurate memory system simulator," *IEEE Computer Architecture Letters*, vol. 10, no. 1, pp. 16–19, 2011.