ABSTRACT
Traditional processor-centric computing architectures do not scale out well because servers do not share their local main memories. To bypass this architectural limitation, programmers place their shared state on shared storage. But since storage is slow (many hundreds of microseconds), they improve performance by duplicating the shared state to the compute nodes and running complex coherence protocols to try to keep all copies in sync. In recent years, memory-centric architectures have been proposed as an alternative.
In memory-centric architectures, the shared state is placed in a shared memory pool that can be accessed with extremely low latency. A good memory-centric architecture should also be elastic, reliable, load-balanced, cheaper than DRAM, thinly provisioned, and multi-tenant.
We present an industrial shared memory solution with all these features. As shown in Figure 1, it is comprised of: 1) a scale-out persistent memory (PM) pool, 2) random-access client-side libraries, 3) a control plane, and 4) an RDMA fabric. Scale-out application owners use a library called pmAddr to allocate or connect to a shared logical address space. The client library hides most of the complexity. It communicates with the relevant Data Server (DS) using RDMA, and contacts the control plane only when it does not know which DS holds the relevant memory region or when its speculative destination turns out to be incorrect.
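The lookup behavior described above can be sketched as a client-side cache over the control plane's authoritative mapping. This is a minimal illustration, not the actual pmAddr API; the names `ControlPlane`, `pmAddrClient`, `resolve`, and `invalidate` are all assumptions.

```python
# Hypothetical sketch of the client-side region lookup described above.
# All class and method names are illustrative, not the real pmAddr API.

class ControlPlane:
    """Authoritative map from memory-region id to its Data Server."""
    def __init__(self, region_map):
        self._region_map = dict(region_map)

    def lookup(self, region_id):
        return self._region_map[region_id]


class pmAddrClient:
    """Caches region->DS mappings so the common path goes straight to the
    DS over RDMA; the control plane is contacted only on a cache miss or
    when a speculative destination proves incorrect."""
    def __init__(self, control_plane):
        self._cp = control_plane
        self._cache = {}          # region_id -> data server id
        self.cp_lookups = 0       # how often the control plane was asked

    def resolve(self, region_id):
        if region_id not in self._cache:
            self.cp_lookups += 1
            self._cache[region_id] = self._cp.lookup(region_id)
        return self._cache[region_id]

    def invalidate(self, region_id):
        # Called when a request reveals that the cached DS no longer
        # owns the region (the speculative destination was wrong).
        self._cache.pop(region_id, None)
```

In this sketch, repeated accesses to the same region never touch the control plane, matching the abstract's claim that clients normally communicate directly with the relevant DS.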
Shared memory, unlike storage, should primarily be optimized for low read latency. pmAddr achieves this by combining zero copy on the client, direct client-server communication (i.e., typically no redundant hops), and no software on the server's read data path. This extremely read-performant design holds for both the 1- and 3-copy reliability configurations, regardless of the number of clients.
Shared memory writes are optimized for low latency in a similar manner, but server software is involved in 3-copy configurations. The primary DS for a given memory region exposes the newly written data to readers and returns an Ack to the client only after it has replicated the data and validated that it was successfully written to PM on two other DSs.
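The 3-copy write ordering above can be sketched as follows. This is a toy model under assumed names (`Replica`, `PrimaryDS`), not the server implementation: the key point it illustrates is that data becomes reader-visible and the client is acked only after both replica writes are confirmed.

```python
# Illustrative sketch of the 3-copy write path; names are assumptions.

class Replica:
    """A replica DS with simulated persistent memory."""
    def __init__(self):
        self._pm = {}

    def write(self, addr, data):
        self._pm[addr] = data
        return True               # ack: data persisted to this replica's PM


class PrimaryDS:
    """Primary DS: exposes new data to readers and acks the client only
    after both replicas confirm the write reached their PM."""
    def __init__(self, replicas):
        self._pm = {}
        self._visible = {}        # what readers are allowed to see
        self._replicas = replicas

    def write(self, addr, data):
        self._pm[addr] = data
        acks = [r.write(addr, data) for r in self._replicas]
        if all(acks):
            self._visible[addr] = data
            return True           # ack to client
        return False              # replication failed; data stays hidden

    def read(self, addr):
        return self._visible.get(addr)
```

Before the write completes, readers of the same address see nothing new; after the ack, all three copies hold the data.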
The experimental setup uses a synthetic benchmark (FIO) with metadata-like I/O sizes (0.5 KB), Linux (CentOS 8.3), and commodity off-the-shelf hardware. The hardware included 8 single-socket clients, 8 DS servers equipped with eight 128 GB Optane PM 200 modules, and a 200 GbE switch.
Figure 2a shows the latency of synchronous I/O as a function of load (IOPS). Reads were measured to be available to the application within 4 µs. Read latency remains stable as long as the network is not saturated (as also shown in Figure 2b). The performance of single-copy writes is similar to that of reads.
Triple-copy writes, as expected, are slower and are sensitive to the number of writes per second. Writes complete within 10 µs at low to medium loads, but take longer when the DS CPUs are preoccupied and requests are queued. Using CPUs with more cores, even weaker ones (e.g., Arm), may help improve this in the future.
The pmAddr memory-centric results show 2-3 orders of magnitude lower latency than modern storage, proving that PM-centric processing is possible even today, using 2021 off-the-shelf hardware.
Index Terms
- pmAddr: a persistent memory centric computing architecture