Abstract
Persistent memory (PM) disaggregation improves resource utilization and failure isolation, making it possible to build a scalable and cost-effective remote memory pool in modern data centers. However, existing distributed transaction schemes were designed for legacy DRAM-based monolithic servers. Because the PM pool offers limited computing power, and because these schemes overlook the bandwidth and persistence properties of real PM, they do not work efficiently on disaggregated PM. In this article, we propose FORD, a Fast One-sided RDMA-based Distributed transaction system for the new disaggregated PM architecture. FORD relies on one-sided remote direct memory access (RDMA) to handle transactions, bypassing the remote CPU in the PM pool. To reduce round trips, FORD batches the read and lock operations into one request, which eliminates extra locking and validation for read-write data. To accelerate transaction commit, FORD updates all remote replicas in a single round trip with parallel undo logging and data visibility control. Moreover, given the limited PM bandwidth, FORD allows backup replicas to serve reads, which alleviates the load on the primary replicas and improves throughput. To guarantee remote data persistency in the PM pool efficiently, FORD selectively flushes data to the backup replicas to mitigate network overheads. Nevertheless, the original FORD wastes some validation round trips when the read-only data have not been modified by other transactions. Hence, we further propose a localized validation scheme that moves the validation of read-only data from remote to local as much as possible, reducing round trips. Experimental results demonstrate that FORD improves transaction throughput by up to 3× and decreases latency by up to 87.4% compared with state-of-the-art systems.
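The round-trip saving from batching the read and lock operations can be illustrated with a toy model. The sketch below is not FORD's implementation and uses no real RDMA library: `ToyPool`, `Record`, `post`, and the two helper functions are hypothetical names introduced only for illustration, and the "network" is a counter that increments once per posted request. It assumes the lock is taken with a compare-and-swap style operation and that a request may carry several operations.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One data item in the toy pool (hypothetical layout)."""
    value: int
    version: int = 0
    locked: bool = False


@dataclass
class ToyPool:
    """Stand-in for the remote PM pool; counts simulated round trips."""
    records: dict = field(default_factory=dict)
    round_trips: int = 0

    def post(self, ops):
        """Execute a list of (op, key) pairs as ONE simulated round trip."""
        self.round_trips += 1
        results = []
        for op, key in ops:
            rec = self.records[key]
            if op == "cas_lock":
                # Models a compare-and-swap on the lock word.
                acquired = not rec.locked
                rec.locked = True
                results.append(acquired)
            elif op == "read":
                # Models a one-sided read of the value and its version.
                results.append((rec.value, rec.version))
            else:
                raise ValueError(f"unknown op: {op}")
        return results


def lock_then_read(pool, key):
    """Baseline: the lock and the read travel in two separate requests."""
    (acquired,) = pool.post([("cas_lock", key)])
    (data,) = pool.post([("read", key)])
    return acquired, data


def lock_and_read_batched(pool, key):
    """Batched: the lock and the read share a single request."""
    acquired, data = pool.post([("cas_lock", key), ("read", key)])
    return acquired, data


if __name__ == "__main__":
    baseline = ToyPool({"a": Record(10)})
    batched = ToyPool({"a": Record(10)})
    lock_then_read(baseline, "a")
    lock_and_read_batched(batched, "a")
    print(baseline.round_trips, batched.round_trips)
```

In this model the baseline path costs two round trips per read-write item and the batched path costs one. Because the item is already locked when it is read, the model also suggests why no later lock or validation step is needed for read-write data.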
Index Terms
- Localized Validation Accelerates Distributed Transactions on Disaggregated Persistent Memory