Abstract
Disaggregated memory systems decompose monolithic servers into independent compute and memory nodes, offering high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting high-performance RDMA (Remote Direct Memory Access), compute nodes directly access the remote memory pool without involving remote CPUs. Ordered key-value (KV) stores (e.g., those built on B-trees and learned indexes) keep all data sorted to provide range query services over the high-performance network. However, existing ordered KV stores work poorly on disaggregated memory: they either consume multiple network round trips to search remote data or rely heavily on memory nodes, whose scarce computing resources must then process data modifications. In this article, we propose ROLEX, a scalable RDMA-oriented KV store with learned indexes, to provide efficient ordered data storage and retrieval on disaggregated memory systems. ROLEX leverages a retraining-decoupled learned index scheme that dissociates model retraining from data modification operations by adding a bias and data movement constraints to the learned models. With this decoupling, data modifications are executed directly by compute nodes via one-sided RDMA verbs with high scalability, while model retraining is removed from the critical path and executed asynchronously on memory nodes using their dedicated computing resources. ROLEX further alleviates fragmentation and garbage collection issues by allocating and reclaiming space in fixed-size leaves that are accessed via atomic-size leaf numbers.
Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on static workloads and improves performance on dynamic workloads by up to 2.2× over state-of-the-art schemes on disaggregated memory systems. We have released the source code for public use on GitHub.
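The retraining-decoupled design above builds on the basic learned-index mechanism: a simple model predicts a key's position in a sorted array, and a precomputed worst-case error bound restricts the final search to a small window. The following is a minimal single-machine sketch of that mechanism only, not ROLEX's implementation; all names here are illustrative.

```python
# Illustrative sketch of a learned index: a least-squares linear model maps
# keys to positions in a sorted array, and a worst-case error bound limits
# the final search to a small [pred - err, pred + err] window.
import math
from bisect import bisect_left

class LinearModel:
    """Least-squares line mapping keys to array positions, with an error bound."""
    def __init__(self, keys):
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
        var = sum((k - mean_k) ** 2 for k in keys) or 1.0
        self.slope = cov / var
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case |prediction - true position|, padded by 1 so that
        # rounding the prediction never pushes the true position outside
        # the search window.
        self.err = math.ceil(max(
            abs(self.predict(k) - p) for p, k in enumerate(keys))) + 1

    def predict(self, key):
        return self.slope * key + self.intercept

def lookup(keys, model, key):
    """Binary-search only the bounded window instead of the whole array."""
    pred = round(model.predict(key))
    lo = max(0, pred - model.err)
    hi = min(len(keys), pred + model.err + 1)
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None
```

In ROLEX, this idea is extended with a bias and data movement constraints so that inserts can shift data within the model's bounded range without retraining on the critical path; the sketch omits that machinery and the RDMA access layer entirely.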
Index Terms
- A High-performance RDMA-oriented Learned Key-value Store for Disaggregated Memory Systems