research-article

A High-performance RDMA-oriented Learned Key-value Store for Disaggregated Memory Systems

Published: 03 October 2023

Abstract

Disaggregated memory systems separate monolithic servers into independent components, including compute and memory nodes, to achieve high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting high-performance RDMA (Remote Direct Memory Access), compute nodes directly access the remote memory pool without involving remote CPUs. Ordered key-value (KV) stores (e.g., B-trees and learned indexes) keep all data sorted to provide range query services over the high-performance network. However, existing ordered KV stores fail to work well on disaggregated memory systems, because they either consume multiple network round trips to search the remote data or rely heavily on memory nodes, which have insufficient computing resources, to process data modifications. In this article, we propose ROLEX, a scalable RDMA-oriented KV store with learned indexes, to coalesce the ordered KV store with disaggregated systems for efficient data storage and retrieval. ROLEX leverages a retraining-decoupled learned index scheme that dissociates model retraining from data modification operations by adding a bias and some data movement constraints to the learned models. Based on this operation decoupling, data modifications are executed directly in compute nodes via one-sided RDMA verbs with high scalability. Model retraining is hence removed from the critical path of data modification and executed asynchronously in memory nodes using dedicated computing resources. ROLEX efficiently alleviates fragmentation and garbage collection issues by allocating and reclaiming space in fixed-size leaves that are accessed via atomic-size leaf numbers.
Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on static workloads and improves performance on dynamic workloads by up to 2.2× over state-of-the-art schemes on disaggregated memory systems. We have released the open-source code for public use on GitHub.
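
To make the idea of decoupling retraining from modification concrete, the following is a minimal single-process sketch, not the ROLEX implementation. It assumes a single linear model over one sorted array, and it interprets the "bias" as a counter that widens the search window by the number of inserts applied since the last training. The class name `LearnedSegment` and all of its methods are hypothetical, and RDMA, fixed-size leaves, and concurrency are deliberately left out.

```python
from bisect import bisect_left, insort


class LearnedSegment:
    """Toy learned index: one linear model plus an error bound and a bias."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.retrain()

    def retrain(self):
        # Fit a line through the first and last key, then record the
        # largest prediction error over all keys currently stored.
        self.n_trained = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        self.slope = (self.n_trained - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        self.err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )
        self.bias = 0  # inserts applied since this training

    def _predict(self, key):
        # Clamp so that keys outside the trained range still map
        # to a valid trained position.
        p = int(round(self.slope * key + self.intercept))
        return min(max(p, 0), self.n_trained - 1)

    def search_range(self, key):
        # Each insert can shift a position right by at most one slot,
        # so the stale model stays usable if the window grows by `bias`.
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + self.bias + 1)
        return lo, hi

    def lookup(self, key):
        lo, hi = self.search_range(key)
        i = bisect_left(self.keys, key, lo, hi)
        return i if i < hi and self.keys[i] == key else None

    def insert(self, key):
        # Modification path: no model update, only the bias grows.
        insort(self.keys, key)
        self.bias += 1


seg = LearnedSegment(range(0, 1000, 10))
for k in (5, 505, 2000, -3):
    seg.insert(k)          # the model is now stale
found_before = [seg.lookup(k) for k in seg.keys]
seg.retrain()              # could run later, off the insert path
found_after = [seg.lookup(k) for k in seg.keys]
```

In this sketch, lookups remain correct between an insert and the next `retrain()` call because the search window grows with the bias; retraining only shrinks the window again. Whether ROLEX bounds the window in this way is not stated in the abstract, so treat the mechanism as an illustration of the general idea only.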

      • Published in

        ACM Transactions on Storage, Volume 19, Issue 4
        November 2023
        238 pages
        ISSN: 1553-3077
        EISSN: 1553-3093
        DOI: 10.1145/3626486

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 3 October 2023
        • Online AM: 5 September 2023
        • Accepted: 14 August 2023
        • Received: 8 May 2023