Abstract
Disaggregated memory systems decompose monolithic servers into independent compute and memory nodes, offering high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting high-performance RDMA (Remote Direct Memory Access), compute nodes directly access the remote memory pool without involving remote CPUs. Ordered key-value (KV) stores (e.g., those built on B-trees and learned indexes) keep all data sorted to provide range query services over the high-performance network. However, existing ordered KV stores work poorly on disaggregated memory: they either consume multiple network round trips to search remote data or rely heavily on memory nodes, whose scarce computing resources must then process data modifications. In this article, we propose ROLEX, a scalable RDMA-oriented KV store with learned indexes, to provide efficient ordered data storage and retrieval on disaggregated memory systems. ROLEX leverages a retraining-decoupled learned index scheme that dissociates model retraining from data modification operations by adding a bias and data movement constraints to the learned models. With this decoupling, data modifications are executed directly by compute nodes via one-sided RDMA verbs with high scalability, while model retraining is removed from the critical path and executed asynchronously on memory nodes using their dedicated computing resources. ROLEX further alleviates fragmentation and garbage collection issues by allocating and reclaiming space in fixed-size leaves that are accessed via atomic-size leaf numbers.
Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on static workloads and improves performance on dynamic workloads by up to 2.2× over state-of-the-art schemes on disaggregated memory systems. We have released the source code for public use on GitHub.
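The retraining-decoupled design above builds on the basic learned-index mechanism: a simple model predicts a key's position in a sorted array, and a precomputed worst-case error bound restricts the final search to a small window. The following is a minimal single-machine sketch of that mechanism only, not ROLEX's implementation; all names here are illustrative.

```python
# Illustrative sketch of a learned index: a least-squares linear model maps
# keys to positions in a sorted array, and a worst-case error bound limits
# the final search to a small [pred - err, pred + err] window.
import math
from bisect import bisect_left

class LinearModel:
    """Least-squares line mapping keys to array positions, with an error bound."""
    def __init__(self, keys):
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        cov = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
        var = sum((k - mean_k) ** 2 for k in keys) or 1.0
        self.slope = cov / var
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case |prediction - true position|, padded by 1 so that
        # rounding the prediction never pushes the true position outside
        # the search window.
        self.err = math.ceil(max(
            abs(self.predict(k) - p) for p, k in enumerate(keys))) + 1

    def predict(self, key):
        return self.slope * key + self.intercept

def lookup(keys, model, key):
    """Binary-search only the bounded window instead of the whole array."""
    pred = round(model.predict(key))
    lo = max(0, pred - model.err)
    hi = min(len(keys), pred + model.err + 1)
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None
```

In ROLEX, this idea is extended with a bias and data movement constraints so that inserts can shift data within the model's bounded range without retraining on the critical path; the sketch omits that machinery and the RDMA access layer entirely.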
Index Terms
- A High-performance RDMA-oriented Learned Key-value Store for Disaggregated Memory Systems