DOI: 10.1145/3605573.3605574
ICPP Conference Proceedings · Research article

Conflux: Exploiting Persistent Memory and RDMA Bandwidth via Adaptive I/O Mode Selection

Published: 13 September 2023

Abstract

Persistent Memory (PM) and Remote Direct Memory Access (RDMA) technologies have significantly improved storage and network performance in data centers and have spawned a slew of distributed file system (DFS) designs. Existing DFSs often treat remote storage as a performance constraint, assuming it delivers lower bandwidth and higher latency than local storage devices. However, advances in RDMA technology provide an opportunity to bridge the performance gap between local and remote access, enabling DFSs to leverage both local and remote PM bandwidth and achieve higher overall throughput.
We propose Conflux, a new DFS architecture that exploits the aggregated bandwidth of PM and RDMA networks. Conflux dynamically steers I/O requests to local and remote PM to fully utilize PM and RDMA bandwidth under heavy workloads. To adaptively select the I/O path at run time, we propose SEED, a learning-based policy engine that predicts Conflux I/O latency and makes decisions in real time. Furthermore, Conflux adopts a fine-grained concurrency control approach to improve its scalability. Experimental results show that Conflux achieves up to 4.7× higher throughput than existing DFSs on multi-threaded workloads.
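The abstract describes steering each I/O request to local or remote PM based on predicted latency. The paper's actual SEED model and feature set are not given here; the following is a minimal illustrative sketch, assuming a hand-tuned linear cost model (fixed path cost plus transfer time scaled by queue depth as a crude congestion signal). All names (`PathStats`, `choose_path`) and constants are hypothetical.

```python
# Hypothetical sketch of adaptive I/O mode selection in the style of
# Conflux/SEED. The real system uses a learned latency predictor; this
# stand-in uses a simple analytic estimate for illustration only.
from dataclasses import dataclass


@dataclass
class PathStats:
    queue_depth: int        # outstanding I/Os on this path
    bytes_per_us: float     # observed bandwidth (bytes per microsecond)
    base_latency_us: float  # fixed cost (e.g., RDMA round trip for remote PM)


def predict_latency_us(stats: PathStats, io_size: int) -> float:
    """Estimate completion latency: fixed path cost plus transfer time,
    inflated by the number of I/Os already queued ahead of this one."""
    transfer = io_size / stats.bytes_per_us
    return stats.base_latency_us + transfer * (1 + stats.queue_depth)


def choose_path(local: PathStats, remote: PathStats, io_size: int) -> str:
    """Steer the request to whichever PM path predicts lower latency."""
    if predict_latency_us(local, io_size) <= predict_latency_us(remote, io_size):
        return "local"
    return "remote"


if __name__ == "__main__":
    # An idle local PM path wins for a small write...
    local = PathStats(queue_depth=0, bytes_per_us=8000.0, base_latency_us=0.3)
    remote = PathStats(queue_depth=0, bytes_per_us=6000.0, base_latency_us=2.0)
    print(choose_path(local, remote, 4096))   # local

    # ...but once the local path saturates, traffic spills to remote PM,
    # so both PM devices' bandwidth is exercised under heavy load.
    local.queue_depth = 16
    print(choose_path(local, remote, 4096))   # remote
```

The point of the sketch is only the decision structure: under light load the lower fixed cost of local PM dominates, while under contention the predictor shifts requests onto the RDMA path, aggregating local and remote PM bandwidth.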


Published In

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
August 2023, 858 pages
ISBN: 9798400708435
DOI: 10.1145/3605573
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. RDMA
  2. machine learning
  3. persistent memory

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2023
ICPP 2023: 52nd International Conference on Parallel Processing
August 7 - 10, 2023
Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 128
  • Downloads (last 12 months): 45
  • Downloads (last 6 weeks): 5

Reflects downloads up to 01 Mar 2025
