DOI: 10.1145/3605573.3605574
ICPP Conference Proceedings · Research article

Conflux: Exploiting Persistent Memory and RDMA Bandwidth via Adaptive I/O Mode Selection

Published: 13 September 2023

Abstract

Persistent Memory (PM) and Remote Direct Memory Access (RDMA) technologies have significantly improved storage and network performance in data centers and have spawned a slew of distributed file system (DFS) designs. Existing DFSs often treat remote storage as a performance constraint, assuming it delivers lower bandwidth and higher latency than local storage devices. However, advances in RDMA technology provide an opportunity to bridge the performance gap between local and remote access, enabling DFSs to leverage both local and remote PM bandwidth and achieve higher overall throughput.
We propose Conflux, a new DFS architecture that exploits the aggregated bandwidth of PM and RDMA networks. Conflux dynamically steers I/O requests to local and remote PM to fully utilize PM and RDMA bandwidth under heavy workloads. To adaptively select the I/O path at run time, we propose SEED, a learning-based policy engine that predicts Conflux I/O latency and makes decisions in real time. Furthermore, Conflux adopts a fine-grained concurrency control approach to improve its scalability. Experimental results show that Conflux achieves up to 4.7× higher throughput than existing DFSs on multi-threaded workloads.
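The abstract describes steering each I/O request to local or remote PM based on predicted latency. The paper's actual SEED model and feature set are not given here; the following is a minimal illustrative sketch, assuming a hand-tuned linear cost model (fixed path cost plus transfer time scaled by queue depth as a crude congestion signal). All names (`PathStats`, `choose_path`) and constants are hypothetical.

```python
# Hypothetical sketch of adaptive I/O mode selection in the style of
# Conflux/SEED. The real system uses a learned latency predictor; this
# stand-in uses a simple analytic estimate for illustration only.
from dataclasses import dataclass


@dataclass
class PathStats:
    queue_depth: int        # outstanding I/Os on this path
    bytes_per_us: float     # observed bandwidth (bytes per microsecond)
    base_latency_us: float  # fixed cost (e.g., RDMA round trip for remote PM)


def predict_latency_us(stats: PathStats, io_size: int) -> float:
    """Estimate completion latency: fixed path cost plus transfer time,
    inflated by the number of I/Os already queued ahead of this one."""
    transfer = io_size / stats.bytes_per_us
    return stats.base_latency_us + transfer * (1 + stats.queue_depth)


def choose_path(local: PathStats, remote: PathStats, io_size: int) -> str:
    """Steer the request to whichever PM path predicts lower latency."""
    if predict_latency_us(local, io_size) <= predict_latency_us(remote, io_size):
        return "local"
    return "remote"


if __name__ == "__main__":
    # An idle local PM path wins for a small write...
    local = PathStats(queue_depth=0, bytes_per_us=8000.0, base_latency_us=0.3)
    remote = PathStats(queue_depth=0, bytes_per_us=6000.0, base_latency_us=2.0)
    print(choose_path(local, remote, 4096))   # local

    # ...but once the local path saturates, traffic spills to remote PM,
    # so both PM devices' bandwidth is exercised under heavy load.
    local.queue_depth = 16
    print(choose_path(local, remote, 4096))   # remote
```

The point of the sketch is only the decision structure: under light load the lower fixed cost of local PM dominates, while under contention the predictor shifts requests onto the RDMA path, aggregating local and remote PM bandwidth.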


Published In

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
August 2023, 858 pages
ISBN: 9798400708435
DOI: 10.1145/3605573
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. RDMA
  2. machine learning
  3. persistent memory

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2023
ICPP 2023: 52nd International Conference on Parallel Processing
August 7 - 10, 2023
Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 128
  • Downloads (last 12 months): 45
  • Downloads (last 6 weeks): 5

Reflects downloads up to 01 Mar 2025
