ABSTRACT
Disaggregated memory can address resource provisioning inefficiencies in current datacenters. Multiple software runtimes for disaggregated memory have been proposed in an attempt to make disaggregated memory practical. These systems rely on the virtual memory subsystem to transparently offer disaggregated memory to applications using a local memory abstraction. Unfortunately, using virtual memory for disaggregation has multiple limitations, including high overhead that comes from the use of page faults to identify what data to fetch and cache locally, and high dirty data amplification that comes from tracking changes to the cached data at page granularity (4KB or larger).
In this paper, we propose a fundamentally new approach to designing software runtimes for disaggregated memory that addresses these limitations. Our main observation is that we can use cache coherence instead of virtual memory for tracking applications' memory accesses transparently, at cache-line granularity. This simple idea (1) eliminates page faults from the application critical path when accessing remote data, and (2) decouples the application memory access tracking from the virtual memory page size, enabling cache-line granularity dirty data tracking and eviction. Using this observation, we implemented a new software runtime for disaggregated memory that improves average memory access time by 1.7-5X and reduces dirty data amplification by 2-10X, compared to state-of-the-art systems.
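To make the granularity argument concrete, the following is a minimal sketch, not the runtime described above. It assumes that dirty data amplification means the bytes written back divided by the bytes the application actually modified, and it assumes a 64-byte cache line and a 4KB page. The workload and all function names are hypothetical, so the resulting ratios illustrate the mechanism only and are not the measured 2-10X.

```python
PAGE_SIZE = 4096        # assumed page size in bytes
CACHE_LINE_SIZE = 64    # assumed cache-line size in bytes


def bytes_written_back(dirty_addresses, granularity):
    """Bytes evicted when every touched block of `granularity` bytes is written back whole."""
    dirty_blocks = {addr // granularity for addr in dirty_addresses}
    return len(dirty_blocks) * granularity


def amplification(dirty_addresses, granularity):
    """Ratio of bytes written back to bytes actually modified (assumed definition)."""
    modified_bytes = len(set(dirty_addresses))
    return bytes_written_back(dirty_addresses, granularity) / modified_bytes


# Hypothetical workload: one 8-byte store at the start of each of 16 different pages.
writes = [page * PAGE_SIZE + offset for page in range(16) for offset in range(8)]

page_amp = amplification(writes, PAGE_SIZE)        # tracking at page granularity
line_amp = amplification(writes, CACHE_LINE_SIZE)  # tracking at cache-line granularity

print(f"page-granularity amplification:       {page_amp:.0f}x")
print(f"cache-line-granularity amplification: {line_amp:.0f}x")
```

In this sparse-write case, page-granularity tracking writes back 4KB for every 8 modified bytes, while cache-line tracking writes back only 64 bytes. Denser write patterns narrow the gap.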