Tiresias: Optimizing NUMA Performance with CXL Memory and Locality-Aware Process Scheduling

Authors:

Jie Wu

ACM-TURC '24: Proceedings of the ACM Turing Award Celebration Conference - China 2024

Pages 6 - 11

https://doi.org/10.1145/3674399.3674411

Published: 30 July 2024

Abstract

The growing demand for memory systems with larger capacities and faster data transfer speeds has driven the widespread adoption of multi-socket machines and of memory expansion through Compute Express Link (CXL). However, processes running on such multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Despite years of prior work on data allocation and placement strategies for NUMA environments, these strategies still fall short because of the semantic gap between process scheduling and memory access patterns: the process scheduler has limited knowledge of the memory access latency of the processes it runs. In fact, memory access latency is influenced not only by the distance between NUMA nodes but also by memory bandwidth pressure, especially when workloads are co-located. We propose Tiresias, a feedback-based controller that mitigates NUMA effects on data access latency by transparently employing memory locality-aware process scheduling and provisioning differentiated memory bandwidth allocations with assistance from CXL memory. Tiresias combines multiple resource optimization techniques: (1) workload-aware, software-based memory bandwidth management, (2) a memory page migration strategy that alleviates memory bandwidth contention by leveraging CXL memory, and (3) locality-aware process scheduling based on page-table self-replication (PTSR). To evaluate the impact of Tiresias on performance, we conduct an analysis that focuses on the temporal and spatial correlation of memory access patterns.
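The abstract's observation that access latency depends on bandwidth pressure as well as NUMA distance can be illustrated with a toy model. The sketch below is not the Tiresias implementation: the `Node` fields, the latency and bandwidth figures, the queueing-style `1 / (1 - utilization)` inflation, and the `feedback_step` policy are all assumptions chosen only to show how a heavily loaded local node can end up slower than a lightly loaded remote one, and how a feedback loop might shift demand in response.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    base_latency_ns: float  # unloaded latency seen by the process (hypothetical)
    bandwidth_gbps: float   # peak bandwidth of the node (hypothetical)
    demand_gbps: float      # bandwidth requested by co-located workloads


def loaded_latency_ns(node: Node) -> float:
    """Toy model: latency inflates as utilization approaches 1."""
    utilization = min(node.demand_gbps / node.bandwidth_gbps, 0.95)
    return node.base_latency_ns / (1.0 - utilization)


def best_node(nodes):
    """Pick the node with the lowest loaded latency, not the shortest distance."""
    return min(nodes, key=loaded_latency_ns)


def feedback_step(nodes, target_ns, step_gbps=10.0):
    """One controller iteration (assumed policy): if the slowest node misses
    the latency target, move up to step_gbps of demand to the fastest node."""
    worst = max(nodes, key=loaded_latency_ns)
    if loaded_latency_ns(worst) <= target_ns:
        return False
    best = min(nodes, key=loaded_latency_ns)
    moved = min(step_gbps, worst.demand_gbps)
    worst.demand_gbps -= moved
    best.demand_gbps += moved
    return True


# Illustrative numbers only; they are not measurements from the paper.
nodes = [
    Node("local DRAM", base_latency_ns=80.0, bandwidth_gbps=100.0, demand_gbps=90.0),
    Node("remote DRAM", base_latency_ns=140.0, bandwidth_gbps=100.0, demand_gbps=20.0),
    Node("CXL memory", base_latency_ns=250.0, bandwidth_gbps=50.0, demand_gbps=5.0),
]

for n in nodes:
    print(f"{n.name}: {loaded_latency_ns(n):.1f} ns")
print("preferred node:", best_node(nodes).name)
```

With these made-up inputs the saturated local node has the highest loaded latency despite being the nearest, which is the kind of case a scheduler that only considers distance would miss.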

Index Terms

Tiresias: Optimizing NUMA Performance with CXL Memory and Locality-Aware Process Scheduling
1. Information systems
  1. Information storage systems
    1. Storage management
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Allocation / deallocation strategies
        Main memory
        Process management
        Scheduling

Index terms have been assigned to the content through auto-classification.

Published In

ACM-TURC '24: Proceedings of the ACM Turing Award Celebration Conference - China 2024

July 2024

261 pages

ISBN: 9798400710117

DOI: 10.1145/3674399

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ACM-TURC '24

ACM-TURC '24: ACM Turing Award Celebration Conference 2024

July 5 - 7, 2024

Changsha, China
