ABSTRACT
Latency-critical applications interact directly with end users and often experience a diurnal load pattern. In production, best-effort applications are often co-located with them to use the cores that sit idle at low load. Meanwhile, modern computers are evolving towards heterogeneous NUMA architectures, in which cores differ in computation ability, memory access latency, and network communication delay. Prior co-location scheduling work did not consider the NUMA architecture and failed to maximize the throughput of best-effort applications while ensuring the required QoS of latency-critical applications. Our investigation shows that the NUMA effect has complex impacts on both the latency of latency-critical applications and the throughput of best-effort applications. We therefore propose PAC, a preference-aware co-location scheduling scheme that accounts for the NUMA effect on heterogeneous NUMA architectures. PAC comprises a performance monitor and a core scheduler. The performance monitor identifies the "dangerous" latency-critical applications that require upgraded core allocations. For the scheduler, we propose two low-overhead scheduling strategies that identify the bottlenecks of applications and adjust core allocations accordingly. Experimental results show that PAC improves the throughput of best-effort applications by 3.87× while ensuring the required QoS of latency-critical applications.
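The monitor/scheduler split described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not PAC's actual algorithm: the `App` record, the `DANGER_RATIO` threshold, and the functions `find_dangerous`, `upgrade`, and `schedule_step` are all hypothetical names introduced here, and the sketch models neither NUMA distance nor the bottleneck identification that the two scheduling strategies perform.

```python
from dataclasses import dataclass, field


@dataclass
class App:
    name: str
    cores: set = field(default_factory=set)  # core ids currently assigned
    qos_target_ms: float = 0.0               # meaningful for latency-critical apps only
    tail_latency_ms: float = 0.0             # latest measured tail latency


# Hypothetical threshold: flag an app once its tail latency exceeds this
# fraction of its QoS target. The value is illustrative, not from the paper.
DANGER_RATIO = 0.9


def find_dangerous(lc_apps):
    """Monitor role: return latency-critical apps close to violating QoS."""
    return [a for a in lc_apps
            if a.tail_latency_ms > DANGER_RATIO * a.qos_target_ms]


def upgrade(lc_app, be_app):
    """Scheduler role: move one core from a best-effort app to an LC app.

    Returns the moved core id, or None if the best-effort app has no cores.
    """
    if not be_app.cores:
        return None
    # Deterministic pick for the sketch; a NUMA-aware policy would instead
    # weigh which core suits the application best.
    core = min(be_app.cores)
    be_app.cores.remove(core)
    lc_app.cores.add(core)
    return core


def schedule_step(lc_apps, be_app):
    """One monitoring interval: give each dangerous LC app one extra core."""
    moved = {}
    for app in find_dangerous(lc_apps):
        core = upgrade(app, be_app)
        if core is not None:
            moved[app.name] = core
    return moved
```

In this sketch, calling `schedule_step` once per monitoring interval reclaims cores from the best-effort application only when a latency-critical application is flagged, which mirrors the abstract's goal of keeping best-effort throughput high while QoS holds.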
Index Terms
- PAC: Preference-Aware Co-location Scheduling on Heterogeneous NUMA Architectures To Improve Resource Utilization