research-article

ReQoS: reactive static/dynamic compilation for QoS in warehouse scale computers

Authors:

Mary Lou Soffa

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

Pages 89 - 100

https://doi.org/10.1145/2451116.2451126

Published: 16 March 2013

Abstract

As multicore processors with expanding core counts continue to dominate the server market, the overall utilization of the class of datacenters known as warehouse scale computers (WSCs) depends heavily on co-locating multiple workloads on each server to exploit the computational power of modern processors. However, many of the applications running in WSCs, such as web search, are user-facing and have quality of service (QoS) requirements. When multiple applications are co-located on a multicore machine, contention for shared memory resources can cause severe cross-core performance interference that threatens application QoS. WSC operators are left with two options: either disregard QoS to maximize WSC utilization, or disallow the co-location of high-priority user-facing applications with other applications, resulting in low machine utilization and millions of dollars wasted.

This paper presents ReQoS, a static/dynamic compilation approach that enables low-priority applications to adaptively manipulate their own contentiousness to ensure the QoS of high-priority co-runners. ReQoS is composed of two parts: a profile-guided compilation technique that identifies contentious code regions in low-priority applications and inserts markers in them, and a lightweight runtime that monitors the QoS of high-priority applications and, when cross-core interference is detected, reactively reduces the pressure that low-priority applications place on the memory subsystem. In this work, we show that ReQoS can accurately diagnose contention and significantly reduce performance interference to ensure application QoS. Applying ReQoS to SPEC2006 and SmashBench workloads on real multicore machines, we improve machine utilization by more than 70% in many cases, and by more than 50% on average, while enforcing a 90% QoS threshold. We also improve the energy efficiency of modern multicore machines by 47% on average over a policy of disallowing co-locations.
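To make the two components above more concrete, here is a minimal Python sketch under stated assumptions; it is not the ReQoS implementation, and every name in it is hypothetical. Assumed: contentiousness is approximated by profiled last-level-cache misses per thousand instructions (MPKI) with an arbitrary cutoff of 10; QoS is the high-priority application's co-located performance divided by its solo performance; and the runtime reduces pressure by making the low-priority application nap for a tunable fraction of the time it spends in marked regions, raising that fraction by a fixed step while QoS is below the threshold and lowering it once QoS recovers. Only the 90% QoS threshold comes from the abstract.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class RegionProfile:
    """Hypothetical profile record for one code region of a low-priority app."""
    name: str
    llc_misses: int
    instructions: int

    @property
    def mpki(self) -> float:
        # Last-level-cache misses per thousand instructions: an assumed contentiousness proxy.
        return 1000.0 * self.llc_misses / self.instructions


def select_contentious_regions(profiles, mpki_cutoff=10.0):
    """Static step (sketch): names of regions whose profiled MPKI exceeds an assumed cutoff.
    A compiler pass would insert markers around these regions."""
    return sorted(p.name for p in profiles if p.mpki > mpki_cutoff)


def qos(colocated_perf: float, solo_perf: float) -> float:
    """Assumed QoS metric: fraction of solo performance kept under co-location."""
    return colocated_perf / solo_perf


class ReactiveThrottle:
    """Dynamic step (sketch): tune the fraction of time a marked region spends napping.

    The nap fraction grows while the high-priority co-runner misses the QoS
    threshold and shrinks once the threshold is met again, returning speed to
    the low-priority app when interference is no longer observed."""

    def __init__(self, threshold=0.90, step=0.1, max_nap=0.9):
        self.threshold = threshold
        self.step = step
        self.max_nap = max_nap  # kept below 1.0 so the region always makes progress
        self.nap_fraction = 0.0

    def update(self, measured_qos: float) -> float:
        if measured_qos < self.threshold:
            self.nap_fraction = min(self.max_nap, self.nap_fraction + self.step)
        else:
            self.nap_fraction = max(0.0, self.nap_fraction - self.step)
        return self.nap_fraction


def nap_duration(work_seconds: float, nap_fraction: float) -> float:
    """Sleep time s such that s / (work_seconds + s) == nap_fraction."""
    return nap_fraction * work_seconds / (1.0 - nap_fraction)


@contextmanager
def marked_region(throttle: ReactiveThrottle):
    """Hypothetical stand-in for a compiler-inserted marker pair around a contentious region."""
    start = time.perf_counter()
    yield
    worked = time.perf_counter() - start
    time.sleep(nap_duration(worked, throttle.nap_fraction))


if __name__ == "__main__":
    profiles = [
        RegionProfile("stream_copy", llc_misses=50_000, instructions=1_000_000),  # MPKI 50
        RegionProfile("compute_kernel", llc_misses=1_000, instructions=1_000_000),  # MPKI 1
    ]
    print(select_contentious_regions(profiles))  # ['stream_copy']

    throttle = ReactiveThrottle()
    for sample in (0.80, 0.85, 0.92, 0.95):  # QoS samples of the high-priority app
        print(round(throttle.update(sample), 2))  # prints 0.1, 0.2, 0.1, 0.0 on separate lines
    with marked_region(throttle):
        sum(range(100_000))  # placeholder for the contentious work
```

The additive step keeps the controller simple: the low-priority application backs off only while interference is observed and speeds back up afterwards. A real runtime would presumably sample QoS periodically (for example, from hardware performance counters) rather than receive it as an argument.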


Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Runtime environments
      2. Source code generation

Information

Published In

ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems

March 2013

574 pages

ISBN: 9781450318709

DOI: 10.1145/2451116

General Chair: Vivek Sarkar, Rice University, USA
Program Chair: Rastislav Bodik, University of California, Berkeley, USA

This paper also appeared in:

ACM SIGARCH Computer Architecture News, Volume 41, Issue 1 (ASPLOS '13), March 2013, 540 pages. ISSN: 0163-5964. DOI: 10.1145/2490301

ACM SIGPLAN Notices, Volume 48, Issue 4 (ASPLOS '13), April 2013, 540 pages. ISSN: 0362-1340. EISSN: 1558-1160. DOI: 10.1145/2499368

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ASPLOS '13: Architectural Support for Programming Languages and Operating Systems

March 16-20, 2013

Houston, Texas, USA

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Bibliometrics

Total citations: 55
Total downloads: 675
Downloads (last 12 months): 25
Downloads (last 6 weeks): 0

Reflects downloads up to 20 Jan 2025.

Cited By

H. Xu, S. Song, and Z. Mao. Characterizing the Performance of Emerging Deep Learning, Graph, and High Performance Computing Workloads Under Interference. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 468-477, 2024 (online 27 May 2024). https://doi.org/10.1109/IPDPSW63119.2024.00098

H. Moradi, W. Wang, and D. Zhu. Online Performance Modeling and Prediction for Single-VM Applications in Multi-Tenant Clouds. IEEE Transactions on Cloud Computing, 11(1):97-110, 2023 (online 1 Jan 2023). https://doi.org/10.1109/TCC.2021.3078690

S. Kim, H. Genc, V. Nikiforov, K. Asanović, B. Nikolić, and Y. Shao. MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 828-841, 2023 (online Feb 2023). https://doi.org/10.1109/HPCA56546.2023.10071035

I. Kim, J. Hwang, W. Wang, and M. Humphrey. Guaranteeing Performance SLAs of Cloud Applications Under Resource Storms. IEEE Transactions on Cloud Computing, 10(2):1329-1343, 2022 (online 1 Apr 2022). https://doi.org/10.1109/TCC.2020.2985372

S. Chen, Y. Jiang, C. Delimitrou, and J. Martinez. PIMCloud: QoS-Aware Resource Management of Latency-Critical Applications in Clouds with Processing-in-Memory. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1086-1099, 2022 (online Apr 2022). https://doi.org/10.1109/HPCA53966.2022.00083

H. Moradi, W. Wang, A. Fernandez, and D. Zhu. uPredict: A User-Level Profiler-Based Predictive Framework in Multi-Tenant Clouds. 2020 IEEE International Conference on Cloud Engineering (IC2E), pages 73-82, 2020 (online Apr 2020). https://doi.org/10.1109/IC2E48712.2020.00015

H. Moradi, W. Wang, and D. Zhu. DiHi: Distributed and Hierarchical Performance Modeling of Multi-VM Cloud Running Applications. 2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 1-10, 2020 (online Dec 2020). https://doi.org/10.1109/HPCC-SmartCity-DSS50907.2020.00002

T. Patel and D. Tiwari. CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 193-206, 2020 (online Feb 2020). https://doi.org/10.1109/HPCA47549.2020.00025

K. Nikas, N. Papadopoulou, D. Giantsidi, V. Karakostas, G. Goumas, and N. Koziris. DICER. Proceedings of the 48th International Conference on Parallel Processing, pages 1-10, 2019 (online 5 Aug 2019). https://dl.acm.org/doi/10.1145/3337821.3337891

R. Kannan, M. Laurenzano, J. Ahn, J. Mars, and L. Tang. Caliper. ACM Transactions on Architecture and Code Optimization, 16(3):1-25, 2019 (online 17 Jun 2019). https://dl.acm.org/doi/10.1145/3323090
