Contention aware execution: online contention detection and response

Authors:
Jason Mars, University of Virginia, Charlottesville, VA, USA
Neil Vachharajani, Google, Mountain View, CA, USA
Robert Hundt, Google, Mountain View, CA, USA
Mary Lou Soffa, University of Virginia, Charlottesville, VA, USA

CGO '10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
April 2010
Pages 257–265
https://doi.org/10.1145/1772954.1772991

Published: 24 April 2010

ABSTRACT

Cross-core application interference due to contention for shared on-chip and off-chip resources poses a significant challenge to providing application-level quality of service (QoS) guarantees on commodity multicore micro-architectures. Unexpected cross-core interference is especially problematic for latency-sensitive applications in web service data center domains, such as web search. The commonly used solution is simply to disallow the co-location of latency-sensitive applications and throughput-oriented batch applications on a single chip, leaving much of the processing capability of multicore micro-architectures underutilized. In this work, we present the Contention Aware Execution Runtime (CAER) environment, a lightweight runtime solution that minimizes cross-core interference due to contention while maximizing utilization. CAER leverages the ubiquitous performance monitoring capabilities present in current multicore processors to infer and respond to contention, and requires no added hardware support. We present the design and implementation of the CAER environment, two separate contention detection heuristics, and approaches to respond to contention online. We evaluate our solution using the SPEC2006 benchmark suite. Our experiments show that allowing co-location with CAER, as opposed to disallowing co-location, increases the utilization of the multicore CPU by 58% on average. Meanwhile, CAER reduces the overhead due to allowing co-location from 17% to just 4% on average.
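
The detect-and-respond loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify CAER's two detection heuristics or its response mechanisms, so everything below is an assumption made for illustration: the threshold rule, the use of last-level-cache (LLC) miss counts as the monitored signal, the fixed pause length, and the names `ThresholdDetector` and `run_periods` are all hypothetical and are not taken from the paper. Counter readings are replayed from a list rather than read from hardware.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ThresholdDetector:
    """Hypothetical heuristic (not from the paper): report contention when
    both co-located applications show heavy LLC miss activity in the same
    sampling period. The default thresholds are illustrative values only."""
    latency_app_threshold: int = 1000
    batch_app_threshold: int = 1000

    def contended(self, latency_misses: int, batch_misses: int) -> bool:
        return (latency_misses >= self.latency_app_threshold
                and batch_misses >= self.batch_app_threshold)


def run_periods(samples: Iterable[Tuple[int, int]],
                detector: ThresholdDetector,
                pause_periods: int = 2) -> List[str]:
    """Replay per-period (latency_misses, batch_misses) samples and return
    the action applied to the batch application in each period.

    Assumed response policy: on detection, throttle the batch application
    and keep it paused for the next `pause_periods` periods.
    """
    actions: List[str] = []
    paused_left = 0
    for latency_misses, batch_misses in samples:
        if paused_left > 0:
            # Still serving an earlier throttle decision.
            paused_left -= 1
            actions.append("paused")
        elif detector.contended(latency_misses, batch_misses):
            # Contention inferred: start a new pause window.
            paused_left = pause_periods
            actions.append("throttle")
        else:
            actions.append("run")
    return actions


if __name__ == "__main__":
    # Made-up counter samples, one tuple per sampling period.
    trace = [(100, 200), (1500, 2000), (1500, 0), (1400, 0), (300, 1200)]
    print(run_periods(trace, ThresholdDetector()))
```

In a real runtime the samples would come from hardware performance counters rather than a fixed trace, and the response would act on the batch process itself; both parts are left out here because the abstract does not describe them.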

Published in
CGO '10: Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
April 2010
300 pages
ISBN: 9781605586359
DOI: 10.1145/1772954
General Chairs: Andreas Moshovos (University of Toronto), Greg Steffan (University of Toronto)
Program Chairs: Kim Hazelwood (University of Virginia), David Kaeli (Northeastern University)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
contention
cross-core interference
dynamic techniques
execution runtimes
multicore
online adaptation
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 312 of 1,061 submissions, 29%