research-article

A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness

Authors:

David A. Patterson,

Krste AsanovicAuthors Info & Claims

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

Pages 308 - 319

https://doi.org/10.1145/2485922.2485949

Published: 23 June 2013 Publication History

Abstract

Computing workloads often contain a mix of interactive, latency-sensitive foreground applications and recurring background computations. To guarantee responsiveness, interactive and batch applications are often run on disjoint sets of resources, but this incurs additional energy, power, and capital costs. In this paper, we evaluate the potential of hardware cache partitioning mechanisms and policies to improve efficiency by allowing background applications to run simultaneously with interactive foreground applications, while avoiding degradation in interactive responsiveness. We evaluate these tradeoffs using commercial x86 multicore hardware that supports cache partitioning, and find that real hardware measurements with full applications provide different observations than past simulation-based evaluations. Co-scheduling applications without LLC partitioning leads to a 10% energy improvement and average throughput improvement of 54% compared to running tasks separately, but can result in foreground performance degradation of up to 34% with an average of 6%. With optimal static LLC partitioning, the average energy improvement increases to 12% and the average throughput improvement to 60%, while the worst case slowdown is reduced noticeably to 7% with an average slowdown of only 2%. We also evaluate a practical low-overhead dynamic algorithm to control partition sizes, and are able to realize the potential performance guarantees of the optimal static approach, while increasing background throughput by an additional 19%.

References

[1]

Apple Inc. iOS App Programming Guide. http://developer.apple.com/library/ios/DOCUMENTATION/iPhone/Conceptual/iPhoneOSProgrammingGuide/iPhoneAppProgrammingGuide.pdf.

[2]

L. A. Barroso and U. Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2009.

[3]

S. Beamer, K. Asanovic, and D. A. Patterson. Searching for a parent instead of fighting over children: A fast breadth-first search implementation for graph500. Technical Report UCB/EECS-2011-117, EECS Department, University of California, Berkeley, Nov 2011.

[4]

C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.

Digital Library

[5]

S. Bird, B. Smith, K. Asanović, and D. A. Patterson. PACORA: Dynamically Optimizing Resource Allocations for Interactive Applications. Technical report, University of California, Berkeley, April 2013.

[6]

S. M. Blackburn et al. The DaCapo benchmarks: Java benchmarking development and analysis. In OOPSLA, pages 169--190, 2006.

Digital Library

[7]

F. J. Cazorla, P. M. W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero. Predictable Performance in SMT Processors: Synergy between the OS and SMTs. IEEE Trans. Computers, 55(7):785--799, 2006.

Digital Library

[8]

D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA, pages 340--351, 2005.

Digital Library

[9]

S. Cho and L. Jin. Managing distributed, shared l2 caches through os-level page allocation. In MICRO, pages 455--468, 2006.

Digital Library

[10]

J. Chong, G. Friedland, A. Janin, N. Morgan, and C. Oei. Opportunities and challenges of parallelizing speech recognition. In HotPar, 2010.

Digital Library

[11]

S. Eranian. Perfmon2: a flexible performance monitoring interface for linux. In Ottawa Linux Symposium, pages 269--288, 2006.

[12]

H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley. Looking back and looking forward: power, performance, and upheaval. Commun. ACM, 55(7):105--114, July 2012.

Digital Library

[13]

A. Fedorova, S. Blagodurov, and S. Zhuravlev. Managing contention for shared resources on multicore processors. Commun. ACM, 53(2):49--57, 2010.

Digital Library

[14]

M. Ferdman, A. Adileh, Y. O. Koçberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In ASPLOS, pages 37--48, 2012.

Digital Library

[15]

L. Gidra, G. Thomas, J. Sopena, and M. Shapiro. Assessing the scalability of garbage collectors on many cores. In PLOS, pages 1--5, 2011.

Digital Library

[16]

F. Guo, Y. Solihin, L. Zhao, and R. Iyer. A framework for providing quality of service in chip multi-processors. In MICRO, 2007.

Digital Library

[17]

J. L. Hennessy and D. A. Patterson. Computer Architecture - A Quantitative Approach (5. ed.). Morgan Kaufmann, 2012.

Digital Library

[18]

Intel Corp. Intel 64 and ia-32 architectures optimization reference manual, June 2011.

[19]

Intel Corp. Intel 64 and ia-32 architectures software developer's manual, March 2012.

[20]

R. R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. R. Hsu, and S. K. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In SIGMETRICS, pages 25--36, 2007.

Digital Library

[21]

A. Jaleel. Memory characterization of workloads using instrumentation-driven simulation -- a pin-based memory characterization of the spec cpu2000 and spec cpu2006 benchmark suites. Technical report, VSSAD, Intel Corporation, 2007.

[22]

S. Kamil. Stencil probe, 2012. http://www.cs.berkeley.edu/~skamil/projects/stencilprobe/.

[23]

J. W. Lee, M. C. Ng, and K. Asanovic. Globally-synchronized frames for guaranteed quality-of-service in on-chip networks. In ISCA, pages 89--100, 2008.

Digital Library

[24]

J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In HPCA, pages 367--378, feb. 2008.

[25]

L. A. Meyerovich, M. E. Torok, E. Atkinson, and R. Bodik. Parallel schedule synthesis for attribute grammars. In PPoPP, 2013.

Digital Library

[26]

M. Moreto, F. J. Cazorla, A. Ramirez, R. Sakellariou, and M. Valero. FlexDCP: a QoS framework for CMP architectures. SIGOPS Oper. Syst. Rev., 43(2):86--96, 2009.

Digital Library

[27]

Perfmon2 webpage. perfmon2.sourceforge.net/.

[28]

A. Phansalkar, A. Joshi, and L. K. John. Analysis of redundancy and application balance in the spec cpu2006 benchmark suite. In ISCA, pages 412--423, 2007.

Digital Library

[29]

M. K. Qureshi and Y. N. Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. In MICRO, pages 423--432, 2006.

Digital Library

[30]

D. Sanchez and C. Kozyrakis. Vantage: Scalable and Efficient Fine-Grain Cache Partitioning. In ISCA), June 2011.

Digital Library

[31]

E. Schurman and J. Brutlag. The user and business impact of server delays, additional bytes, and http chunking in web search. In Velocity, 2009.

[32]

Standard Performance Evaluation Corporation. SPEC CPU 2006 benchmark suite. http://www.spec.org.

[33]

G. E. Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. In HPCA, pages 117--128, 2002.

Digital Library

[34]

D. Tam, R. Azimi, L. Soares, and M. Stumm. Managing shared l2 caches on multicore systems in software. In WIOSCA, 2007.

[35]

D. K. Tam, R. Azimi, L. Soares, and M. Stumm. Rapidmrc: approximating l2 miss rate curves on commodity systems for online optimizations. In ASPLOS, pages 121--132, 2009.

Digital Library

[36]

L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The impact of memory subsystem resource sharing on datacenter applications. In ISCA, pages 283--294, 2011.

Digital Library

[37]

C.-J. Wu and M. Martonosi. Characterization and dynamic mitigation of intra-application cache interference. In ISPASS, pages 2--11, 2011.

Digital Library

[38]

Y. Xie and G. H. Loh. Scalable shared-cache management by containing thrashing workloads. In HiPEAC, pages 262--276, 2010.

Digital Library

[39]

E. Z. Zhang, Y. Jiang, and X. Shen. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In PPoPP, pages 203--212, 2010.

Digital Library

Cited By

Xu HSong SMao Z(2024)Characterizing the Performance of Emerging Deep Learning, Graph, and High Performance Computing Workloads Under Interference2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW63119.2024.00098(468-477)Online publication date: 27-May-2024
https://doi.org/10.1109/IPDPSW63119.2024.00098
Xiao JXiang YWang XLuo YPimentel AWang ZGallivan KNikolopoulos DBeivide RGallopoulos E(2023)FLORIA: A Fast and Featherlight Approach for Predicting Cache PerformanceProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593740(25-36)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593740
Peeck JHapka RErnst R(2023)Efficient hard real-time implementation of CNNs on multi-core architectures2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC)10.1109/COMPSAC57700.2023.00020(79-90)Online publication date: Jun-2023
https://doi.org/10.1109/COMPSAC57700.2023.00020
Show More Cited By

Recommendations

A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness
ICSA '13

Computing workloads often contain a mix of interactive, latency-sensitive foreground applications and recurring background computations. To guarantee responsiveness, interactive and batch applications are often run on disjoint sets of resources, but ...
Hardware techniques to improve cache efficiency
IPC-Based Cache Partitioning: An IPC-Oriented Dynamic Shared Cache Partitioning Mechanism
ICHIT '08: Proceedings of the 2008 International Conference on Convergence and Hybrid Information Technology

In a chip-multiprocessor with a shared cache structure, the last level cache is shared by multiple applications executing simultaneously. The competing accesses from different applications degrade the system performance, resulting in non-predicting ...

Comments

Information & Contributors

Information

Published In

cover image ACM Other conferences

ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture

June 2013

686 pages

ISBN:9781450320795

DOI:10.1145/2485922

General Chair:
Avi Mendelson
Technion

ACM SIGARCH Computer Architecture News Volume 41, Issue 3
ICSA '13
June 2013
666 pages
ISSN:0163-5964
DOI:10.1145/2508148
Issue’s Table of Contents

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

IEEE CS

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 June 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article

Funding Sources

Nvidia
Intel Corporation
Agència de Gestió d'Ajuts Universitaris i de Recerca
Mountain Equipment Co-op
University of California
Samsung
Nokia
Microsoft
Oracle
Ministerio de Economía y Competitividad
MEC/Fulbright Fellowship

Conference

ISCA'13

Sponsor:

ISCA'13: The 40th Annual International Symposium on Computer Architecture

June 23 - 27, 2013

Tel-Aviv, Israel

Acceptance Rates

ISCA '13 Paper Acceptance Rate 56 of 288 submissions, 19%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

105
Total Citations
View Citations
1,235
Total Downloads

Downloads (Last 12 months)57
Downloads (Last 6 weeks)5

Reflects downloads up to 15 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Xu HSong SMao Z(2024)Characterizing the Performance of Emerging Deep Learning, Graph, and High Performance Computing Workloads Under Interference2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW63119.2024.00098(468-477)Online publication date: 27-May-2024
https://doi.org/10.1109/IPDPSW63119.2024.00098
Xiao JXiang YWang XLuo YPimentel AWang ZGallivan KNikolopoulos DBeivide RGallopoulos E(2023)FLORIA: A Fast and Featherlight Approach for Predicting Cache PerformanceProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593740(25-36)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593740
Peeck JHapka RErnst R(2023)Efficient hard real-time implementation of CNNs on multi-core architectures2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC)10.1109/COMPSAC57700.2023.00020(79-90)Online publication date: Jun-2023
https://doi.org/10.1109/COMPSAC57700.2023.00020
Shen YXiao JPimentel AGrosser TLee K(2022)TCPS: a task and cache-aware partitioned scheduler for hard real-time multi-core systemsProceedings of the 23rd ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems10.1145/3519941.3535067(37-49)Online publication date: 14-Jun-2022
https://dl.acm.org/doi/10.1145/3519941.3535067
Li ZSen TShen HChuah M(2022)A Study on the Impact of Memory DoS Attacks on Cloud Applications and Exploring Real-Time Detection SchemesIEEE/ACM Transactions on Networking10.1109/TNET.2022.314489530:4(1644-1658)Online publication date: Aug-2022
https://doi.org/10.1109/TNET.2022.3144895
Heo TWang YCui WHuh JZhang L(2022)Adaptive Page Migration Policy With Huge Pages in Tiered Memory SystemsIEEE Transactions on Computers10.1109/TC.2020.303668671:1(53-68)Online publication date: 1-Jan-2022
https://doi.org/10.1109/TC.2020.3036686
Song WDelimitrou CShen ZRenesse RWeatherspoon HBenmohamed LVaulx FMahmoudi C(2021)CacheInspectorACM Transactions on Architecture and Code Optimization10.1145/345737318:3(1-25)Online publication date: 8-Jun-2021
https://dl.acm.org/doi/10.1145/3457373
Saez JCastro FFanizzi GPrieto-Matias M(2021)LFOC+: A Fair OS-level Cache-Clustering Policy for Commodity Multicore SystemsIEEE Transactions on Computers10.1109/TC.2021.3112970(1-1)Online publication date: 2021
https://doi.org/10.1109/TC.2021.3112970
Zhang YChen JJiang XLiu QSteiner IHerdrich AShu KDas RCui LJiang L(2021)LIBRA: Clearing the Cloud Through Dynamic Memory Bandwidth Management2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA51647.2021.00073(815-826)Online publication date: Feb-2021
https://doi.org/10.1109/HPCA51647.2021.00073
Yim YJang HJin H(2021)QoS for best‐effort batch jobs in container‐based cloudConcurrency and Computation: Practice and Experience10.1002/cpe.642235:15Online publication date: 31-May-2021
https://doi.org/10.1002/cpe.6422
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten