
Fast and fair: data-stream quality of service

Published: 24 September 2005

Abstract

Chip multiprocessors have the potential to exploit thread-level parallelism, particularly in embedded server farms where the number of available threads can be quite high. Unfortunately, both per-core and overall throughput are significantly impacted by the organization of the lowest-level on-chip cache. On-chip caches for CMPs must handle the increased demand and contention of multiple cores. To complicate the problem, cache demand changes dynamically with phase changes, context switches, power-saving features, and assignments to asymmetric cores.

We propose PDAS, a distributed NUCA L2 cache design with an adaptive sharing mechanism. Each core independently measures its dynamic need, and all cache resources are managed to increase utilization, reduce migrations, and lower interference. Per-core performance degradation is bounded while overall throughput is optimized, qualitatively improving the performance of embedded systems where quality of service is an important characteristic.

In single-thread mode, PDAS improves performance on average by 26%, 27%, and 13% over Private, Shared, and NUCA caches, respectively, while reducing internal migrations by an average of 82% compared to the NUCA. Under thread contention, PDAS increases its performance and power advantage over prior work: the average migration reduction over NUCA rises to over 90%, and average IPC improvements over NUCA are 30%, 14%, and 35% for the 2T, 3T, and 4T scenarios.
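The abstract describes the sharing mechanism only at a high level: each core measures its dynamic cache need, and capacity is divided so that throughput is maximized while no core's degradation exceeds a bound. The following sketch illustrates that general idea, not the paper's actual PDAS algorithm; the function name, the demand inputs, and the way-based granularity are all hypothetical.

```python
# Illustrative sketch (NOT the paper's PDAS mechanism): split a shared
# L2 cache's ways among cores in proportion to each core's measured
# demand, while guaranteeing every core a minimum number of ways so
# that its worst-case performance degradation stays bounded.

def partition_ways(demands, total_ways, min_ways):
    """demands: per-core demand metric (e.g., recent miss counts).
    Each core first receives a guaranteed floor of min_ways; the
    remaining ways are split in proportion to demand."""
    n = len(demands)
    assert total_ways >= n * min_ways, "floor must be satisfiable"
    alloc = [min_ways] * n
    spare = total_ways - n * min_ways
    total_demand = sum(demands) or 1  # avoid divide-by-zero
    # proportional split of the spare capacity
    shares = [spare * d / total_demand for d in demands]
    for i in range(n):
        alloc[i] += int(shares[i])
    # hand leftover ways to cores with the largest fractional remainders
    leftover = total_ways - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# one heavily-demanding core gets the largest share; every core keeps
# at least its 2-way floor
print(partition_ways([10, 40, 10, 20], total_ways=16, min_ways=2))
# → [3, 6, 3, 4]
```

In a real design this decision would be recomputed periodically as measured demand shifts with program phases and context switches, which is where the migration-reduction and interference concerns discussed above come in.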




Published In

CASES '05: Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
September 2005
326 pages
ISBN: 159593149X
DOI: 10.1145/1086297


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. CMP
  2. NUCA
  3. PDAS
  4. QOS
  5. adaptive
  6. bandwidth
  7. cache
  8. chip multiprocessor
  9. cluster
  10. data-stream
  11. distributed
  12. embedded
  13. memory wall
  14. migration
  15. non-uniform access
  16. partition
  17. per thread degradation
  18. phase

Qualifiers

  • Article

Conference

CASES05

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%


Cited By

  • (2023) Enterprise-Class Multilevel Cache Design: Low Latency, Huge Capacity, and High Reliability. IEEE Micro 43(1):58-66. DOI: 10.1109/MM.2022.3193642. Online publication date: 1-Jan-2023.
  • (2022) Adaptive Page Migration Policy With Huge Pages in Tiered Memory Systems. IEEE Transactions on Computers 71(1):53-68. DOI: 10.1109/TC.2020.3036686. Online publication date: 1-Jan-2022.
  • (2022) TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1109/SC41404.2022.00085. Online publication date: Nov-2022.
  • (2019) Adaptive memory-side last-level GPU caching. Proceedings of the 46th International Symposium on Computer Architecture, pages 411-423. DOI: 10.1145/3307650.3322235. Online publication date: 22-Jun-2019.
  • (2017) A Survey of Techniques for Cache Partitioning in Multicore Processors. ACM Computing Surveys 50(2):1-39. DOI: 10.1145/3062394. Online publication date: 10-May-2017.
  • (2016) The IBP Replacement Algorithm Based on Process Binding. Software Engineering and Applications 5(3):181-189. DOI: 10.12677/SEA.2016.53020. Online publication date: 2016.
  • (2014) Cooperative cache partitioning for chip multiprocessors. ACM International Conference on Supercomputing 25th Anniversary Volume, pages 402-412. DOI: 10.1145/2591635.2667188. Online publication date: 10-Jun-2014.
  • (2013) A survey on cache tuning from a power/energy perspective. ACM Computing Surveys 45(3):1-49. DOI: 10.1145/2480741.2480749. Online publication date: 3-Jul-2013.
  • (2013) Fair CPU time accounting in CMP+SMT processors. ACM Transactions on Architecture and Code Optimization 9(4):1-25. DOI: 10.1145/2400682.2400709. Online publication date: 20-Jan-2013.
  • (2012) Dynamic QoS management for chip multiprocessors. ACM Transactions on Architecture and Code Optimization 9(3):1-29. DOI: 10.1145/2355585.2355590. Online publication date: 5-Oct-2012.
