
Fast and fair: data-stream quality of service

Published: 24 September 2005

Abstract

Chip multiprocessors have the potential to exploit thread-level parallelism, particularly in embedded server farms where the number of available threads can be quite high. Unfortunately, both per-core and overall throughput are significantly impacted by the organization of the lowest-level on-chip cache. On-chip caches for CMPs must handle the increased demand and contention of multiple cores. To complicate the problem, cache demand changes dynamically with phase changes, context switches, power-saving features, and assignments to asymmetric cores.

We propose PDAS, a distributed NUCA L2 cache design with an adaptive sharing mechanism. Each core independently measures its dynamic need, and all cache resources are managed to increase utilization, reduce migrations, and lower interference. Per-core performance degradation is bounded while overall throughput is optimized, qualitatively improving the performance of embedded systems where quality of service is an important characteristic.

In single-thread mode, PDAS improves performance on average by 26%, 27%, and 13% over Private, Shared, and NUCA caches, respectively, while reducing internal migrations by an average of 82% compared to the NUCA. Under thread contention, PDAS increases its performance and power advantage over prior work: the average migration reduction over NUCA rises to over 90%, and average IPC improvements over NUCA are 30%, 14%, and 35% for the 2T, 3T, and 4T scenarios.
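The abstract describes the sharing mechanism only at a high level: each core measures its dynamic cache need, and capacity is divided so that throughput is maximized while no core's degradation exceeds a bound. The following sketch illustrates that general idea, not the paper's actual PDAS algorithm; the function name, the demand inputs, and the way-based granularity are all hypothetical.

```python
# Illustrative sketch (NOT the paper's PDAS mechanism): split a shared
# L2 cache's ways among cores in proportion to each core's measured
# demand, while guaranteeing every core a minimum number of ways so
# that its worst-case performance degradation stays bounded.

def partition_ways(demands, total_ways, min_ways):
    """demands: per-core demand metric (e.g., recent miss counts).
    Each core first receives a guaranteed floor of min_ways; the
    remaining ways are split in proportion to demand."""
    n = len(demands)
    assert total_ways >= n * min_ways, "floor must be satisfiable"
    alloc = [min_ways] * n
    spare = total_ways - n * min_ways
    total_demand = sum(demands) or 1  # avoid divide-by-zero
    # proportional split of the spare capacity
    shares = [spare * d / total_demand for d in demands]
    for i in range(n):
        alloc[i] += int(shares[i])
    # hand leftover ways to cores with the largest fractional remainders
    leftover = total_ways - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# one heavily-demanding core gets the largest share; every core keeps
# at least its 2-way floor
print(partition_ways([10, 40, 10, 20], total_ways=16, min_ways=2))
# → [3, 6, 3, 4]
```

In a real design this decision would be recomputed periodically as measured demand shifts with program phases and context switches, which is where the migration-reduction and interference concerns discussed above come in.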




Published In

CASES '05: Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
September 2005
326 pages
ISBN: 159593149X
DOI: 10.1145/1086297


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. CMP
  2. NUCA
  3. PDAS
  4. QOS
  5. adaptive
  6. bandwidth
  7. cache
  8. chip multiprocessor
  9. cluster
  10. data-stream
  11. distributed
  12. embedded
  13. memory wall
  14. migration
  15. non-uniform access
  16. partition
  17. per thread degradation
  18. phase

Qualifiers

  • Article

Conference

CASES05

Acceptance Rates

Overall Acceptance Rate 52 of 230 submissions, 23%


Cited By

  • (2023) Enterprise-Class Multilevel Cache Design: Low Latency, Huge Capacity, and High Reliability. IEEE Micro 43(1):58-66. DOI: 10.1109/MM.2022.3193642. Online publication date: 1-Jan-2023.
  • (2022) Adaptive Page Migration Policy With Huge Pages in Tiered Memory Systems. IEEE Transactions on Computers 71(1):53-68. DOI: 10.1109/TC.2020.3036686. Online publication date: 1-Jan-2022.
  • (2022) TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1109/SC41404.2022.00085. Online publication date: Nov-2022.
  • (2019) Adaptive memory-side last-level GPU caching. Proceedings of the 46th International Symposium on Computer Architecture, pages 411-423. DOI: 10.1145/3307650.3322235. Online publication date: 22-Jun-2019.
  • (2017) A Survey of Techniques for Cache Partitioning in Multicore Processors. ACM Computing Surveys 50(2):1-39. DOI: 10.1145/3062394. Online publication date: 10-May-2017.
  • (2016) The IBP Replacement Algorithm Based on Process Binding. Software Engineering and Applications 5(3):181-189. DOI: 10.12677/SEA.2016.53020. Online publication date: 2016.
  • (2014) Cooperative cache partitioning for chip multiprocessors. ACM International Conference on Supercomputing 25th Anniversary Volume, pages 402-412. DOI: 10.1145/2591635.2667188. Online publication date: 10-Jun-2014.
  • (2013) A survey on cache tuning from a power/energy perspective. ACM Computing Surveys 45(3):1-49. DOI: 10.1145/2480741.2480749. Online publication date: 3-Jul-2013.
  • (2013) Fair CPU time accounting in CMP+SMT processors. ACM Transactions on Architecture and Code Optimization 9(4):1-25. DOI: 10.1145/2400682.2400709. Online publication date: 20-Jan-2013.
  • (2012) Dynamic QoS management for chip multiprocessors. ACM Transactions on Architecture and Code Optimization 9(3):1-29. DOI: 10.1145/2355585.2355590. Online publication date: 5-Oct-2012.
