ABSTRACT
Most modern chip multiprocessors (CMPs) feature a shared on-chip cache. For multithreaded applications, the sharing reduces communication latency among co-running threads, but it also causes cache contention.
A number of studies have examined the influence of cache sharing on multithreaded applications, but most have concentrated on the design or management of shared caches rather than on a systematic measurement of that influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and limited coverage of the deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood.
In this work, we conduct a systematic measurement of this influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, and considering a number of potentially important factors at the program, OS, and architecture levels. The measurement yields some surprising results. Contrary to the commonly perceived importance of cache sharing, neither its positive nor its negative effects are significant for most of the program executions, regardless of the type of parallelism, input dataset, architecture, number of threads, or assignment of threads to cores. A detailed analysis reveals that the main reason is a mismatch between the current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe performance increases of up to 36% when the threads are placed on cores appropriately.
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In Proceedings of PPoPP '10.