ABSTRACT
Most modern chip multiprocessors (CMPs) feature a shared on-chip cache. For multithreaded applications, the sharing reduces communication latency among co-running threads, but it also causes cache contention.
A number of studies have examined the influence of cache sharing on multithreaded applications, but most have concentrated on the design or management of shared caches rather than on a systematic measurement of that influence. Consequently, prior measurements have been constrained by the reliance on simulators, the use of out-of-date benchmarks, and limited coverage of the deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications remains only preliminarily understood.
In this work, we conduct a systematic measurement of this influence on two kinds of commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, and considering a number of potentially important factors at the program, OS, and architecture levels. The measurement yields some surprising results. Contrary to the commonly perceived importance of cache sharing, neither its positive nor its negative effects are significant for most of the program executions, regardless of the type of parallelism, input dataset, architecture, number of threads, or assignment of threads to cores. A detailed analysis reveals that the main reason is a mismatch between the current development and compilation of multithreaded applications and CMP architectures. By transforming the programs in a cache-sharing-aware manner, we observe performance increases of up to 36% when the threads are placed on cores appropriately.
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? In Proceedings of PPoPP '10.