DOI: 10.1145/1693453.1693482
research-article

Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?

Published: 09 January 2010

ABSTRACT

Most modern chip multiprocessors (CMPs) feature an on-chip shared cache. For multithreaded applications, sharing reduces communication latency among co-running threads but also causes cache contention.

A number of studies have examined the influence of cache sharing on multithreaded applications, but most have concentrated on the design or management of shared caches rather than on systematic measurement of that influence. Consequently, prior measurements have been constrained by reliance on simulators, out-of-date benchmarks, and limited coverage of the deciding factors. The influence of CMP cache sharing on contemporary multithreaded applications therefore remains only preliminarily understood.

In this work, we systematically measure this influence on two kinds of commodity CMP machines using PARSEC, a recently released CMP benchmark suite, while considering a number of potentially important factors at the program, OS, and architecture levels. The measurements yield some surprising results. Contrary to the commonly perceived importance of cache sharing, neither its positive nor its negative effects are significant for most program executions, regardless of the type of parallelism, input dataset, architecture, number of threads, or assignment of threads to cores. A detailed analysis shows that the main reason is a mismatch between how multithreaded applications are currently developed and compiled and the CMP architectures they run on. By transforming the programs in a cache-sharing-aware manner, we observe a performance increase of up to 36% when the threads are placed on cores appropriately.
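
To make "assignment of threads to cores" concrete, the following is a minimal sketch of two placement policies one might compare: packing threads onto cores that share a cache, or spreading them across caches. It is not taken from the paper. The function name `place_threads`, the `cache_groups` layout, and the four-core topology are all hypothetical, chosen only for illustration.

```python
def place_threads(num_threads, cache_groups, share=True):
    """Return a {thread_index: core_id} placement (illustrative only).

    cache_groups: list of lists of core ids; cores in the same inner
                  list are assumed to share a cache (hypothetical layout).
    share=True:   fill one cache group before moving to the next, so
                  neighbouring threads share a cache.
    share=False:  interleave across groups, so neighbouring threads
                  land on cores with different caches.
    """
    if share:
        # Walk the groups one after another: [0, 1, 2, 3] for [[0, 1], [2, 3]].
        order = [core for group in cache_groups for core in group]
    else:
        # Take the i-th core of every group in turn: [0, 2, 1, 3].
        depth = max(len(group) for group in cache_groups)
        order = [group[i]
                 for i in range(depth)
                 for group in cache_groups
                 if i < len(group)]
    if num_threads > len(order):
        raise ValueError("more threads than available cores")
    return {t: order[t] for t in range(num_threads)}


# Hypothetical machine: cores 0 and 1 share one cache, cores 2 and 3 another.
groups = [[0, 1], [2, 3]]
sharing = place_threads(2, groups, share=True)   # both threads on one cache
spread = place_threads(2, groups, share=False)   # one thread per cache
print(sharing, spread)
```

A mapping like this would still have to be applied through an operating-system affinity mechanism (for example, Linux's `sched_setaffinity`), and the real cache topology would need to be read from the machine rather than hard-coded.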


          Published in

          PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
          January 2010
          372 pages
          ISBN: 9781605588773
          DOI: 10.1145/1693453

          ACM SIGPLAN Notices, Volume 45, Issue 5
          PPoPP '10
          May 2010
          346 pages
          ISSN: 0362-1340
          EISSN: 1558-1160
          DOI: 10.1145/1837853

          Copyright © 2010 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 230 of 1,014 submissions, 23%
