ABSTRACT
Cache sharing among processors is important for Chip Multiprocessors because it reduces inter-thread latency, but it also introduces cache contention, which degrades program performance considerably. Recent studies have shown that job co-scheduling can effectively alleviate the contention, but it remains an open question how to find optimal co-schedules efficiently. Answering this question is critical for determining the potential of a co-scheduling system. This paper presents a theoretical analysis of the complexity of co-scheduling, proving its NP-completeness. Furthermore, for the special case of two sharers per chip, we propose an algorithm that finds the optimal co-schedules in polynomial time. For more complex cases, we design and evaluate a sequence of approximation algorithms, among which the hierarchical matching algorithm produces near-optimal schedules and shows good scalability. This study facilitates the evaluation of co-scheduling systems and offers some techniques directly usable in proactive job co-scheduling.
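To make the two-sharer case concrete, the sketch below pairs jobs so that the summed co-run degradation is smallest. It is only an illustration of the objective under stated assumptions: the abstract does not describe the paper's polynomial-time algorithm, so this code substitutes an exhaustive dynamic program over job subsets, which is exact but exponential and suitable only for small job sets. The function name `optimal_pairs` and the symmetric matrix `cost` (a hypothetical measure of how much jobs `i` and `j` slow each other down when sharing a cache) are assumptions made for the example.

```python
from functools import lru_cache


def optimal_pairs(cost):
    """Return (total_cost, pairs) for the cheapest way to pair up all jobs.

    cost[i][j] is a hypothetical degradation value for co-running jobs i and j
    on one dual-core chip. This is an exhaustive subset DP, not the paper's
    polynomial-time method.
    """
    n = len(cost)
    if n % 2:
        raise ValueError("need an even number of jobs")

    @lru_cache(maxsize=None)
    def best(mask):
        # mask has bit k set when job k is still unpaired.
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1  # lowest-numbered unpaired job
        rest = mask & ~(1 << i)
        result = None
        for j in range(i + 1, n):
            if rest & (1 << j):
                sub_cost, sub_pairs = best(rest & ~(1 << j))
                total = cost[i][j] + sub_cost
                if result is None or total < result[0]:
                    result = (total, ((i, j),) + sub_pairs)
        return result

    total, pairs = best((1 << n) - 1)
    return total, list(pairs)


# Hypothetical degradation values for four jobs.
example_cost = [
    [0, 5, 1, 4],
    [5, 0, 4, 1],
    [1, 4, 0, 5],
    [4, 1, 5, 0],
]
example_total, example_pairs = optimal_pairs(example_cost)
```

With these made-up values, pairing job 0 with job 2 and job 1 with job 3 gives the smallest total. For realistic job counts the exhaustive search would need to be replaced by a polynomial-time matching algorithm.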