ABSTRACT
Shared-memory multiprocessors now dominate platforms ranging from high-end systems to desktop computers. On such platforms, it is well known that the interconnect between the processors and main memory has become a major bottleneck. Bandwidth-aware job scheduling is an effective and relatively easy-to-implement way to relieve bandwidth contention. Previous policies recognized that bandwidth saturation hurts the throughput of parallel jobs, so they scheduled jobs such that the total bandwidth requirement equals the system peak bandwidth. However, we found that fine-grained bandwidth contention still occurs within a scheduling quantum because of irregular fluctuations in a program's memory access intensity, which previous policies mostly ignore.
In this paper, we quantify the impact of bandwidth contention on overall performance. We find that concurrent jobs can achieve higher memory bandwidth utilization, but only at the expense of super-linear performance degradation. Based on this observation, we propose a new workload scheduling policy. Its basic idea is that interference due to bandwidth contention is minimized when bandwidth utilization is kept at the level of the workload's average bandwidth requirement. Our evaluation uses both SPEC 2006 and NPB workloads. Results on randomly generated workloads show that our policy improves system throughput by 4.1% on average over the native OS scheduler, with improvements of up to 11.7%.
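The basic idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that each job has a known, constant bandwidth demand, that one job runs per core, and that the "average bandwidth requirement" target for a quantum is the mean per-job demand multiplied by the number of cores. The function names `pick_co_runners` and `plan_quanta` and the demand figures are hypothetical.

```python
from itertools import combinations


def pick_co_runners(demands, cores, target):
    """Return the group of `cores` jobs whose summed demand is closest to `target`.

    demands: dict mapping job name -> estimated bandwidth demand
             (hypothetical units, e.g. GB/s).
    Brute force over all groups, so only suitable for small job counts.
    """
    size = min(cores, len(demands))
    return min(
        combinations(sorted(demands), size),
        key=lambda group: abs(sum(demands[j] for j in group) - target),
    )


def plan_quanta(demands, cores):
    """Greedily assign jobs to quanta, steering each quantum toward the target.

    Assumed target (not taken from the paper): mean per-job demand
    multiplied by the number of jobs that run concurrently.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    target = cores * sum(demands.values()) / len(demands)
    remaining = dict(demands)
    plan = []
    while remaining:
        group = pick_co_runners(remaining, cores, target)
        plan.append(group)
        for job in group:
            del remaining[job]
    return plan, target


if __name__ == "__main__":
    # Hypothetical demands: two bandwidth-heavy jobs and two light ones.
    jobs = {"a": 8.0, "b": 1.0, "c": 7.0, "d": 2.0}
    plan, target = plan_quanta(jobs, cores=2)
    print(target, plan)
```

With these example demands, the sketch pairs each heavy job with a light one, so every quantum sits at the target instead of running both heavy jobs together. A real scheduler would have to measure demands online, for example with hardware performance counters, and cope with demands that fluctuate within a quantum.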
On mitigating memory bandwidth contention through bandwidth-aware scheduling