ABSTRACT
Multi-socket multi-core architectures with a shared cache in each socket have become mainstream when a single multi-core chip cannot provide enough computing capacity for high-performance computing. However, traditional task-stealing schedulers steal tasks at random, which tends to pollute the shared cache and incur severe cache misses. To address this problem, this paper proposes a Cache Aware Task-Stealing (CATS) scheduler, which uses online profiling to schedule tasks that share data onto the same socket so that the shared cache is used efficiently. CATS adopts an online DAG partitioner, driven by the profiling information, to ensure that tasks with shared data can use the shared cache efficiently. A key novelty of CATS is that it requires no extra user-provided information. Experimental results show that CATS improves the performance of memory-bound programs by up to 74.4% compared with a traditional task-stealing scheduler.
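To make the contrast with random stealing concrete, the following is a minimal sketch of a socket-aware victim choice. It is not the CATS algorithm itself, which the abstract does not specify: there is no profiling or DAG partitioning here, and the names `Worker`, `pick_victim`, and `steal` are hypothetical. It only illustrates the assumption that a thief which prefers victims on its own socket keeps related tasks near the same shared cache.

```python
import random
from collections import deque


class Worker:
    """Hypothetical worker: one per core, tagged with the socket it runs on."""

    def __init__(self, wid, socket):
        self.wid = wid
        self.socket = socket
        self.tasks = deque()  # this worker's local task queue


def pick_victim(thief, workers, rng=random):
    """Choose a victim for `thief`, or return None if no one has work.

    Assumption for illustration: victims on the thief's own socket are
    preferred; other sockets are considered only when the local socket
    has no stealable work. A purely random scheduler would skip the
    same-socket filter.
    """
    candidates = [w for w in workers if w is not thief and w.tasks]
    same_socket = [w for w in candidates if w.socket == thief.socket]
    pool = same_socket or candidates
    return rng.choice(pool) if pool else None


def steal(thief, workers, rng=random):
    """Move one task from the chosen victim to the thief and return it."""
    victim = pick_victim(thief, workers, rng)
    if victim is None:
        return None
    task = victim.tasks.popleft()  # take the oldest task in the victim's queue
    thief.tasks.append(task)
    return task
```

With this policy a task crosses sockets only when the thief's socket has run dry, which is the kind of behaviour a cache-aware scheduler aims for; how CATS decides which tasks share data is described in the paper, not in this sketch.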
Index Terms
- CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures