
CATS: cache aware task-stealing based on online profiling in multi-socket multi-core architectures

Authors:
Quan Chen, Shanghai Jiao Tong University, Shanghai, China
Minyi Guo, Department of Computer Science and Engineering, Shanghai, China
Zhiyi Huang, University of Otago, Dunedin, New Zealand

ICS '12: Proceedings of the 26th ACM international conference on Supercomputing, June 2012, Pages 163–172. https://doi.org/10.1145/2304576.2304599

Published: 25 June 2012

ABSTRACT

Multi-socket multi-core architectures with a shared cache in each socket have become mainstream for high-performance computing when a single multi-core chip cannot provide enough computing capacity. However, traditional task-stealing schedulers steal tasks at random, which tends to pollute the shared caches and incur severe cache misses. To address this problem, this paper proposes a Cache Aware Task-Stealing (CATS) scheduler, which uses online profiling to make efficient use of the shared caches and schedules tasks that share data to the same socket. CATS adopts an online DAG partitioner, driven by the profiling information, to ensure that tasks with shared data can use the shared cache efficiently. A notable feature of CATS is that it does not require any extra user-provided information. Experimental results show that CATS can improve the performance of memory-bound programs by up to 74.4% compared with the traditional task-stealing scheduler.
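
The contrast the abstract draws between random stealing and keeping data-sharing tasks on one socket can be illustrated with a small sketch. This is not the CATS algorithm: the abstract does not describe the profiler or the DAG partitioner, so the code below only assumes a hypothetical "prefer victims on my own socket" rule. The names Worker, random_victim, socket_first_victim and steal, the task tuples, and the fallback to other sockets are all illustrative assumptions.

```python
import random
from collections import deque


class Worker:
    """A hypothetical worker thread pinned to one core of one socket."""

    def __init__(self, wid, socket):
        self.wid = wid
        self.socket = socket
        # The owner works at the right end; thieves take from the left end.
        self.tasks = deque()


def random_victim(thief, workers, rng):
    """Traditional policy: pick any other worker that has work, uniformly."""
    candidates = [w for w in workers if w is not thief and w.tasks]
    return rng.choice(candidates) if candidates else None


def socket_first_victim(thief, workers, rng):
    """Assumed cache-aware policy: prefer victims that share the thief's socket.

    Falling back to other sockets when the local socket is idle is an
    assumption of this sketch, not something stated in the abstract.
    """
    candidates = [w for w in workers if w is not thief and w.tasks]
    local = [w for w in candidates if w.socket == thief.socket]
    pool = local or candidates
    return rng.choice(pool) if pool else None


def steal(thief, workers, pick_victim, rng):
    """Move one task from a chosen victim to the thief; return it, or None."""
    victim = pick_victim(thief, workers, rng)
    if victim is None:
        return None
    task = victim.tasks.popleft()
    thief.tasks.append(task)
    return task


def demo():
    rng = random.Random(0)
    # Two sockets with two cores each; tasks are (name, shared-data group).
    workers = [Worker(0, 0), Worker(1, 0), Worker(2, 1), Worker(3, 1)]
    workers[0].tasks.extend([("a1", "A"), ("a2", "A")])
    workers[2].tasks.extend([("b1", "B"), ("b2", "B")])
    # Worker 1 sits on socket 0, so it steals a group-A task from worker 0.
    print(steal(workers[1], workers, socket_first_victim, rng))


if __name__ == "__main__":
    demo()
```

With random_victim, the idle worker on socket 0 may pull a group-B task across sockets and evict group-A data from its shared cache; with the socket-first rule it takes group-A work as long as its own socket has any.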

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Measurement
    2. Metrics

Published in
ICS '12: Proceedings of the 26th ACM international conference on Supercomputing
June 2012, 400 pages
ISBN: 9781450313162
DOI: 10.1145/2304576
General Chairs: Utpal Banerjee (University of California at Irvine, USA), Kyle A. Gallivan (Florida State University, USA)
Program Chairs: Gianfranco Bilardi (Università degli Studi di Padova, Italy), Manolis G.H. Katevenis (FORTH and University of Crete, Greece)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache aware
cache misses
multi-socket multi-core
online profiling
task-stealing
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 584 of 2,055 submissions, 28%

Article Metrics
- Total Citations: 44
- Total Downloads: 381
- Downloads (last 12 months): 12
- Downloads (last 6 weeks): 3