DOI: 10.1145/2442516.2442562
Poster

Work-stealing with configurable scheduling strategies

Published: 23 February 2013

Abstract

Work-stealing systems are typically oblivious to the nature of the tasks they schedule. They neither know nor take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, task execution order is typically fixed by the underlying task-storage data structure and cannot be changed. This leaves room for optimizing task-parallel executions by giving the scheduling system information about specific tasks and their preferred execution order.
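To make the point about a fixed execution order concrete, here is a minimal sequential Python sketch of the storage discipline commonly used in work-stealing schedulers: the owning worker pushes and pops at one end of a double-ended queue (LIFO), while thieves steal from the other end (FIFO). The class name `WorkStealingDeque` is hypothetical, and the sketch deliberately omits all synchronization; it only illustrates that the order is dictated by the data structure, not by anything known about the tasks.

```python
from collections import deque


class WorkStealingDeque:
    """Sequential model of a conventional work-stealing deque (no concurrency).

    The owner pushes and pops at the bottom (LIFO); thieves steal from the
    top (FIFO). The structure, not the tasks, determines execution order.
    """

    def __init__(self):
        self._tasks = deque()

    def push(self, task):
        # Owner spawns a new task at the bottom.
        self._tasks.append(task)

    def pop(self):
        # Owner takes the most recently spawned task, or None if empty.
        return self._tasks.pop() if self._tasks else None

    def steal(self):
        # A thief takes the oldest task, or None if empty.
        return self._tasks.popleft() if self._tasks else None


d = WorkStealingDeque()
for name in ["a", "b", "c", "d"]:
    d.push(name)
print(d.pop())    # d
print(d.steal())  # a
```

Whatever the tasks represent, the owner always runs the newest one and a thief always takes the oldest one.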
We investigate generalizations of work-stealing and introduce a framework that enables applications to dynamically provide hints on the nature of specific tasks in the form of scheduling strategies. Strategies can independently control both the local task execution order and the steal order. They allow optimizations on specific tasks, in contrast to more conventional scheduling policies, which are typically global in scope. Strategies are composable and allow different, specific scheduling choices for different parts of an application simultaneously. We have implemented a work-stealing system based on our strategy framework, and a series of benchmarks demonstrates the beneficial effects that can be achieved with scheduling strategies.
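The idea of controlling local and steal order independently can be sketched by attaching a hint object to each task and letting the per-worker storage consult it: one ordering for the owner, a separate one for thieves. This is an illustrative Python sketch under stated assumptions, not the framework described in the poster. `Strategy`, `Task`, `StrategyTaskStore`, and the numeric `local_priority`/`steal_priority` fields are hypothetical names; priorities are plain numbers where larger wins; and a linear scan stands in for the concurrent data structures a real work-stealing system would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical per-task scheduling hint; larger values are preferred."""
    local_priority: float  # order in which the owning worker runs its own tasks
    steal_priority: float  # order in which thieves take tasks from this worker


@dataclass
class Task:
    name: str
    strategy: Strategy


class StrategyTaskStore:
    """Sequential model of one worker's task storage.

    Execution order comes from each task's strategy rather than from the
    structure itself. A linear scan keeps the sketch short; it is not how a
    scalable concurrent implementation would store tasks.
    """

    def __init__(self):
        self._tasks = []

    def push(self, task):
        self._tasks.append(task)

    def _take(self, key):
        # Remove and return the task whose strategy maximizes `key`, or None.
        if not self._tasks:
            return None
        best = max(range(len(self._tasks)),
                   key=lambda i: key(self._tasks[i].strategy))
        return self._tasks.pop(best)

    def pop(self):
        """Owner: take the task with the highest local priority."""
        return self._take(lambda s: s.local_priority)

    def steal(self):
        """Thief: take the task with the highest steal priority."""
        return self._take(lambda s: s.steal_priority)


# Hypothetical subproblems: the owner prefers the most promising one locally,
# while thieves are steered toward the one expected to carry the most work.
store = StrategyTaskStore()
store.push(Task("sub1", Strategy(local_priority=-10, steal_priority=5)))
store.push(Task("sub2", Strategy(local_priority=-3, steal_priority=1)))
store.push(Task("sub3", Strategy(local_priority=-7, steal_priority=9)))
print(store.pop().name)    # sub2 (highest local priority, -3)
print(store.steal().name)  # sub3 (highest steal priority, 9)
```

In this sketch, giving the task with spawn index `i` the strategy `Strategy(local_priority=i, steal_priority=-i)` reproduces the LIFO/FIFO order of a conventional deque, which is one way to see strategy-based scheduling as a generalization of ordinary work-stealing.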

Published In

PPoPP '13: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2013
332 pages
ISBN:9781450319225
DOI:10.1145/2442516
Also published in:
ACM SIGPLAN Notices, Volume 48, Issue 8 (PPoPP '13)
August 2013
309 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2517327

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. priorities
  2. scheduler hints
  3. strategies
  4. work-stealing

Qualifiers

  • Poster

Conference

PPoPP '13
Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 11
  • Downloads (last 6 weeks): 3
Reflects downloads up to 17 Feb 2025.

Cited By

  • (2020) Handling Data Skew for Aggregation in Spark SQL Using Task Stealing. International Journal of Parallel Programming. DOI: 10.1007/s10766-020-00657-z. Online publication date: 25-Mar-2020
  • (2019) Fairness in responsive parallelism. Proceedings of the ACM on Programming Languages 3:ICFP (1-30). DOI: 10.1145/3341685. Online publication date: 26-Jul-2019
  • (2017) DemoMatch: API discovery from demonstrations. ACM SIGPLAN Notices 52:6 (64-78). DOI: 10.1145/3140587.3062386. Online publication date: 14-Jun-2017
  • (2017) Responsive parallel computation: bridging competitive and cooperative threading. ACM SIGPLAN Notices 52:6 (677-692). DOI: 10.1145/3140587.3062370. Online publication date: 14-Jun-2017
  • (2017) Responsive parallel computation: bridging competitive and cooperative threading. Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (677-692). DOI: 10.1145/3062341.3062370. Online publication date: 14-Jun-2017
  • (2017) Trends in Data Locality Abstractions for HPC Systems. IEEE Transactions on Parallel and Distributed Systems 28:10 (3007-3020). DOI: 10.1109/TPDS.2017.2703149. Online publication date: 1-Oct-2017
  • (2017) SWAS: Stealing Work Using Approximate System-Load Information. 2017 46th International Conference on Parallel Processing Workshops (ICPPW) (309-318). DOI: 10.1109/ICPPW.2017.51. Online publication date: Aug-2017
  • (2015) Partial evaluation of machine code. ACM SIGPLAN Notices 50:10 (860-879). DOI: 10.1145/2858965.2814321. Online publication date: 23-Oct-2015
  • (2015) AutoMO: automatic inference of memory order parameters for C/C++11. ACM SIGPLAN Notices 50:10 (221-240). DOI: 10.1145/2858965.2814286. Online publication date: 23-Oct-2015
  • (2014) Data structures for task-based priority scheduling. ACM SIGPLAN Notices 49:8 (379-380). DOI: 10.1145/2692916.2555278. Online publication date: 6-Feb-2014