ABSTRACT
Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. They do not know, or take into account, how long a task will take to execute or how many subtasks it will spawn. Moreover, task execution order is typically determined by an underlying task storage data structure and cannot be changed. There is thus potential for optimizing task-parallel executions by providing the scheduling system with information about specific tasks and their preferred execution order.
We investigate generalizations of work-stealing and introduce a framework enabling applications to dynamically provide hints on the nature of specific tasks using scheduling strategies. Strategies can be used to independently control both local task execution and steal order. Strategies allow optimizations on specific tasks, in contrast to more conventional scheduling policies that are typically global in scope. Strategies are composable and allow different, specific scheduling choices for different parts of an application simultaneously. We have implemented a work-stealing system based on our strategy framework. A series of benchmarks demonstrates beneficial effects that can be achieved with scheduling strategies.
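The central idea, per-task strategies that independently control local execution order and steal order, can be sketched as a toy model. This is purely illustrative: all names below are hypothetical, it is not the framework's actual API, and a real work-stealing implementation would use concurrent lock-free structures rather than plain heaps.

```python
import heapq
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Strategy:
    """Hints attached to a task (illustrative, not the paper's API)."""
    local_priority: Callable[[Any], int]  # higher = popped sooner by the owner
    steal_priority: Callable[[Any], int]  # higher = stolen sooner by thieves

class StrategyTaskPool:
    """A single worker's task pool ordered by per-task strategies.

    Each task is indexed in two priority heaps: one governing the
    owner's local pop order, one governing the order in which thieves
    steal. Entries taken from one heap are lazily invalidated in the
    other via a shared liveness map.
    """
    def __init__(self):
        self._local = []   # max-heap via negated keys: (-local_prio, id)
        self._steal = []   # max-heap via negated keys: (-steal_prio, id)
        self._live = {}    # task id -> task; removed once taken anywhere
        self._next_id = 0

    def push(self, task, strategy):
        tid = self._next_id
        self._next_id += 1
        self._live[tid] = task
        heapq.heappush(self._local, (-strategy.local_priority(task), tid))
        heapq.heappush(self._steal, (-strategy.steal_priority(task), tid))

    def _pop_from(self, heap):
        while heap:
            _, tid = heapq.heappop(heap)
            task = self._live.pop(tid, None)
            if task is not None:  # skip entries already taken via the other heap
                return task
        return None

    def pop_local(self):  # called by the owning worker
        return self._pop_from(self._local)

    def steal(self):      # called by a thief
        return self._pop_from(self._steal)
```

For example, a branch-and-bound application might use a strategy with `local_priority` set to search depth (so the owner works depth-first, keeping memory bounded) and `steal_priority` set to the subproblem's bound (so thieves grab the most promising subtrees), giving different orders for local execution and stealing from the same pool.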
Work-stealing with configurable scheduling strategies. PPoPP '13.