Abstract
This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is “work stealing,” in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies.
Specifically, our analysis shows that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is T1/P + O(T∞), where T1 is the minimum serial execution time of the multithreaded computation and T∞ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most S1P, where S1 is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most O(PT∞(1 + nd)Smax), where Smax is the size of the largest activation record of any thread and nd is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.
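The core protocol the abstract describes — each processor works on the bottom of its own deque, and an idle processor steals from the top of a uniformly random victim's deque — can be illustrated with a small single-threaded simulation. This is an illustrative sketch, not the paper's implementation: the function `work_steal_schedule`, its task representation, and the round-based stepping are all simplifying assumptions introduced here, and real schedulers (e.g., Cilk) run the workers concurrently and spawn tasks dynamically.

```python
import random
from collections import deque

def work_steal_schedule(tasks, num_workers, seed=0):
    """Simulate randomized work stealing (hypothetical sketch).

    Each worker owns a double-ended queue. The owner pushes and pops
    work at the bottom; a worker with an empty deque picks a victim
    uniformly at random and steals one task from the top of the
    victim's deque. Returns the executed tasks and the number of
    synchronous rounds taken.
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(num_workers)]
    # Seed worker 0 with all tasks, mimicking a computation that
    # begins with a single root thread.
    for t in tasks:
        deques[0].append(t)
    executed = []
    rounds = 0
    while any(deques):
        for w in range(num_workers):
            if deques[w]:
                executed.append(deques[w].pop())          # owner works at the bottom
            else:
                victim = rng.randrange(num_workers)       # random steal attempt
                if victim != w and deques[victim]:
                    deques[w].append(deques[victim].popleft())  # thief takes the top
        rounds += 1
    return executed, rounds
```

Stealing from the top (the oldest work) is the design choice behind both the space and communication bounds: the oldest tasks tend to be the largest, so steals are rare, and the owner's bottom-end accesses stay local and synchronization-free.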
Index Terms
Scheduling multithreaded computations by work stealing