Abstract
We propose a new approach, called cluster-based search (CBS), for scheduling large task graphs in parallel on a heterogeneous cluster of workstations connected by a high-speed network (e.g., using an ATM switch at OC-3 speed). The CBS algorithm uses a parallel random neighborhood search that refines multiple distinct initial schedules simultaneously on different workstations. The workstations communicate periodically to exchange the best solutions found so far, directing the search toward more promising regions of the search space. Machine heterogeneity is exploited through a biased partitioning of the search space. The parallel random neighborhood search is fault-tolerant in that the workload of a failed workstation is automatically redistributed to the other workstations so that the search can continue. We have implemented the CBS algorithm as a core function of our ongoing development of single system image (SSI) middleware for a Sun workstation cluster.
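The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the CBS algorithm itself. It simulates the workstations sequentially in one process, ignores communication costs, and uses a toy task graph. All names (`TASKS`, `ORDER`, `makespan`, `refine`, `cbs_sketch`, `speeds`, `fail_at`) are hypothetical. As a stand-in for the biased partitioning of the search space, faster workers simply receive a proportionally larger share of the per-round search budget; when a worker fails, its share is spread over the survivors.

```python
import random

# Hypothetical toy task graph (not from the paper): task -> (cost, predecessors).
TASKS = {
    "a": (2, []),
    "b": (3, ["a"]),
    "c": (4, ["a"]),
    "d": (2, ["b", "c"]),
    "e": (3, ["c"]),
    "f": (1, ["d", "e"]),
}
ORDER = ["a", "b", "c", "d", "e", "f"]  # one fixed topological order
NUM_PROCS = 2  # processors of the target machine being scheduled for


def makespan(assign):
    """Schedule length when tasks run in ORDER on their assigned processor."""
    proc_free = [0] * NUM_PROCS
    finish = {}
    for t in ORDER:
        cost, preds = TASKS[t]
        ready = max((finish[p] for p in preds), default=0)
        start = max(ready, proc_free[assign[t]])
        finish[t] = start + cost
        proc_free[assign[t]] = finish[t]
    return max(finish.values())


def refine(assign, steps, rng):
    """Random neighborhood search: reassign one random task, keep if not worse."""
    best, best_len = dict(assign), makespan(assign)
    for _ in range(steps):
        cand = dict(best)
        cand[rng.choice(ORDER)] = rng.randrange(NUM_PROCS)
        cand_len = makespan(cand)
        if cand_len <= best_len:
            best, best_len = cand, cand_len
    return best, best_len


def cbs_sketch(speeds, rounds=5, base_steps=20, fail_at=None, seed=0):
    """Simulate the search workers one after another.

    speeds:  assumed relative speed of each worker (heterogeneity).
    fail_at: optional (round, worker) pair at which a worker is lost.
    """
    rng = random.Random(seed)
    # Each worker starts from its own random initial schedule.
    workers = {
        w: {t: rng.randrange(NUM_PROCS) for t in ORDER}
        for w in range(len(speeds))
    }
    total_budget = base_steps * len(speeds)  # search steps per round, all workers
    global_best, global_len = None, float("inf")
    for r in range(rounds):
        if fail_at and fail_at[0] == r:
            workers.pop(fail_at[1], None)  # worker lost; budget is re-split below
        total_speed = sum(speeds[w] for w in workers)
        for w in workers:
            # Biased share: faster workers take more of the fixed budget.
            steps = max(1, round(total_budget * speeds[w] / total_speed))
            workers[w], length = refine(workers[w], steps, rng)
            if length < global_len:
                global_best, global_len = dict(workers[w]), length
        # Periodic exchange: every surviving worker adopts the best schedule so far.
        for w in workers:
            workers[w] = dict(global_best)
    return global_best, global_len
```

For example, `cbs_sketch([1.0, 2.0, 4.0], fail_at=(2, 1))` runs three simulated workers of different speeds and drops worker 1 at round 2; the remaining two continue with the full per-round budget. A real deployment would run the workers on separate machines and exchange schedules over the network, which this sketch does not attempt.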
Cite this article
Kwok, YK. Fault-Tolerant Parallel Scheduling of Tasks on a Heterogeneous High-Performance Workstation Cluster. The Journal of Supercomputing 19, 299–314 (2001). https://doi.org/10.1023/A:1011186732749
DOI: https://doi.org/10.1023/A:1011186732749