Fault-Tolerant Parallel Scheduling of Tasks on a Heterogeneous High-Performance Workstation Cluster

Kwok, Yu-Kwong

doi:10.1023/A:1011186732749

Published: July 2001

Volume 19, pages 299–314, (2001)

The Journal of Supercomputing

Yu-Kwong Kwok¹

Abstract

We propose a new approach, called cluster-based search (CBS), for scheduling large task graphs in parallel on a heterogeneous cluster of workstations connected by a high-speed network (e.g., using an ATM switch at OC-3 speed). The CBS algorithm uses a parallel random neighborhood search which works by refining multiple different initial schedules simultaneously using different workstations. The workstations communicate periodically to exchange their best solutions found thus far in order to direct the search to more promising regions in the search space. Heterogeneity of machines is exploited by the biased partitioning of the search space. The parallel random neighborhood search is fault-tolerant in that the workload of a failed workstation is automatically redistributed to other workstations so that the search can continue. We have implemented the CBS algorithm as a core function in our ongoing development of single system image (SSI) middleware for a Sun workstation cluster.
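
The four ideas named in the abstract (random neighborhood search from several initial schedules, periodic exchange of the best solution, biased partitioning, and redistribution after a failure) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the CBS algorithm as published: the abstract gives no algorithmic detail, so every function name, parameter, and design choice below is hypothetical. In particular, it assumes that tasks are numbered in topological order, that the target processors are identical, that communication costs are ignored, that "biased partitioning" means a faster workstation may perturb a larger share of the tasks, and that workstations are simulated one after another rather than run in parallel.

```python
import random

def makespan(assign, cost, preds, n_procs):
    """Finish time of a schedule; tasks are assumed numbered in topological order."""
    proc_free = [0.0] * n_procs
    finish = [0.0] * len(cost)
    for t in range(len(cost)):
        ready = max((finish[p] for p in preds[t]), default=0.0)
        start = max(ready, proc_free[assign[t]])
        finish[t] = start + cost[t]
        proc_free[assign[t]] = finish[t]
    return max(finish)

def partition(tasks, speeds):
    """Biased partitioning (assumed form): faster workers get a larger share of tasks."""
    workers = sorted(speeds)
    total = sum(speeds.values())
    shares, start = {}, 0
    for i, w in enumerate(workers):
        if i == len(workers) - 1:
            end = len(tasks)  # the last worker takes whatever remains
        else:
            end = start + round(len(tasks) * speeds[w] / total)
        shares[w] = tasks[start:end]
        start = end
    return shares

def local_search(assign, share, cost, preds, n_procs, steps, rng):
    """Random neighborhood search that only moves tasks listed in `share`."""
    best, best_len = list(assign), makespan(assign, cost, preds, n_procs)
    for _ in range(steps if share else 0):
        cand = list(best)
        cand[rng.choice(share)] = rng.randrange(n_procs)  # move one task
        cand_len = makespan(cand, cost, preds, n_procs)
        if cand_len <= best_len:
            best, best_len = cand, cand_len
    return best, best_len

def cbs_sketch(cost, preds, n_procs, speeds, rounds=20, steps=50,
               failures=None, seed=0):
    """Simulate the search; `failures` maps a round index to the worker that fails."""
    rng = random.Random(seed)
    alive = dict(speeds)
    tasks = list(range(len(cost)))
    # Each worker starts from a different random initial schedule.
    current = {w: [rng.randrange(n_procs) for _ in tasks] for w in alive}
    best, best_len = None, float("inf")
    for r in range(rounds):
        failed = (failures or {}).get(r)
        if failed in alive and len(alive) > 1:
            del alive[failed]
            del current[failed]
        # Re-partitioning over the survivors redistributes a failed worker's share.
        shares = partition(tasks, alive)
        results = {w: local_search(current[w], shares[w], cost, preds,
                                   n_procs, steps, rng) for w in alive}
        # Periodic exchange: every worker continues from the best schedule so far.
        winner = min(results, key=lambda w: results[w][1])
        if results[winner][1] < best_len:
            best, best_len = results[winner]
        current = {w: list(best) for w in alive}
    return best, best_len

if __name__ == "__main__":
    cost = [2, 3, 3, 4, 2, 1]                     # hypothetical task costs
    preds = [[], [0], [0], [1], [2], [3, 4]]      # hypothetical precedence lists
    speeds = {"ws1": 1.0, "ws2": 3.0}             # hypothetical relative speeds
    print(cbs_sketch(cost, preds, 2, speeds, failures={3: "ws1"}))
```

Recomputing the partition at the start of every round is the simplest way to model redistribution: once a workstation is dropped from `alive`, its tasks are automatically covered by the remaining shares and the search continues without any special recovery path.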

Author information

Authors and Affiliations

Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong
Yu-Kwong Kwok

About this article

Cite this article

Kwok, YK. Fault-Tolerant Parallel Scheduling of Tasks on a Heterogeneous High-Performance Workstation Cluster. The Journal of Supercomputing 19, 299–314 (2001). https://doi.org/10.1023/A:1011186732749

Issue Date: July 2001
DOI: https://doi.org/10.1023/A:1011186732749
