Fault-Tolerant Parallel Scheduling of Tasks on a Heterogeneous High-Performance Workstation Cluster

The Journal of Supercomputing

Abstract

We propose a new approach, called cluster-based search (CBS), for scheduling large task graphs in parallel on a heterogeneous cluster of workstations connected by a high-speed network (e.g., an ATM switch at OC-3 speed). The CBS algorithm uses a parallel random neighborhood search that refines multiple different initial schedules simultaneously on different workstations. The workstations communicate periodically to exchange the best solutions found so far, which directs the search toward more promising regions of the search space. Heterogeneity of the machines is exploited by biased partitioning of the search space. The parallel random neighborhood search is fault-tolerant in that the workload of a failed workstation is automatically redistributed to the other workstations so that the search can continue. We have implemented the CBS algorithm as a core function of our ongoing development of single system image (SSI) middleware for a Sun workstation cluster.
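
The abstract gives only a high-level description of CBS, so the following is a minimal single-process sketch of the general idea and not the published algorithm. Several simulated searchers each refine their own random initial schedule, they periodically exchange the best schedule found, faster searchers receive a larger share of the search effort, and the share of a failed searcher is redistributed to the survivors. Everything concrete here is an assumption made for illustration: the names `makespan` and `cbs_sketch`, the parameters `power`, `steps_per_unit`, and `fail`, the cost model (communication costs are ignored and tasks run in index order, which is assumed to be topological), and the use of a biased split of search steps in place of the paper's biased partitioning of the search space.

```python
import random

def makespan(assign, cost, deps, speeds):
    """Finish time when tasks run in index order (assumed topological)
    on their assigned machines; communication costs are ignored."""
    free_at = [0.0] * len(speeds)          # time each machine becomes idle
    finish = [0.0] * len(cost)             # finish time of each task
    for t, m in enumerate(assign):
        start = max([free_at[m]] + [finish[p] for p in deps[t]])
        finish[t] = start + cost[t] / speeds[m]
        free_at[m] = finish[t]
    return max(finish)

def cbs_sketch(cost, deps, speeds, power, rounds=20, steps_per_unit=30,
               fail=None, seed=0):
    """Toy simulation of a cooperative random neighborhood search.

    power[w] is the relative speed of searcher w (hypothetical parameter).
    fail=(round, searcher) simulates one searcher crashing; at least one
    searcher is assumed to survive.
    """
    rng = random.Random(seed)
    n_machines = len(speeds)
    score = lambda a: makespan(a, cost, deps, speeds)
    # Each searcher starts from a different random initial schedule.
    current = {w: [rng.randrange(n_machines) for _ in cost]
               for w in range(len(power))}
    budget = steps_per_unit * sum(power)   # total search steps per round
    for r in range(rounds):
        if fail is not None and r == fail[0]:
            current.pop(fail[1], None)     # the searcher fails and is dropped
        # The fixed budget is split in proportion to the power of the live
        # searchers, so a failed searcher's share goes to the survivors.
        live_power = sum(power[w] for w in current)
        for w in sorted(current):
            for _ in range(int(budget * power[w] / live_power)):
                # Random neighborhood move: reassign one random task.
                cand = list(current[w])
                cand[rng.randrange(len(cand))] = rng.randrange(n_machines)
                if score(cand) <= score(current[w]):
                    current[w] = cand
        # Periodic exchange: searchers adopt the best schedule if it beats theirs.
        best = min(current.values(), key=score)
        for w in current:
            if score(best) < score(current[w]):
                current[w] = list(best)
    best = min(current.values(), key=score)
    return best, score(best)
```

A call such as `cbs_sketch(cost, deps, speeds, power=[1, 2, 1], fail=(3, 1))` simulates three searchers in which the fastest one fails at round 3; from that round on, the two survivors each run twice as many steps per round, so the total search effort is unchanged.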

About this article

Cite this article

Kwok, YK. Fault-Tolerant Parallel Scheduling of Tasks on a Heterogeneous High-Performance Workstation Cluster. The Journal of Supercomputing 19, 299–314 (2001). https://doi.org/10.1023/A:1011186732749

  • DOI: https://doi.org/10.1023/A:1011186732749
