Workload management of cooperatively federated computing clusters

The Journal of Supercomputing

Abstract

Cooperative resource sharing enables distinct organizations to form a federation of computing resources. The motivation for cooperation is that organizations whose local resources see irregular usage patterns can serve each other by trading unused CPU cycles. Resource sharing thus lets each organization purchase resources at a feasible level while still meeting peak computational throughput requirements. The federation results in a community grid that must be managed, and a functional broker is deployed to facilitate remote resource access within it. A major issue is correlation in job arrivals caused by seasonal usage and/or coincident resource demand patterns. These correlations produce highly bursty job arrivals, causing the broker's job queue to grow to the point where its performance is severely impaired. Since job arrivals cannot be controlled, management strategies must be employed to admit jobs in a manner that sustains a fair level of resource allocation performance at all participating organizations in the community. In this paper, we present a theoretical analysis of the effect of job traffic burstiness on resource allocation performance in order to derive the general job management strategies to be employed. Based on the analysis, we define and justify a set of job management strategies that allow the resource broker to cope with overload conditions caused by job arrival correlations.

Author information

Corresponding author

Correspondence to Wentong Cai.

About this article

Cite this article

Xavier, P., Cai, W. & Lee, BS. Workload management of cooperatively federated computing clusters. J Supercomput 36, 309–322 (2006). https://doi.org/10.1007/s11227-006-8300-7

  • DOI: https://doi.org/10.1007/s11227-006-8300-7
