Robust Parallel Job Scheduling Infrastructure for Service-Oriented Grid Computing Systems

Abawajy, J. H.

doi:10.1007/11424925_132

J. H. Abawajy

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3483)

Included in the following conference series:

International Conference on Computational Science and Its Applications


Abstract

Grid computing development is moving toward a service-oriented architecture. As service-oriented grid computing systems gain momentum, deploying integrated support for scheduling and fault tolerance becomes of paramount importance. To this end, we propose a scalable framework that loosely couples a dynamic job scheduling approach with a hybrid replication approach, so that jobs are scheduled efficiently while fault tolerance is provided at the same time. The novelty of the proposed framework is that it uses passive replication under high system load and active replication under low system load. The switch between these two replication methods is performed dynamically and transparently.
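The load-dependent switch described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the names `Site`, `choose_mode`, and `plan_job`, the use of mean utilization as the "system load", the 0.7 threshold, and the choice of the least-loaded sites as replicas are all assumptions made here for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class ReplicationMode(Enum):
    ACTIVE = "active"    # every replica executes the job concurrently
    PASSIVE = "passive"  # one primary executes; backups start only on failure


@dataclass
class Site:
    name: str
    load: float  # hypothetical utilization in [0.0, 1.0]


def choose_mode(system_load: float, threshold: float = 0.7) -> ReplicationMode:
    """Passive replication under high load, active replication under low load.

    The threshold value is an assumption, not a figure from the paper.
    """
    if system_load >= threshold:
        return ReplicationMode.PASSIVE
    return ReplicationMode.ACTIVE


def plan_job(job_id: str, sites: list[Site], replicas: int = 2,
             threshold: float = 0.7) -> dict:
    """Choose replica sites for a job and decide which ones run immediately."""
    if not sites:
        raise ValueError("at least one site is required")

    # Assumed definition of system load: mean utilization across all sites.
    system_load = sum(s.load for s in sites) / len(sites)
    mode = choose_mode(system_load, threshold)

    # Assumed placement rule: use the least-loaded sites as replicas.
    chosen = [s.name for s in sorted(sites, key=lambda s: s.load)[:replicas]]

    if mode is ReplicationMode.ACTIVE:
        running, standby = chosen, []
    else:
        running, standby = chosen[:1], chosen[1:]

    return {"job": job_id, "mode": mode, "running": running, "standby": standby}


if __name__ == "__main__":
    idle = [Site("a", 0.2), Site("b", 0.3), Site("c", 0.1)]
    busy = [Site("a", 0.9), Site("b", 0.8), Site("c", 0.95)]
    print(plan_job("job-1", idle))
    print(plan_job("job-2", busy))
```

Because the mode is recomputed from the current load on every call, the caller never selects a replication method explicitly, which is one simple way to make the switch dynamic and transparent in the sense the abstract describes.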



Author information

Authors and Affiliations

School of Information Technology, Deakin University, Geelong, VIC, Australia
J. H. Abawajy


Editor information

Editors and Affiliations

Department of Mathematics and Computer Science, University of Perugia, via Vanvitelli, 1, I-06123, Perugia, Italy
Osvaldo Gervasi
Department of Computer Science, University of Calgary, 2500 University Drive N.W., T2N 1N4, Calgary, AB, Canada
Marina L. Gavrilova
William Norris Professor, Head of the Computer Science and Engineering Department, University of Minnesota, USA
Vipin Kumar
Department of Chemistry, University of Perugia, Via Elce di Sotto, 8, I-06123, Perugia, Italy
Antonio Laganá
Institute of High Performance Computing, IHCP, 1 Science Park Road, 01-01 The Capricorn, Singapore Science Park II, 117528, Singapore
Heow Pueh Lee
School of Computing, Soongsil University, Seoul, Korea
Youngsong Mun
Clayton School of IT, Monash University, 3800, Clayton, Australia
David Taniar
OptimaNumerics Ltd, Belfast, United Kingdom
Chih Jeng Kenneth Tan


About this paper

Cite this paper

Abawajy, J.H. (2005). Robust Parallel Job Scheduling Infrastructure for Service-Oriented Grid Computing Systems. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2005. ICCSA 2005. Lecture Notes in Computer Science, vol 3483. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424925_132


DOI: https://doi.org/10.1007/11424925_132
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25863-6
Online ISBN: 978-3-540-32309-9
eBook Packages: Computer Science, Computer Science (R0)
