Robust Parallel Job Scheduling Infrastructure for Service-Oriented Grid Computing Systems

  • Conference paper
Computational Science and Its Applications – ICCSA 2005 (ICCSA 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3483)

Abstract

Grid computing development is trending towards a service-oriented architecture. As service-oriented grid computing systems gain momentum, support for integrated scheduling and fault tolerance becomes of paramount importance. To this end, we propose a scalable framework that loosely couples a dynamic job scheduling approach with a hybrid replication approach to schedule jobs efficiently while also providing fault tolerance. The novelty of the proposed framework is that it uses a passive replication approach under high system load and an active replication approach under low system load. The switch between these two replication methods is performed dynamically and transparently.
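
The load-dependent switch between replication methods can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how load is measured or where the switch occurs, so the load metric (fraction of busy processors), the `high_load_threshold` value, and all names below are assumptions made for illustration. The sketch uses the conventional meanings of the terms, where active replication runs every replica concurrently and passive replication runs one primary with backups on standby.

```python
from dataclasses import dataclass
from enum import Enum


class ReplicationMode(Enum):
    ACTIVE = "active"    # every replica of a job executes concurrently
    PASSIVE = "passive"  # one primary executes; backups take over on failure


@dataclass
class ReplicationPolicy:
    # Hypothetical threshold: fraction of busy processors at or above which
    # the system counts as highly loaded. This value is not from the paper.
    high_load_threshold: float = 0.7

    def select_mode(self, busy: int, total: int) -> ReplicationMode:
        """Pick passive replication under high load, active under low load."""
        if total <= 0:
            raise ValueError("total must be positive")
        load = busy / total
        if load >= self.high_load_threshold:
            return ReplicationMode.PASSIVE
        return ReplicationMode.ACTIVE


def replicas_to_start(mode: ReplicationMode, replication_degree: int) -> int:
    """Number of job copies that begin executing immediately."""
    if mode is ReplicationMode.ACTIVE:
        return replication_degree
    return 1  # passive: only the primary runs until a failure is detected


policy = ReplicationPolicy()
busy_mode = policy.select_mode(busy=9, total=10)
idle_mode = policy.select_mode(busy=2, total=10)
```

Re-evaluating `select_mode` at each scheduling decision would make the switch dynamic, and callers never need to choose a mode themselves. The intuition is that running redundant copies is cheap when processors are idle but competes with queued jobs when the system is busy.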

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Abawajy, J.H. (2005). Robust Parallel Job Scheduling Infrastructure for Service-Oriented Grid Computing Systems. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2005. ICCSA 2005. Lecture Notes in Computer Science, vol 3483. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424925_132

  • DOI: https://doi.org/10.1007/11424925_132

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25863-6

  • Online ISBN: 978-3-540-32309-9

  • eBook Packages: Computer Science, Computer Science (R0)
