A resource management and fault tolerance services in grid computing

https://doi.org/10.1016/j.jpdc.2005.05.026

Abstract

In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor in computing performance. In this paper, we propose a resource manager for optimal resource selection. Using a genetic algorithm, our resource manager automatically selects, from the candidate resources, the set of resources that achieves optimal performance. The probability of a failure is typically higher in grid computing than in traditional parallel computing, and a resource failure can be fatal to job execution. A fault tolerance service is therefore essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address these issues, we also propose a fault tolerance service that satisfies QoS requirements. We extend the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that the submitted jobs complete and that job execution performance improves through job migration even if some failures occur.
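The abstract does not specify how the genetic algorithm encodes or scores a resource set, so the following is only a minimal sketch of the general idea of selecting a fixed-size subset of candidate resources with a genetic algorithm. The `Resource` fields, the fitness function (speed weighted by availability), the bit-vector encoding, and the crossover, repair, and mutation operators are all illustrative assumptions, not the paper's method.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    speed: float         # hypothetical relative CPU speed
    availability: float  # hypothetical probability the resource stays up (0..1)


def fitness(selection, resources, needed):
    """Assumed fitness: sum of speed * availability; 0 if the size is wrong."""
    chosen = [r for bit, r in zip(selection, resources) if bit]
    if len(chosen) != needed:
        return 0.0
    return sum(r.speed * r.availability for r in chosen)


def random_selection(n, needed, rng):
    """Random bit vector of length n with exactly `needed` bits set."""
    picked = set(rng.sample(range(n), needed))
    return [1 if i in picked else 0 for i in range(n)]


def repair(child, needed, rng):
    """Flip random bits until exactly `needed` resources are selected."""
    ones = [i for i, bit in enumerate(child) if bit]
    zeros = [i for i, bit in enumerate(child) if not bit]
    rng.shuffle(ones)
    rng.shuffle(zeros)
    while len(ones) > needed:
        child[ones.pop()] = 0
    while len(ones) < needed:
        i = zeros.pop()
        child[i] = 1
        ones.append(i)
    return child


def crossover(a, b, needed, rng):
    """Uniform crossover of two parents, followed by repair."""
    child = [rng.choice(pair) for pair in zip(a, b)]
    return repair(child, needed, rng)


def mutate(child, rng, rate=0.1):
    """With probability `rate`, swap one selected resource for an unselected one."""
    if rng.random() < rate:
        ones = [i for i, bit in enumerate(child) if bit]
        zeros = [i for i, bit in enumerate(child) if not bit]
        if ones and zeros:
            child[rng.choice(ones)] = 0
            child[rng.choice(zeros)] = 1
    return child


def select_resources(resources, needed, generations=50, pop_size=30, seed=0):
    """Return `needed` resources chosen by a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    n = len(resources)
    population = [random_selection(n, needed, rng) for _ in range(pop_size)]

    def score(selection):
        return fitness(selection, resources, needed)

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 2]  # keep the better half unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), needed, rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    best = max(population, key=score)
    return [r for bit, r in zip(best, resources) if bit]


if __name__ == "__main__":
    # Hypothetical candidate pool; names and numbers are made up for illustration.
    pool = [
        Resource("node-a", 2.0, 0.95),
        Resource("node-b", 3.0, 0.60),
        Resource("node-c", 1.5, 0.99),
        Resource("node-d", 2.5, 0.90),
    ]
    for r in select_resources(pool, needed=2):
        print(r.name)
```

A real resource manager would presumably score candidates on measured quantities such as load, bandwidth, or predicted running time; the weighting used here is only a placeholder for such a model.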

HwaMin Lee received her B.S. and M.E. degrees in computer science education from Korea University, Seoul, in 2000 and 2002, respectively. She is currently a Ph.D. candidate in computer science education at Korea University. Her research interests are in grid computing, distributed computing, fault-tolerant systems, and multi-agent systems.

KwangSik Chung received his B.S. (1992), M.S. (1995), and Ph.D. (2000) degrees in computer science and engineering from Korea University. He is currently a senior consultant at Samsung SDS Ltd. From September 2002 to November 2003, he was also a research fellow in the Department of Computer Science at University College London. His research interests include distributed systems, fault-tolerant systems, and grid computing systems.

SungHo Chin received his B.S. and M.E. degrees in computer science education from Korea University, Seoul, in 2002 and 2004, respectively. He is currently a Ph.D. student in computer science education at Korea University. His research interests are in grid computing, distributed systems, and genetic algorithms.

JongHyuk Lee received his B.S. degree in computer science education from Korea University, Seoul, in 2004. He is currently a Master's student in computer science education at Korea University. His research interests are in grid computing and distributed systems.

DaeWon Lee received his B.E. degree in electrical engineering from SoonChunHyang University and his M.E. degree in computer science education from Korea University, Korea, in 2001 and 2003, respectively. He is currently a Ph.D. candidate in computer science education at Korea University. His research interests are in grid computing, distributed systems, and mobile computing.

Seongbin Park received his bachelor's degree from the Department of Computer Science at Korea University in Seoul, Korea, and both his master's and doctoral degrees from the Department of Computer Science at the University of Southern California. He is currently an assistant professor in the Department of Computer Science Education at Korea University in Seoul, Korea. His research interests include hypermedia, programming education, Semantic Web, music information retrieval, algorithms, and computer science education.

HeonChang Yu received his B.S., M.S., and Ph.D. degrees in computer science from Korea University, Seoul, in 1989, 1991, and 1994, respectively. He is currently a professor in the Department of Computer Science Education at Korea University in Korea. He was a visiting professor at Georgia Institute of Technology in 2004. His research interests are in grid computing, distributed computing, and fault-tolerant systems.

This research was supported by a Korea University Grant (2004).
