A resource management and fault tolerance services in grid computing

doi:10.1016/j.jpdc.2005.05.026

Journal of Parallel and Distributed Computing

Volume 65, Issue 11, November 2005, Pages 1305-1317

https://doi.org/10.1016/j.jpdc.2005.05.026

Abstract

In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager uses a genetic algorithm to automatically select, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that the submitted jobs complete and that the performance of job execution is improved by job migration even if some failures occur.
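The abstract does not give the encoding or fitness function used by the resource manager, so the following is only a minimal sketch of how a genetic algorithm might pick a fixed-size set of resources from a candidate pool. Everything in it is an assumption made for illustration: the function name `select_resources`, the per-resource `speeds` and `availabilities` inputs, and the fitness (sum of speed times availability) are hypothetical and are not taken from the paper.

```python
import random


def select_resources(speeds, availabilities, k,
                     generations=60, pop_size=30, seed=0):
    """Return a sorted tuple of k resource indices chosen by a simple GA.

    Hypothetical fitness (an assumption, not the paper's): the sum of
    speed * availability over the chosen resources.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    n = len(speeds)

    def fitness(subset):
        return sum(speeds[i] * availabilities[i] for i in subset)

    def random_subset():
        return tuple(sorted(rng.sample(range(n), k)))

    def crossover(a, b):
        # The child draws k distinct resources from the union of both parents.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, k)))

    def mutate(subset, rate=0.2):
        # With probability `rate`, swap one chosen resource for an unused one.
        subset = list(subset)
        if rng.random() < rate:
            unused = [i for i in range(n) if i not in subset]
            if unused:
                subset[rng.randrange(k)] = rng.choice(unused)
        return tuple(sorted(subset))

    population = [random_subset() for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: keep the better half, refill with mutated offspring.
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return max(population, key=fitness)


if __name__ == "__main__":
    speeds = [1.0, 2.0, 3.0, 4.0, 2.5, 1.5]
    availabilities = [0.9, 0.5, 0.8, 0.1, 0.7, 0.95]
    print(select_resources(speeds, availabilities, k=3))
```

Because the better half of each generation is carried over unchanged, the best fitness found never decreases from one generation to the next. A realistic fitness would need to reflect whatever performance and availability model the resource manager actually uses.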

HwaMin Lee received her B.S. and M.E. degrees in computer science education from Korea University, Seoul, in 2000 and 2002, respectively. She is currently a Ph.D. candidate in computer science education at Korea University. Her research interests are in grid computing, distributed computing, fault-tolerant systems, and multi-agent systems.

Cited by (31)

Automatic Fault Tolerant Software System for Desktop Grid Middleware
2016, Procedia Computer Science
Fault tolerance is a crucial constituent for research in desktop grid. The focus of this research paper will be on development of Fault tolerant software system taking care of computational power. Resources of alchemi grid necessitate for assemble available computational power in grid. The Alchemi desktop grid is important middleware for gather computational power by executors on different nodes. Executor related faults and failures can stop running grid any time. Executer flaws are exceptionally crucial in alchemi desktop grid middleware. The Available computational power is dependent on the number of executors. Alchemi Grid provides manual procedure for control on executors. The middleware has not integrated automatic system to control execution level deficiencies. This issue has not been addressed in alchemi desktop middleware. Today, we need an automatic software technique for reliable and consistent working of computational grid. This Research work has projected, designed and developed automatic software system to control the executor faults in alchemi middleware. Normal and defective executor nodes can be distinguished by regular monitoring software system. Automatic software system is helpful for monitoring and controlling the executor faults in Alchemi middleware. Executor can start and stop by automated system in milliseconds. Control on executor will put impact on available computational power in grid. Proposed automated software system has skilled to sense faulty executor node and correct the fault by start process on best Available node. Automated framework is capable to remove fault in executer node. Regular monitor system has used for the development of automated fault tolerant software system. The Best available node can be selected on the basis of memory usage or processing power usage on remote node.
A fault-tolerant scheduling system for computational grids
2012, Computers and Electrical Engineering
Citation Excerpt:
Providing fault-tolerant in a grid environment, while optimizing resource scheduling and job execution, is a challenging task [2]. In computational grids, fault management is a very important and difficult problem for grid application developers [6]. Grid applications must have fault-tolerant services that detect faults and resolve them.
Fault-tolerant scheduling is an important issue for computational grid systems, as grids typically consist of strongly varying and geographically distributed resources. The main scheduling strategy of most fault-tolerant scheduling systems depends on the response time and fault index when selecting a resource to execute a certain job.
In this paper, a scheduling system is presented that depends on a new factor called scheduling indicator in selecting resources. This factor comprises of the response time and the failure rate of grid resources. Whenever a grid scheduler has jobs to schedule on grid resources, it uses the scheduling indicator to generate the scheduling decisions. The main scheduling strategy of the system is to select resources that have the lowest tendency to fail. Extensive simulation experiments are conducted to quantify the performance of the proposed system. Experiments have shown that the proposed system can considerably improve grid performance in terms of throughput, unavailability, turnaround time, and fail tendency.
Service monitoring and differentiation techniques for resource allocation in the grid, on the basis of the level of service
2011, Future Generation Computer Systems
Citation Excerpt:
In consequence, no meta-scheduler can guarantee that service providers will bind to the requirements of the jobs. Migration is generally accepted as the solution to correct a job that is incorrectly scheduled to run on a given resource that does not actually provide the level of service required by the job [6–10]. The process of migration of a running job to a different resource usually includes moving the required data, and possibly restarting the job from the beginning [11].
The study of meta-scheduling of jobs in the Grid has been a recurrent topic on the literature. Many tools have been developed for allocating the jobs to the most appropriate resources according to specific application needs while balancing the workload among resources. However, few of them focus on evaluating the level of service attained by providers of Grid services. On the contrary, Grid generally offers best-effort service to all the applications. Without Quality of Service (QoS), the jobs are executed without any guarantee on the execution time or throughput. Also, the support for qualitative attributes, such as security, is limited to the information provided by the resources. The uncertainty on the final performance is an issue that affects user satisfaction, reducing the interest of Grid infrastructures. In this paper, we present GRIDIFF, a software architecture that covers all necessary steps to integrate QoS into the process of allocating resources for the execution of jobs in the Grid. GRIDIFF has been used in practice to differentiate the resources according to the fulfillment of the 100% of the QoS considering three groups with failure rates of 9.1%, 13.3% and 64.1%, respectively. Scalability has been evaluated through simulation in up to 7000 nodes that can handle up to 250 monitoring agents per node.
A framework for credit-driven smart manufacturing service configuration based on complex networks
2022, International Journal of Computer Integrated Manufacturing
Fuzzy Logic-based Robust Failure Handling Mechanism for Fog Computing
2021, arXiv
Trust and fault tolerance models in cloud computing: A review
2019, International Journal of Scientific and Technology Research

KwangSik Chung received his B.S. (1992), M.S. (1995), and Ph.D. (2000) degrees in computer science and engineering from Korea University. He is currently a senior consultant at Samsung SDS Ltd. From September 2002 to November 2003, he was also a research fellow in the Department of Computer Science at University College London. His research interests include distributed systems, fault-tolerant systems, and grid computing systems.

SungHo Chin received his B.S. and M.E. degrees in computer science education from Korea University, Seoul, in 2002 and 2004, respectively. He is currently a Ph.D. student in computer science education at Korea University. His research interests are in grid computing, distributed systems, and genetic algorithms.

JongHyuk Lee received his B.S. degree in computer science education from Korea University, Seoul, in 2004. He is currently a master's student in computer science education at Korea University. His research interests are in grid computing and distributed systems.

DaeWon Lee received his B.E. degree in electrical engineering from SoonChunHyang University and his M.E. degree in computer science education from Korea University, Korea, in 2001 and 2003, respectively. He is currently a Ph.D. candidate in computer science education at Korea University. His research interests are in grid computing, distributed systems, and mobile computing.

Seongbin Park received his bachelor's degree from the Department of Computer Science at Korea University in Seoul, Korea, and both his master's and doctoral degrees from the Department of Computer Science at the University of Southern California. He is currently an assistant professor in the Department of Computer Science Education at Korea University in Seoul, Korea. His research interests include hypermedia, programming education, Semantic Web, music information retrieval, algorithms, and computer science education.

HeonChang Yu received his B.S., M.S., and Ph.D. degrees in computer science from Korea University, Seoul, in 1989, 1991, and 1994, respectively. He is currently a professor in the Department of Computer Science Education at Korea University in Korea. He was a visiting professor at Georgia Institute of Technology in 2004. His research interests are in grid computing, distributed computing, and fault-tolerant systems.

☆ This research was supported by a Korea University Grant (2004).
