Abstract
The key issues in autonomic job recovery for cluster computing are recognizing job failure, understanding the failure well enough to know whether and how to restart the job, and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The Agent Based High Availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster computing environments. Its agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy’s STAR project.
References
Chess, D., Kephart, J.: The Vision of Autonomic Computing. IEEE Computer Magazine, 41-50 (2003)
STAR experiment website, http://www.star.bnl.gov/
Condor project website, http://www.cs.wisc.edu/condor/
Platform Computing LSF, http://www.platform.com
Ganglia project website, http://ganglia.sourceforge.net/
JESS website, http://herzberg.ca.sandia.gov/jess
Parallel Distributed Systems Facility website, http://www.nersc.gov/nusers/resources/PDSF/
Copyright information
© 2004 IFIP International Federation for Information Processing
About this paper
Cite this paper
Earl, C., Remolina, E., Ong, J., Brown, J., Kuszmaul, C., Stone, B. (2004). ABHA: A Framework for Autonomic Job Recovery. In: Sahai, A., Wu, F. (eds) Utility Computing. DSOM 2004. Lecture Notes in Computer Science, vol 3278. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30184-4_23
DOI: https://doi.org/10.1007/978-3-540-30184-4_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23631-3
Online ISBN: 978-3-540-30184-4
eBook Packages: Springer Book Archive