ABHA: A Framework for Autonomic Job Recovery

Earl, Charles; Remolina, Emilio; Ong, Jim; Brown, John; Kuszmaul, Chris; Stone, Brad

doi:10.1007/978-3-540-30184-4_23

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3278)

Included in the following conference series:

International Workshop on Distributed Systems: Operations and Management

Abstract

Key issues in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure well enough to know whether and how to restart the job; and rapidly integrating this information into the cluster architecture so that the failure is better mitigated in the future. The Agent Based High Availability (ABHA) system provides an API and a collection of services for building autonomic batch job recovery into cluster computing environments. Its agent API allows users to define agents for failure diagnosis and recovery. ABHA is currently being evaluated in the U.S. Department of Energy’s STAR project.
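The three issues named in the abstract (recognizing a failure, diagnosing it well enough to decide on a restart, and recording what was learned) can be illustrated with a minimal sketch. This is not ABHA's actual API, which the abstract does not describe; every name here (`JobReport`, `Diagnosis`, `RecoveryManager`, the two example agents, and the log strings they match) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class JobReport:
    """Hypothetical summary of a finished batch job."""
    job_id: str
    exit_code: int
    log_tail: str


@dataclass
class Diagnosis:
    """Hypothetical result of a diagnosis agent."""
    cause: str
    restart: bool  # whether restarting the job is expected to help


# A diagnosis agent inspects a failed job and returns a Diagnosis,
# or None when it does not recognize the failure.
DiagnosisAgent = Callable[[JobReport], Optional[Diagnosis]]


def disk_full_agent(report: JobReport) -> Optional[Diagnosis]:
    # Assumed failure signature: restarting will not help until space is freed.
    if "No space left on device" in report.log_tail:
        return Diagnosis(cause="disk_full", restart=False)
    return None


def node_lost_agent(report: JobReport) -> Optional[Diagnosis]:
    # Assumed failure signature: a transient fault, so a restart is reasonable.
    if "connection reset" in report.log_tail.lower():
        return Diagnosis(cause="node_lost", restart=True)
    return None


@dataclass
class RecoveryManager:
    """Runs user-defined agents in order and records what they find."""
    agents: List[DiagnosisAgent]
    history: Dict[str, int] = field(default_factory=dict)

    def handle(self, report: JobReport) -> str:
        # Step 1: recognize failure (here, simply a nonzero exit code).
        if report.exit_code == 0:
            return "ok"
        # Step 2: ask each agent for a diagnosis; the first match wins.
        for agent in self.agents:
            diagnosis = agent(report)
            if diagnosis is not None:
                # Step 3: record the cause so later decisions can use it.
                self.history[diagnosis.cause] = self.history.get(diagnosis.cause, 0) + 1
                return "restart" if diagnosis.restart else "hold"
        # No agent recognized the failure: count it and hand off to an operator.
        self.history["unknown"] = self.history.get("unknown", 0) + 1
        return "escalate"
```

In this sketch a user adds a new failure type by writing one more function with the `DiagnosisAgent` signature and appending it to `agents`; the `history` counter stands in for whatever mechanism a real system would use to feed diagnoses back into the cluster.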

Author information

Authors and Affiliations

Stottler Henke Associates, USA
Charles Earl, Emilio Remolina & Jim Ong
Pentum Group, Inc., USA
John Brown, Chris Kuszmaul & Brad Stone

Editor information

Editors and Affiliations

HP Laboratories, Palo-Alto, CA, USA
Akhil Sahai
Security Lab, Dept. of Computer Science, Univ. of California, One Shields Ave., 95616, Davis, CA, USA
Felix Wu

About this paper

Cite this paper

Earl, C., Remolina, E., Ong, J., Brown, J., Kuszmaul, C., Stone, B. (2004). ABHA: A Framework for Autonomic Job Recovery. In: Sahai, A., Wu, F. (eds) Utility Computing. DSOM 2004. Lecture Notes in Computer Science, vol 3278. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30184-4_23

DOI: https://doi.org/10.1007/978-3-540-30184-4_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23631-3
Online ISBN: 978-3-540-30184-4
eBook Packages: Springer Book Archive
