Self healing in System-S

Jacques-Silva, Gabriela; Challenger, Jim; Degenaro, Lou; Giles, James; Wagle, Rohit

doi:10.1007/s10586-008-0057-8

Published: 28 May 2008

Cluster Computing, Volume 11, pages 247–257 (2008)

Gabriela Jacques-Silva¹, Jim Challenger², Lou Degenaro², James Giles² & Rohit Wagle²


Abstract

Faults in a cluster are inevitable. The larger the cluster, the more likely the occurrence of some failure in hardware, in software, or by human error. System-S software must detect and self-repair failures while carrying out its prime directive—enabling stream processing program fragments to be distributed and connected to form complex applications. Depending on the type of failure, System-S may be able to continue with little or no disruption to potentially tens of thousands of interdependent and heterogeneous program fragments running across thousands of nodes.

We extend the work we previously presented on the self healing nature of the job manager component in System-S by presenting how it can handle failures of other system components, applications and network infrastructure. We also evaluate the recoverability of the job management orchestrator component of System-S, considering crash failures with and without error propagation.

References

Amini, L., Jain, N., Sehgal, A., Silber, J., Verscheure, O.: Adaptive control of extreme-scale stream processing systems. In: ICDCS ’06: Proceedings of the 26th IEEE International Conference on Distributed Computing Systems, Washington, DC, USA, p. 71 (2006)
Balazinska, M., Balakrishnan, H., Madden, S., Stonebraker, M.: Fault-tolerance in the Borealis distributed stream processing system. In: Proc. of ACM SIGMOD ’05, New York, NY, USA, pp. 13–24 (2005)
Bauer, C., King, G.: Hibernate in Action. Manning Publications, New York (2005)
Bohra, A., Neamtiu, I., Sultan, F.: Remote repair of operating system state using backdoors. In: Proc. of ICAC ’04, pp. 256–263. IEEE Computer Society, Washington (2004)
Bolour, A.: Notes on the eclipse plug-in architecture. http://www.eclipse.org/articles/Article-Plug-in-architecture/plugin_architecture.html
Bronevetsky, G., Fernandes, R., Marques, D., Pingali, K., Stodghill, P.: Recent advances in checkpoint/recovery systems. In: Workshop on NSF Next Generation Software (2006)
Cha, H., Rudnick, E.M., Patel, J.H., Iyer, R.K., Choi, G.S.: A gate-level simulation environment for alpha-particle-induced transient faults. IEEE Trans. Comput. 45(11), 1248–1256 (1996)
Choi, G.S., Iyer, R.K., Saab, D.G.: Fault behavior dictionary for simulation of device-level transients. In: ICCAD ’93: Proceedings of the 1993 IEEE/ACM international conference on Computer-aided design, pp. 6–9. IEEE Computer Society Press, Los Alamitos (1993)
Cooper, B.F., Schwan, K.: Distributed stream management using utility-driven self-adaptive middleware. In: Proc. of ICAC ’05, Washington, DC, USA, pp. 3–14 (2005)
Douglis, F., Branson, M., Hildrum, K., Rong, B., Ye, F.: Multi-site cooperative data stream analysis. SIGOPS Oper. Syst. Rev. 40(3), 31–37 (2006)
Etsion, Y., Tsafrir, D.: A short survey of commercial batch schedulers. Technical Report 2005-13, Hebrew University (2005)
Hansen, J.G., Christiansen, E., Jul, E.: The laundromat model for autonomic cluster computing. In: Proc. of ICAC ’06, June 2006, pp. 114–123 (2006)
Iyer, R.K., Rossetti, D.J., Hsueh, M.C.: Measurement and modeling of computer reliability as affected by system activity. ACM Trans. Comput. Syst. 4(3), 214–237 (1986)
Jacques-Silva, G., Challenger, J., Degenaro, L., Giles, J., Wagle, R.: Towards autonomic fault recovery in system-s. In: ICAC ’07: Proceedings of the Fourth International Conference on Autonomic Computing, p. 31. IEEE Computer Society, Washington (2007)
Jain, N., Amini, L., Andrade, H., King, R., Park, Y., Selo, P., Venkatramani, C.: Design, implementation, and evaluation of the linear road benchmark on the stream processing core. In: Proc. of ACM SIGMOD ’06, pp. 431–442. ACM, New York (2006)
Kephart, J.O., Chess, D.M.: The vision of autonomic computing. IEEE Comput. 36(1), 41–50 (2003)
Lee, H.-H.S., Gu, G., Mudge, T.N.: An intrusion-tolerant and self-recoverable network service system using a security enhanced chip multiprocessor. In: Proc. of ICAC ’05, Washington, DC, USA, pp. 263–273 (2005)
Litzkow, M.J., Livny, M., Mutka, M.W.: Condor–a hunter of idle workstations. In: 8th International Conference on Distributed Computing Systems, pp. 104–111 (1988)
Whisnant, K., Iyer, R.K., Kalbarczyk, Z.T., Jones, P.H., III, Rennels, D.A., Some, R.: The effects of an armor-based sift environment on the performance and dependability of user applications. IEEE Trans. Softw. Eng. 30(4), 257–277 (2004)

Author information

Authors and Affiliations

Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, 1308 W. Main St., Urbana, IL, 61820, USA
Gabriela Jacques-Silva
IBM T.J. Watson Research Center, IBM Research, 19 Skyline Dr., Hawthorne, NY, 10532, USA
Jim Challenger, Lou Degenaro, James Giles & Rohit Wagle

Corresponding author

Correspondence to Gabriela Jacques-Silva.

About this article

Cite this article

Jacques-Silva, G., Challenger, J., Degenaro, L. et al. Self healing in System-S. Cluster Comput 11, 247–257 (2008). https://doi.org/10.1007/s10586-008-0057-8

Received: 05 April 2008
Accepted: 24 April 2008
Published: 28 May 2008
Issue Date: September 2008
DOI: https://doi.org/10.1007/s10586-008-0057-8
