Self healing in System-S

Cluster Computing

Abstract

Faults in a cluster are inevitable. The larger the cluster, the more likely a failure becomes, whether in hardware, in software, or through human error. System-S software must detect and self-repair failures while carrying out its prime directive: enabling stream processing program fragments to be distributed and connected to form complex applications. Depending on the type of failure, System-S may be able to continue with little or no disruption to potentially tens of thousands of interdependent and heterogeneous program fragments running across thousands of nodes.

We extend the work we previously presented on the self healing nature of the job manager component in System-S by presenting how it can handle failures of other system components, applications and network infrastructure. We also evaluate the recoverability of the job management orchestrator component of System-S, considering crash failures with and without error propagation.



Author information


Correspondence to Gabriela Jacques-Silva.

About this article

Jacques-Silva, G., Challenger, J., Degenaro, L. et al. Self healing in System-S. Cluster Comput 11, 247–257 (2008). https://doi.org/10.1007/s10586-008-0057-8
