Abstract
Faults in a cluster are inevitable. The larger the cluster, the more likely it is that some failure will occur, whether in hardware, in software, or through human error. System-S software must detect and self-repair failures while carrying out its prime directive: enabling stream processing program fragments to be distributed and connected to form complex applications. Depending on the type of failure, System-S may be able to continue with little or no disruption to potentially tens of thousands of interdependent and heterogeneous program fragments running across thousands of nodes.
We extend our previously presented work on the self-healing nature of the job manager component in System-S by showing how it can handle failures of other system components, applications, and network infrastructure. We also evaluate the recoverability of the job management orchestrator component of System-S, considering crash failures with and without error propagation.
Cite this article
Jacques-Silva, G., Challenger, J., Degenaro, L. et al. Self healing in System-S. Cluster Comput 11, 247–257 (2008). https://doi.org/10.1007/s10586-008-0057-8
DOI: https://doi.org/10.1007/s10586-008-0057-8