Abstract
Fault-tolerance mechanisms become indispensable as the number of processors in large systems increases, so the effectiveness of such mechanisms must be measured before they are implemented. This requires understanding how different network parameters affect dependability measures such as mean time to network failure and availability. In this paper we analyse these effects in detail using a methodology we proposed previously, which is based on Markov chains and analysis of variance (ANOVA) techniques. As a case study, we analyse the effects of network size, mean time to node failure, mean time to node repair, mean time to network repair, and failure coverage on a 2D mesh network with a fault-tolerant mechanism (similar to the one used in the BlueGene/L system) that is able to remove rows and/or columns in the presence of failures.
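To make the Markov-chain side of such an analysis concrete, the following is a minimal sketch of how a mean time to network failure can be computed from an absorbing continuous-time Markov chain. It is not the model used in the paper: the state space (number of failed nodes), the parameter `max_tolerated` standing in for how many failures the reconfiguration can absorb, and the function names are all assumptions made for illustration. Each node fails at rate 1/MTTF and is repaired at rate 1/MTTR; a failure is handled successfully with probability `coverage`, otherwise the network fails.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def mean_time_to_network_failure(n_nodes, mttf_node, mttr_node,
                                 coverage, max_tolerated):
    """Expected time to absorption, starting with no failed nodes.

    Hypothetical model: state k = number of failed nodes (0..max_tolerated).
    A covered failure moves k -> k+1, a repair moves k -> k-1, and an
    uncovered failure, or any failure in state max_tolerated, is absorbing.
    """
    lam = 1.0 / mttf_node   # per-node failure rate
    mu = 1.0 / mttr_node    # per-node repair rate
    m = max_tolerated + 1   # number of transient states
    A = [[0.0] * m for _ in range(m)]
    b = [1.0] * m
    for k in range(m):
        fail = (n_nodes - k) * lam   # total failure rate of healthy nodes
        repair = k * mu              # total repair rate of failed nodes
        # (fail + repair) * T[k] = 1 + coverage*fail*T[k+1] + repair*T[k-1]
        A[k][k] = fail + repair
        if k + 1 < m:
            A[k][k + 1] -= coverage * fail
        if k > 0:
            A[k][k - 1] -= repair
    return solve(A, b)[0]


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    for c in (0.90, 0.99, 1.00):
        t = mean_time_to_network_failure(64, 10000.0, 24.0, c, 3)
        print(f"coverage={c:.2f}  mean time to network failure={t:.1f}")
```

Evaluating a function like this over a grid of parameter values (network size, node MTTF, node MTTR, coverage) would produce the kind of response table to which an analysis of variance could then be applied.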
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chirivella, V., Alcover, R., Flich, J., Duato, J. (2009). Dependability Analysis of a Fault-Tolerant Network Reconfiguring Strategy. In: Sips, H., Epema, D., Lin, HX. (eds) Euro-Par 2009 Parallel Processing. Euro-Par 2009. Lecture Notes in Computer Science, vol 5704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03869-3_96
DOI: https://doi.org/10.1007/978-3-642-03869-3_96
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03868-6
Online ISBN: 978-3-642-03869-3
eBook Packages: Computer Science (R0)