Abstract
The intensive and continuous use of high-performance computers to run computationally intensive applications, coupled with the large number of components they contain, dramatically increases the likelihood of failures during operation.
The interconnection network is a critical part of high-performance computer systems: it links the processing units and carries the communication between them. Network faults have an extremely high impact because a single fault may prevent applications from completing correctly.
This work addresses fault tolerance in high-speed interconnection networks by designing a fault-tolerant routing method. The goal is to tolerate a certain number of link and node failures, taking into account their impact and probability of occurrence. To accomplish this, we take advantage of communication path redundancy by means of adaptive multipath routing approaches that fulfill the four phases of fault tolerance: error detection, damage confinement, error recovery, and fault treatment and continuous service. Experiments show that our method allows applications to complete their execution successfully in the presence of several faults, with an average performance of 97% relative to the fault-free scenarios.
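The idea of exploiting path redundancy can be illustrated with a minimal sketch. It is not the routing method of the paper: it assumes a hypothetical 2x2 mesh (`MESH`) and a hypothetical helper `find_path`, and it simply searches for a shortest route that skips links marked as failed. A node failure could be modeled in this sketch by marking all links of that node as failed.

```python
from collections import deque

def find_path(adjacency, src, dst, failed_links=frozenset()):
    """Breadth-first search for a shortest path that avoids failed links.

    Illustrative only: `adjacency` maps each node to its neighbors and
    `failed_links` is a collection of (node, node) pairs (undirected).
    """
    failed = {frozenset(link) for link in failed_links}
    parents = {src: None}          # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent chain back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor in parents or frozenset((node, neighbor)) in failed:
                continue
            parents[neighbor] = node
            queue.append(neighbor)
    return None                    # no fault-free path exists

# Hypothetical 2x2 mesh: two disjoint routes connect node 0 and node 3.
MESH = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

print(find_path(MESH, 0, 3))                        # fault-free route
print(find_path(MESH, 0, 3, {(1, 3)}))              # detour around one failed link
print(find_path(MESH, 0, 3, {(1, 3), (2, 3)}))      # destination unreachable
```

With one failed link the search falls back to the alternative route, which is the redundancy the abstract refers to; once both routes are cut, no recovery is possible in this topology.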
Supported by the MEC-Spain under contract TIN2007-64974.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zarza, G., Lugones, D., Franco, D., Luque, E. (2009). A Multipath Fault-Tolerant Routing Method for High-Speed Interconnection Networks. In: Sips, H., Epema, D., Lin, HX. (eds) Euro-Par 2009 Parallel Processing. Euro-Par 2009. Lecture Notes in Computer Science, vol 5704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03869-3_99
DOI: https://doi.org/10.1007/978-3-642-03869-3_99
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03868-6
Online ISBN: 978-3-642-03869-3
eBook Packages: Computer Science, Computer Science (R0)