Abstract
In large cluster-based machines, fault tolerance in the interconnection network is an issue of growing importance, since their increasing size raises the probability of failure. The topology used in these machines is usually a fat-tree. This paper proposes a new distributed fault-tolerant routing methodology for fat-trees. It does not require additional network hardware. It is scalable, since the required memory, switch hardware and routing delay do not depend on the network size. The methodology is based on enhancing the Interval Routing scheme with exclusion intervals. An exclusion interval is associated with each switch output port and represents the set of nodes that are unreachable from this port after a failure occurs. We propose a mechanism to identify the exclusion intervals that must be updated after a failure is detected, and the values to write to them. Our methodology is able to support a relatively high number of network failures with low degradation in network performance.
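The role of an exclusion interval in the port-selection step can be illustrated with a minimal sketch. It is not the paper's implementation: the node numbering, the port names, the `OutputPort` class, the `eligible_ports` helper and the choice of one exclusion interval per port are all assumptions made for illustration. A port is treated as usable for a destination when the destination lies in the port's routing interval and outside its exclusion interval.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # inclusive [low, high] range of node ids (assumed numbering)


@dataclass
class OutputPort:
    name: str                             # hypothetical port label
    reach: Interval                       # nodes routable through this port
    exclusion: Optional[Interval] = None  # nodes unreachable after a failure


def in_interval(node: int, interval: Optional[Interval]) -> bool:
    """Return True if node falls inside the inclusive interval."""
    if interval is None:
        return False
    low, high = interval
    return low <= node <= high


def eligible_ports(ports: List[OutputPort], dest: int) -> List[str]:
    """Ports whose routing interval contains dest and whose exclusion interval does not."""
    return [
        p.name
        for p in ports
        if in_interval(dest, p.reach) and not in_interval(dest, p.exclusion)
    ]


# Hypothetical switch in a 16-node network: one down port, two equivalent up ports.
ports = [
    OutputPort("down0", (0, 3)),
    OutputPort("up0", (4, 15)),
    OutputPort("up1", (4, 15)),
]

before = eligible_ports(ports, 9)   # both up ports can be used
ports[1].exclusion = (8, 11)        # assume a failure cuts nodes 8..11 off from up0
after = eligible_ports(ports, 9)    # only up1 remains for destination 9
```

In this sketch the per-port state is a fixed pair of intervals, so its size does not grow with the number of nodes, which is consistent with the scalability argument made above. How the affected ports and interval values are determined after a failure is the subject of the paper and is not modeled here.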
This work was supported by the Spanish MEC under Grant TIN2006-15516-C04-01, by CONSOLIDER-INGENIO 2010 under Grant CSD2006-00046 and by the European Commission in the context of the SCALA integrated project #27648 (FP6).
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gómez, C., Gómez, M.E., López, P., Duato, J. (2007). An Efficient Fault-Tolerant Routing Methodology for Fat-Tree Interconnection Networks. In: Stojmenovic, I., Thulasiram, R.K., Yang, L.T., Jia, W., Guo, M., de Mello, R.F. (eds) Parallel and Distributed Processing and Applications. ISPA 2007. Lecture Notes in Computer Science, vol 4742. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74742-0_46
DOI: https://doi.org/10.1007/978-3-540-74742-0_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74741-3
Online ISBN: 978-3-540-74742-0
eBook Packages: Computer Science, Computer Science (R0)