
An Efficient Fault-Tolerant Routing Methodology for Fat-Tree Interconnection Networks

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4742)

Abstract

In large cluster-based machines, fault tolerance in the interconnection network is an issue of growing importance, since their increasing size raises the probability of failure. The topology used in these machines is usually a fat-tree. This paper proposes a new distributed fault-tolerant routing methodology for fat-trees. It does not require additional network hardware. It is scalable, since the required memory, switch hardware and routing delay do not depend on the network size. The methodology is based on enhancing the Interval Routing scheme with exclusion intervals. Exclusion intervals are associated with each switch output port and represent the set of nodes that become unreachable through that port after a failure occurs. We propose a mechanism to identify the exclusion intervals that must be updated after a failure is detected, and the values to write to them. Our methodology is able to support a relatively high number of network failures with low degradation in network performance.
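The idea of pairing each output port's routing interval with exclusion intervals can be illustrated with a minimal sketch. It is not the authors' implementation: the names `OutputPort`, `accepts` and `candidate_ports`, the use of inclusive integer intervals, and the unbounded list of exclusion intervals per port are all assumptions made for illustration, since the abstract does not specify these details.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Inclusive [low, high] range of destination node ids (assumed representation).
Interval = Tuple[int, int]


@dataclass
class OutputPort:
    # Destinations normally routed through this port (Interval Routing).
    routing: Interval
    # Destinations that became unreachable through this port after a failure.
    exclusions: List[Interval] = field(default_factory=list)

    def accepts(self, dest: int) -> bool:
        """True if dest is in the routing interval and in no exclusion interval."""
        low, high = self.routing
        if not (low <= dest <= high):
            return False
        return not any(lo <= dest <= hi for lo, hi in self.exclusions)


def candidate_ports(ports: List[OutputPort], dest: int) -> List[int]:
    """Indices of the output ports that may still be used to reach dest."""
    return [i for i, port in enumerate(ports) if port.accepts(dest)]


# Hypothetical switch with two upward ports in a 16-node network.
# Both ports normally reach every node; a failure has cut nodes 4..7
# off from port 0, so that range is written as an exclusion interval.
ports = [
    OutputPort(routing=(0, 15), exclusions=[(4, 7)]),
    OutputPort(routing=(0, 15)),
]

print(candidate_ports(ports, 5))  # only port 1 remains usable
print(candidate_ports(ports, 9))  # both ports remain usable
```

In this sketch the per-port check uses only state local to the switch, which is consistent with the abstract's claim that memory and routing delay do not depend on network size, provided the number of exclusion intervals per port is bounded.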

This work was supported by the Spanish MEC under Grant TIN2006-15516-C04-01, by CONSOLIDER-INGENIO 2010 under Grant CSD2006-00046 and by the European Commission in the context of the SCALA integrated project #27648 (FP6).




Editor information

Ivan Stojmenovic, Ruppa K. Thulasiram, Laurence T. Yang, Weijia Jia, Minyi Guo, Rodrigo Fernandes de Mello


Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gómez, C., Gómez, M.E., López, P., Duato, J. (2007). An Efficient Fault-Tolerant Routing Methodology for Fat-Tree Interconnection Networks. In: Stojmenovic, I., Thulasiram, R.K., Yang, L.T., Jia, W., Guo, M., de Mello, R.F. (eds) Parallel and Distributed Processing and Applications. ISPA 2007. Lecture Notes in Computer Science, vol 4742. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74742-0_46


  • DOI: https://doi.org/10.1007/978-3-540-74742-0_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74741-3

  • Online ISBN: 978-3-540-74742-0

  • eBook Packages: Computer Science, Computer Science (R0)
