Abstract
In large high-performance computing systems, the probability of component failure is high. At the same time, sustaining system performance often requires reconfiguration to ensure high utilization of the available resources. Reconfiguration in interconnection networks, such as InfiniBand (IB), typically involves computing and distributing a new set of routes in order to maintain connectivity and performance. In general, current routing algorithms do not consider the existing routes in a network when calculating new ones. Such configuration-oblivious routing might result in substantial modifications to the existing paths, and the reconfiguration becomes costly as it potentially involves a large number of source–destination pairs. In this paper, we propose SlimUpdate, a novel routing algorithm for IB-based fat-tree topologies. SlimUpdate employs path preservation techniques to reduce the total number of path modifications by up to 80 %, compared with OpenSM’s fat-tree routing algorithm, in most reconfiguration scenarios. Furthermore, we present a metabase-aided re-routing method for fat-trees, based on destination leaf-switch multipathing. The proposed method significantly reduces network reconfiguration overhead while providing greater routing flexibility. On successive runs, it saves up to 85 % of the total routing time compared with the traditional re-routing scheme. Based on the metabase-aided routing, we also present a modified SlimUpdate routing algorithm that dynamically optimizes routes for a given MPI node order.
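To make the notion of path modifications concrete, the following minimal sketch counts the forwarding entries that differ between an old and a new routing. It assumes a simplified model in which each switch holds a table mapping a destination LID to an output port, as in an IB linear forwarding table. The function name `changed_entries`, the switch names, and the example tables are hypothetical and are not taken from SlimUpdate or OpenSM; the counting rule is an illustration and may differ from the metric used in the paper.

```python
from typing import Dict, Set, Tuple

# Hypothetical model: switch name -> {destination LID -> output port}.
ForwardingTables = Dict[str, Dict[int, int]]


def changed_entries(old: ForwardingTables,
                    new: ForwardingTables) -> Set[Tuple[str, int]]:
    """Return the (switch, destination LID) entries whose output port differs."""
    changed = set()
    for switch in old.keys() | new.keys():
        old_lft = old.get(switch, {})
        new_lft = new.get(switch, {})
        for lid in old_lft.keys() | new_lft.keys():
            # A missing entry on either side also counts as a modification.
            if old_lft.get(lid) != new_lft.get(lid):
                changed.add((switch, lid))
    return changed


# Illustrative routing before reconfiguration (made-up values).
old_routes = {
    "leaf1": {1: 1, 2: 2, 3: 3, 4: 3},
    "leaf2": {1: 3, 2: 3, 3: 1, 4: 2},
}

# A re-route that ignores the old routes moves LIDs 3 and 4 on leaf1.
oblivious_routes = {
    "leaf1": {1: 1, 2: 2, 3: 4, 4: 4},
    "leaf2": {1: 3, 2: 3, 3: 1, 4: 2},
}

# A re-route that preserves existing paths where possible moves only LID 4.
preserving_routes = {
    "leaf1": {1: 1, 2: 2, 3: 3, 4: 4},
    "leaf2": {1: 3, 2: 3, 3: 1, 4: 2},
}

oblivious_changes = changed_entries(old_routes, oblivious_routes)
preserving_changes = changed_entries(old_routes, preserving_routes)
print(len(oblivious_changes), len(preserving_changes))
```

In this toy case both new routings spread the two remote destinations over two uplinks, but the path-preserving one requires fewer table entries to be updated and distributed.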
Notes
The OpenFabrics Enterprise Distribution (OFED) is the de facto standard software stack for deploying IB-based applications. http://openfabrics.org/.
Multi-homed nodes can be treated as multiple distinct nodes in the routing.
The multipaths available between leaf switches are different from the switch-to-switch paths in OpenSM’s fat-tree routing. The fat-tree routing algorithm uses single-path, non-balanced switch-to-switch routing, as only a relatively small amount of switch-to-switch traffic is anticipated.
The nodes connected to the same leaf switch have full bandwidth between them.
Acknowledgments
The authors would like to thank Mellanox Technologies for providing some of the hardware we use in our experiments.
Additional information
This work was supported by the Norwegian Research Council under the ERAC project (Project Number: 213283/O70).
Cite this article
Zahid, F., Gran, E.G., Bogdański, B. et al. Compact network reconfiguration in fat-trees. J Supercomput 72, 4438–4467 (2016). https://doi.org/10.1007/s11227-016-1759-y
DOI: https://doi.org/10.1007/s11227-016-1759-y