Abstract
Dragonfly is a popular topology for current and future high-speed interconnection networks, and exploiting topology information to accelerate collective operations is an active area of research. The All-reduce operation is a key collective communication primitive, used heavily in distributed machine learning (DML) and high-performance computing (HPC). The hierarchical structure of the dragonfly topology can be exploited to reduce the completion time of All-reduce: adjacent nodes communicate with low delay, so as much of the reduction as possible should be performed locally. In this paper, we propose g-PAARD, a general proximity-aware All-reduce communication scheme for the Dragonfly network. We study the impact of different routing mechanisms on the All-reduce algorithm and their sensitivity to topology size and message size. Our results show that the proposed topology-aware algorithm significantly reduces communication delay while having little impact on the network topology.
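To make the proximity-aware idea concrete, the sketch below shows a generic hierarchical All-reduce in MPI: values are first reduced inside each dragonfly group over short local paths, then All-reduced among one leader per group over the global links, and finally broadcast back within each group. This is a minimal illustration of the general technique, not the paper's g-PAARD algorithm; the `NODES_PER_GROUP` constant and the leader-based three-phase structure are assumptions made for the example.

```c
/* Hierarchical All-reduce sketch: reduce within a dragonfly group first
 * (cheap local links), then All-reduce across group leaders (global links),
 * then broadcast the result within each group. A generic illustration of
 * proximity-aware reduction, NOT the paper's g-PAARD algorithm;
 * NODES_PER_GROUP is an assumed parameter. */
#include <mpi.h>
#include <stdio.h>

#define NODES_PER_GROUP 4   /* assumed group size; set to match the machine */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int group = rank / NODES_PER_GROUP;   /* which dragonfly group we are in */
    MPI_Comm intra;                       /* communicator inside one group */
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &intra);

    int intra_rank;
    MPI_Comm_rank(intra, &intra_rank);

    /* Group leaders (intra_rank == 0) form the inter-group communicator. */
    MPI_Comm inter;
    MPI_Comm_split(MPI_COMM_WORLD, intra_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &inter);

    double local = (double)rank, sum = 0.0;

    /* Step 1: reduce onto the group leader over short local paths. */
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, intra);

    /* Step 2: All-reduce among leaders only, over the global links. */
    if (inter != MPI_COMM_NULL)
        MPI_Allreduce(MPI_IN_PLACE, &sum, 1, MPI_DOUBLE, MPI_SUM, inter);

    /* Step 3: broadcast the global result back inside each group. */
    MPI_Bcast(&sum, 1, MPI_DOUBLE, 0, intra);

    printf("rank %d: global sum = %.0f\n", rank, sum);

    if (inter != MPI_COMM_NULL) MPI_Comm_free(&inter);
    MPI_Comm_free(&intra);
    MPI_Finalize();
    return 0;
}
```

With `NODES_PER_GROUP = 4`, running on 8 ranks places ranks 0-3 and 4-7 in separate groups, so only the two leaders exchange data over the global links; every other message stays inside a group.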
Notes
- 1.
For example, using standard topology-agnostic algorithms, communication between two nodes requires at least 3 hops, and in some cases up to 6 hops, at each step of the All-reduce.
Acknowledgment
We thank the anonymous reviewers for their insightful comments. We gratefully acknowledge members of the Tianhe interconnect group at NUDT for many inspiring conversations. The work was supported by the National Key R&D Program of China under Grant No. 2018YFB0204300, the Excellent Youth Foundation of Hunan Province (Dezun Dong), and the National Postdoctoral Program for Innovative Talents under Grant No. BX20190091.
Copyright information
© 2022 IFIP International Federation for Information Processing
About this paper
Cite this paper
Ma, J., Dong, D., Li, C., Wu, K., Xiao, L. (2022). Evaluation of Topology-Aware All-Reduce Algorithm for Dragonfly Networks. In: Cérin, C., Qian, D., Gaudiot, J.L., Tan, G., Zuckerman, S. (eds) Network and Parallel Computing. NPC 2021. Lecture Notes in Computer Science, vol 13152. Springer, Cham. https://doi.org/10.1007/978-3-030-93571-9_19
DOI: https://doi.org/10.1007/978-3-030-93571-9_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93570-2
Online ISBN: 978-3-030-93571-9
eBook Packages: Computer Science (R0)