Abstract
Dragonflies are among the most promising topologies for the Exascale effort because of their scalability and cost. They achieve very high throughput under uniform traffic, but behave pathologically under other regular traffic patterns, some of which are very common in HPC applications, such as the multi-dimensional stencil communication pattern or certain permutation patterns. A recent study showed that randomizing task placement greatly improves the performance of these pathological traffic patterns by making the load they induce more similar to a uniformly distributed load. In this work we provide a theoretical model that predicts the expected performance of a generic dragonfly network under uniform traffic, and we use it to characterize performance-optimal, minimal-cost dragonflies. We then compare the predictions of this model with the performance obtained through detailed simulation of a wide range of dragonfly configurations. In these same scenarios, we explore the performance of other non-uniform traffic patterns and investigate the impact of randomization techniques based on both task placement and indirect routing. For these previously unexplored traffic patterns, we obtain results similar to those reported in previous works for the multi-dimensional stencil communication pattern: randomizing task placement and/or path choice is effective in improving the performance of pathological workloads. However, we also show that neither uniformization technique closes the gap between the performance of these traffic patterns and the ideal performance of uniform random traffic, leaving significant room for improvement (the best achieved performance is only roughly 50% of uniform performance).








Notes
For topologies with a number of nodes that is not a power of two, we exercise the bit reversal pattern in the remaining nodes by recursively partitioning them in decreasing powers of two.
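The partitioning described in this note can be sketched as follows. This is only one plausible reading, not the authors' implementation: it assumes that the nodes are split greedily into contiguous blocks of decreasing power-of-two size (largest first), and that each block applies bit reversal to the node's offset within that block. The function names `bit_reverse` and `bit_reversal_destinations` are hypothetical.

```python
from typing import List


def bit_reverse(value: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversal_destinations(num_nodes: int) -> List[int]:
    """Return dest[i], the destination of node i.

    Assumption (not stated in the source): nodes are partitioned into
    contiguous blocks whose sizes are decreasing powers of two, and bit
    reversal is applied to the offset inside each block.
    """
    dest = [0] * num_nodes
    base = 0
    remaining = num_nodes
    while remaining > 0:
        bits = remaining.bit_length() - 1  # largest power of two that fits
        block = 1 << bits
        for offset in range(block):
            dest[base + offset] = base + bit_reverse(offset, bits)
        base += block
        remaining -= block
    return dest


if __name__ == "__main__":
    print(bit_reversal_destinations(8))  # single block of 8 nodes
    print(bit_reversal_destinations(6))  # blocks of 4 and 2 nodes
```

Under these assumptions the mapping is always a permutation of the node set, because each block is permuted independently and a block of size one maps its node to itself.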
Acknowledgments
This work is an extension of a previous work entitled “Randomizing task placement does not randomize traffic (enough)”, published in the Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip (INA-OCMC), ACM, New York, US. Partially supported by the Spanish Government through an FPI scholarship.
Cite this article
Prisacari, B., Rodriguez, G., Jokanovic, A. et al. Randomizing task placement and route selection do not randomize traffic (enough). Des Autom Embed Syst 18, 171–182 (2014). https://doi.org/10.1007/s10617-014-9133-x
DOI: https://doi.org/10.1007/s10617-014-9133-x