Abstract
Applications on High Performance Computing (HPC) systems that use hierarchical topologies (such as dragonfly) increasingly exhibit high run-to-run performance variability, which poses a significant challenge for application developers, job schedulers, and system maintainers. One approach to addressing this variability is to adopt newly proposed network topologies such as megafly (or dragonfly+), which offer greater path diversity than a traditional fully connected dragonfly. Another approach is to use quality of service (QoS) traffic classes that provide bandwidth guarantees. In this work, we select HPC application workloads that have exhibited performance variability on current 2-D dragonfly systems and evaluate their baseline performance expectations on megafly and 1-D dragonfly network models with comparable network configurations. Our results show that the megafly network, despite using fewer virtual channels (VCs) for deadlock avoidance than a dragonfly, performs as well as a fully connected 1-D dragonfly network. We then exploit the VCs freed up by the megafly topology to incorporate QoS traffic classes, using bandwidth capping and traffic differentiation techniques to introduce multiple traffic classes in megafly networks. Our results show that in some cases QoS can completely mitigate application performance variability while causing minimal slowdown to background network traffic.
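The bandwidth capping and traffic differentiation mentioned above can be pictured as priority arbitration over per-class token buckets: each traffic class is granted a fraction of link bandwidth, and a capped class may only send while its bucket holds tokens. The sketch below is a minimal illustration of that idea, not the paper's CODES-based simulation model; the class structure, the 1 ms burst allowance, and the work-conserving fallback are all assumptions made for this example.

```python
# Minimal sketch of QoS traffic classes with bandwidth capping.
# NOT the CODES implementation; all names and parameters are illustrative.
from collections import deque
import time

class TrafficClass:
    def __init__(self, name, priority, cap_fraction, link_bw_bytes_per_s):
        self.name = name
        self.priority = priority          # lower value = higher priority
        self.queue = deque()              # packets (bytes objects) awaiting send
        # Token bucket enforcing this class's bandwidth cap.
        self.rate = cap_fraction * link_bw_bytes_per_s
        self.tokens = 0.0
        self.depth = self.rate * 1e-3     # allow roughly 1 ms of burst
        self.last_refill = time.monotonic()

    def can_send(self, nbytes):
        # Refill tokens for the elapsed time, then check the budget.
        now = time.monotonic()
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        return self.tokens >= nbytes

def arbitrate(classes):
    """Pick the next packet: the highest-priority class still within its
    bandwidth cap wins; if every backlogged class is over its cap, fall
    back to priority order so the link never idles (work conservation)."""
    by_priority = sorted(classes, key=lambda c: c.priority)
    for tc in by_priority:
        if tc.queue and tc.can_send(len(tc.queue[0])):
            tc.tokens -= len(tc.queue[0])
            return tc.name, tc.queue.popleft()
    for tc in by_priority:
        if tc.queue:                      # over-cap traffic uses leftover bandwidth
            return tc.name, tc.queue.popleft()
    return None
```

Under this model, one might register a high-priority `app` class capped at, say, 30% of link bandwidth alongside a best-effort `background` class; this mirrors, at a very high level and only as an assumption of this sketch, how capping a latency-sensitive class can isolate it from bulk background traffic while leaving spare bandwidth usable.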
Notes
1. The Scalable Workload Models code is available at the git repo: https://xgitlab.cels.anl.gov/codes/workloads.git.
Acknowledgment
This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative. This work used resources of the Argonne Leadership Computing Facility (ALCF), Rensselaer's CCI supercomputing center, and Argonne's Laboratory Computing Resource Center (LCRC).
Cite this paper
Mubarak, M., et al. (2019). Evaluating Quality of Service Traffic Classes on the Megafly Network. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds.) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol. 11501. Springer, Cham. https://doi.org/10.1007/978-3-030-20656-7_1