Abstract
Applications on High Performance Computing (HPC) systems that use hierarchical topologies (such as dragonfly) increasingly exhibit high run-to-run performance variability, which poses a significant challenge for application developers, job schedulers, and system maintainers. One approach to addressing this variability is to adopt newly proposed network topologies such as megafly (or dragonfly+), which offer greater path diversity than a traditional fully connected dragonfly. Another approach is to use quality of service (QoS) traffic classes that provide bandwidth guarantees. In this work, we select HPC application workloads that have exhibited performance variability on current 2-D dragonfly systems and evaluate their baseline performance expectations on megafly and 1-D dragonfly network models with comparable network configurations. Our results show that the megafly network, despite using fewer virtual channels (VCs) for deadlock avoidance than a dragonfly, performs as well as a fully connected 1-D dragonfly network. We then exploit the VCs freed up by the megafly topology to incorporate QoS traffic classes, using bandwidth capping and traffic differentiation techniques to introduce multiple traffic classes in megafly networks. Our results show that in some cases QoS can completely mitigate application performance variability while causing minimal slowdown to background network traffic.
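The bandwidth capping and traffic differentiation mentioned above can be pictured as priority arbitration over per-class token buckets: each traffic class is granted a fraction of link bandwidth, and a capped class may only send while its bucket holds tokens. The sketch below is a minimal illustration of that idea, not the paper's CODES-based simulation model; the class structure, the 1 ms burst allowance, and the work-conserving fallback are all assumptions made for this example.

```python
# Minimal sketch of QoS traffic classes with bandwidth capping.
# NOT the CODES implementation; all names and parameters are illustrative.
from collections import deque
import time

class TrafficClass:
    def __init__(self, name, priority, cap_fraction, link_bw_bytes_per_s):
        self.name = name
        self.priority = priority          # lower value = higher priority
        self.queue = deque()              # packets (bytes objects) awaiting send
        # Token bucket enforcing this class's bandwidth cap.
        self.rate = cap_fraction * link_bw_bytes_per_s
        self.tokens = 0.0
        self.depth = self.rate * 1e-3     # allow roughly 1 ms of burst
        self.last_refill = time.monotonic()

    def can_send(self, nbytes):
        # Refill tokens for the elapsed time, then check the budget.
        now = time.monotonic()
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        return self.tokens >= nbytes

def arbitrate(classes):
    """Pick the next packet: the highest-priority class still within its
    bandwidth cap wins; if every backlogged class is over its cap, fall
    back to priority order so the link never idles (work conservation)."""
    by_priority = sorted(classes, key=lambda c: c.priority)
    for tc in by_priority:
        if tc.queue and tc.can_send(len(tc.queue[0])):
            tc.tokens -= len(tc.queue[0])
            return tc.name, tc.queue.popleft()
    for tc in by_priority:
        if tc.queue:                      # over-cap traffic uses leftover bandwidth
            return tc.name, tc.queue.popleft()
    return None
```

Under this model, one might register a high-priority `app` class capped at, say, 30% of link bandwidth alongside a best-effort `background` class; this mirrors, at a very high level and only as an assumption of this sketch, how capping a latency-sensitive class can isolate it from bulk background traffic while leaving spare bandwidth usable.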
Notes
1. The Scalable Workload Models code is available at the git repo: https://xgitlab.cels.anl.gov/codes/workloads.git.
Acknowledgment
This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative. This work used resources of the Argonne Leadership Computing Facility (ALCF), Rensselaer's CCI supercomputing center, and Argonne's Laboratory Computing Resource Center (LCRC).
Cite this paper
Mubarak, M., et al. (2019). Evaluating Quality of Service Traffic Classes on the Megafly Network. In: Weiland, M., Juckeland, G., Trinitis, C., Sadayappan, P. (eds.) High Performance Computing. ISC High Performance 2019. Lecture Notes in Computer Science, vol. 11501. Springer, Cham. https://doi.org/10.1007/978-3-030-20656-7_1