DOI: 10.1145/3409390.3409398
research-article

Fast Modeling of Network Contention in Batch Point-to-point Communications by Packet-level Simulation with Dynamic Time-stepping

Published: 17 August 2020

Abstract

Network contention has long been one of the root causes of performance loss in large-scale parallel applications. With the increasing importance of performance modeling to both large-scale application optimization and application-system co-design, the conflict between speed and accuracy in contention modeling is becoming prominent. Cycle-accurate network simulators are often too slow for large-scale applications, while point-to-point analytical models are not accurate enough to capture contention effects. To model network contention in batch point-to-point communications, we propose a unified contention model that is modeled after the flow-fair end-to-end congestion control mechanism. The model uses packet-level simulation for accuracy, but it can be approximated by a flow-level semi-analytical model when messages are large enough, which makes it fast. Furthermore, we propose a dynamic time-stepping technique that significantly speeds up the packet-level simulation with only minor accuracy loss. Experiments with typical communication patterns and application traces show that our model accurately predicts the communication time, with an average error of 9% (fixed time step), and that the dynamic time-stepping technique improves simulation performance by up to 131 times, with an average accuracy loss of 10.5% for real application traces.
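To make the flow-level idea in the abstract concrete, the following is a minimal sketch of a generic fluid (flow-level) contention estimate. It is not the authors' model. It assumes that "flow-fair" can be approximated by max-min fair sharing of each link, computed by progressive filling, and that every flow traverses at least one link. The names `max_min_rates` and `completion_times`, the link identifiers, and the units are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def max_min_rates(flows: Dict[str, List[str]],
                  capacity: Dict[str, float]) -> Dict[str, float]:
    """Progressive filling: return a max-min fair rate for each flow.

    flows maps a flow id to the links it traverses (assumed non-empty);
    capacity maps a link id to its bandwidth, e.g. bytes per second.
    """
    remaining = dict(capacity)
    active = {f: set(links) for f, links in flows.items()}
    rates: Dict[str, float] = {}
    while active:
        # Equal share each link can still offer to its unfrozen flows.
        share = {}
        for link, cap in remaining.items():
            n = sum(1 for links in active.values() if link in links)
            if n:
                share[link] = cap / n
        # The link with the smallest share is the current bottleneck.
        bottleneck = min(share, key=share.get)
        rate = share[bottleneck]
        # Freeze every flow crossing the bottleneck at that rate.
        for f in [f for f, links in active.items() if bottleneck in links]:
            rates[f] = rate
            for link in active[f]:
                remaining[link] -= rate
            del active[f]
    return rates


def completion_times(sizes: Dict[str, float],
                     flows: Dict[str, List[str]],
                     capacity: Dict[str, float]) -> Dict[str, float]:
    """Advance from one flow completion to the next, re-sharing bandwidth."""
    left = dict(sizes)          # bytes still to send per flow (sizes > 0)
    now = 0.0
    done: Dict[str, float] = {}
    while left:
        rates = max_min_rates({f: flows[f] for f in left}, capacity)
        # Time until the first remaining flow finishes at the current rates.
        dt = min(left[f] / rates[f] for f in left)
        now += dt
        for f in list(left):
            left[f] -= rates[f] * dt
            if left[f] <= 1e-9 * sizes[f]:   # tolerance for rounding
                done[f] = now
                del left[f]
    return done


# Two messages share one link of capacity 10 units/s.
times = completion_times({"A": 50.0, "B": 100.0},
                         {"A": ["a"], "B": ["a"]},
                         {"a": 10.0})
print(times)
```

In this example both flows first receive 5 units/s. Flow A finishes at t = 10, after which B gets the whole link and finishes at t = 15. A contention-free point-to-point estimate would give 5 and 10. A sketch like this only steps at flow completions, so it ignores packet-level effects that the paper's packet-level simulation is designed to capture.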

Published In

ICPP Workshops '20: Workshop Proceedings of the 49th International Conference on Parallel Processing
August 2020
186 pages
ISBN: 9781450388689
DOI: 10.1145/3409390

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Batch Point-to-point Communication
  2. Dynamic Time-stepping
  3. Network Contention
  4. Packet-level Simulation
  5. Performance Modeling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP Workshops '20
ICPP Workshops '20: Workshops
August 17 - 20, 2020
Edmonton, AB, Canada

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

