High-Performance PDES on Manycore Clusters

Authors:
Barry Williams, Binghamton University, Binghamton, NY, USA
Ali Eker, Binghamton University, Binghamton, NY, USA
Kenneth Chiu, Binghamton University, Binghamton, NY, USA
Dmitry Ponomarev, Binghamton University, Binghamton, NY, USA

SIGSIM-PADS '21: Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, May 2021, Pages 153–164, https://doi.org/10.1145/3437959.3459252

Published: 01 June 2021

ABSTRACT

The performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by fine-grain communication, especially in execution environments with high communication cost. The low latency of on-chip communication in emerging manycore processors promises to substantially alleviate conventional PDES bottlenecks. However, scaling to manycore clusters requires balancing fast on-chip communication against slower traditional network communication between cluster nodes. In this work, we investigate the performance of PDES on a cluster of Intel Knights Landing (KNL) processors, identify performance bottlenecks, and propose techniques to address them. Specifically, we propose three performance optimizations: (1) a new design of the communication buffer centered on atomic compare-and-swap operations, which reduces synchronization overhead between a dedicated communication thread and the computation threads; (2) careful selection of the number of computation threads per communication thread, which limits the pressure on each communication thread; and (3) balancing the timing of communication and computation threads to ensure their synchronized forward progress. Combined, these optimizations yield a 2x to 16x speedup over baseline implementations in the ROSS simulator.
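To make optimization (1) more concrete, the following is a minimal sketch of one common way to build a lock-free buffer around compare-and-swap: a bounded ring in which several computation threads claim slots with a CAS on a shared tail index and a single communication thread drains them. It is not the design from the paper, whose details are not given in this abstract. The names `Event`, `CommBuffer`, `try_push`, and `try_pop`, the event fields, and the per-slot sequence-number scheme are all assumptions chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical event payload; the fields are illustrative only.
struct Event {
    std::uint64_t dest_lp;  // assumed: destination logical process id
    double recv_ts;         // assumed: receive timestamp
};

// Bounded ring buffer: many producers (computation threads),
// one consumer (the dedicated communication thread).
class CommBuffer {
public:
    explicit CommBuffer(std::size_t capacity)
        : slots_(new Slot[capacity]), mask_(capacity - 1) {
        // Capacity must be a power of two so that "& mask_" wraps indices.
        assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
        for (std::size_t i = 0; i < capacity; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Called by computation threads. Returns false if the buffer is full.
    bool try_push(const Event& ev) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            std::intptr_t diff = static_cast<std::intptr_t>(seq) -
                                 static_cast<std::intptr_t>(pos);
            if (diff == 0) {
                // Slot is free for position pos: claim it with a CAS on tail_.
                if (tail_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    s.ev = ev;
                    // Publish the slot to the consumer.
                    s.seq.store(pos + 1, std::memory_order_release);
                    return true;
                }
                // CAS failed: pos now holds the current tail, so retry.
            } else if (diff < 0) {
                return false;  // slot not yet drained: buffer is full
            } else {
                pos = tail_.load(std::memory_order_relaxed);
            }
        }
    }

    // Called only by the communication thread. Returns false if empty.
    bool try_pop(Event& out) {
        Slot& s = slots_[head_ & mask_];
        if (s.seq.load(std::memory_order_acquire) != head_ + 1)
            return false;  // nothing published at the head position yet
        out = s.ev;
        // Mark the slot as free for the producer one lap ahead.
        s.seq.store(head_ + mask_ + 1, std::memory_order_release);
        ++head_;
        return true;
    }

private:
    struct Slot {
        std::atomic<std::size_t> seq;
        Event ev;
    };
    std::unique_ptr<Slot[]> slots_;
    const std::size_t mask_;
    std::atomic<std::size_t> tail_{0};  // shared by all producers
    std::size_t head_ = 0;              // touched by the consumer only
};
```

In this sketch a computation thread would call `try_push` when it generates a remote event and back off or retry when it returns `false`, while the communication thread would call `try_pop` in a loop and hand each event to the network layer. Producers never take a lock; contention is limited to the single CAS on the tail index.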

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Computing methodologies
  1. Modeling and simulation
    1. Simulation types and techniques
      1. Discrete-event simulation
      2. Massively parallel and high-performance simulations
  2. Parallel computing methodologies
    1. Parallel algorithms
      1. Shared memory algorithms

Published in
SIGSIM-PADS '21: Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
May 2021
181 pages
ISBN:9781450382960
DOI:10.1145/3437959
General Chairs: Saikou Diallo (Old Dominion University, USA), Andreas Tolk (The MITRE Corporation, USA)
Program Chair: Philippe Giabbanelli (Miami University, USA)
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cluster
communication
manycore
parallel discrete event simulation
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 398 of 779 submissions, 51%