ABSTRACT
The performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by fine-grain communication, especially in execution environments with high communication cost. The low latency of on-chip communication in emerging manycore processors promises to substantially alleviate conventional PDES bottlenecks. However, scaling to manycore clusters requires balancing faster on-chip communication with slower traditional network communication between cluster nodes. In this work, we investigate the performance of PDES on a cluster of Intel's Knights Landing (KNL) processors, identify performance bottlenecks, and propose techniques to address them. Specifically, we propose three performance optimizations: (1) a new design of the communication buffer centered around the use of atomic compare-and-swap operations to reduce synchronization overhead between a dedicated communication thread and computation threads; (2) careful selection of the number of computation threads per communication thread to limit the pressure on each communication thread; and (3) balancing the timing of communication and computation threads to ensure their synchronized forward progress. Combined, these optimizations result in a 2x to 16x speedup over baseline implementations in the ROSS simulator.
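To make optimization (1) concrete, the following is a minimal sketch of how several computation threads could hand messages to one communication thread using compare-and-swap instead of a lock. It is not the buffer design of this work, which the abstract does not detail; it is a generic lock-free multi-producer, single-consumer list, and the names `Message`, `OutgoingBuffer`, and `run_demo` are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical message record; the fields are illustrative only.
struct Message {
    int dest = 0;            // destination rank (assumed field)
    double timestamp = 0.0;  // virtual send time (assumed field)
    Message* next = nullptr; // intrusive link used by the buffer
};

// Generic lock-free multi-producer, single-consumer buffer.
// This is an illustrative assumption, not the design from the paper.
class OutgoingBuffer {
    std::atomic<Message*> head_{nullptr};

public:
    // Called by computation threads: link the message in front of the
    // current head and retry the compare-and-swap until it succeeds.
    void push(Message* m) {
        Message* old_head = head_.load(std::memory_order_relaxed);
        do {
            m->next = old_head;
        } while (!head_.compare_exchange_weak(old_head, m,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    // Called by the communication thread: detach the whole list with a
    // single atomic exchange. Messages come back newest-first.
    Message* drain() {
        return head_.exchange(nullptr, std::memory_order_acquire);
    }
};

// Demo: `producers` computation threads each push `per_producer`
// messages while the calling thread acts as the communication thread.
// Returns the number of messages the communication thread received.
std::size_t run_demo(int producers, int per_producer) {
    OutgoingBuffer buffer;
    std::vector<Message> storage(
        static_cast<std::size_t>(producers) * per_producer);
    std::vector<std::thread> workers;

    for (int p = 0; p < producers; ++p) {
        workers.emplace_back([&, p] {
            for (int i = 0; i < per_producer; ++i) {
                Message& m =
                    storage[static_cast<std::size_t>(p) * per_producer + i];
                m.dest = p;
                m.timestamp = i;
                buffer.push(&m);
            }
        });
    }

    std::size_t received = 0;
    while (received < storage.size()) {
        for (Message* m = buffer.drain(); m != nullptr; m = m->next) {
            ++received;  // a real communication thread would send m here
        }
        std::this_thread::yield();
    }
    for (auto& w : workers) {
        w.join();
    }
    return received;
}
```

In this sketch, producers contend only on a single pointer and the consumer never blocks them, which is the kind of synchronization saving the abstract attributes to compare-and-swap. Because `drain` returns messages in reverse insertion order, a real communication thread would need to reorder them or rely on timestamps.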