DOI: 10.1145/3437959.3459252
research-article
Public Access

High-Performance PDES on Manycore Clusters

Published: 01 June 2021

ABSTRACT

The performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by fine-grained communication, especially in execution environments with high communication cost. The low latency of on-chip communication in emerging manycore processors promises to substantially alleviate conventional PDES bottlenecks. However, scaling to manycore clusters requires balancing faster on-chip communication with slower traditional network communication between cluster nodes. In this work, we investigate the performance of PDES on a cluster of Intel Knights Landing (KNL) processors, identify performance bottlenecks, and propose techniques to address them. Specifically, we propose three performance optimizations: (1) a new design of the communication buffer centered on atomic compare-and-swap operations, which reduces synchronization overhead between a dedicated communication thread and the computation threads; (2) careful selection of the number of computation threads per communication thread, which limits the pressure on each communication thread; and (3) balancing the timing of communication and computation threads so that they make synchronized forward progress. Combined, these optimizations yield a 2x to 16x speedup over baseline implementations in the ROSS simulator.
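The abstract does not describe the buffer layout, so the following is only a generic illustration of the idea behind optimization (1), not the authors' design. It is a minimal C++ sketch of a bounded buffer in which several computation threads claim slots with a compare-and-swap on a shared tail index and a single communication thread drains them, with no locks on either side. The names `Event`, `CommBuffer`, `try_push`, and `try_pop`, the event fields, and the per-slot sequence-number scheme are all assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical event payload; the fields are illustrative only.
struct Event {
    double timestamp;
    std::uint32_t dest_node;
};

// Bounded multi-producer / single-consumer buffer (assumed design).
// Each slot carries a sequence number that tells producers whether the
// slot is free and tells the consumer whether it has been published.
class CommBuffer {
public:
    explicit CommBuffer(std::size_t capacity)
        : slots_(capacity), mask_(capacity - 1) {
        // Capacity must be a power of two so that (pos & mask_) wraps.
        assert(capacity >= 2 && (capacity & mask_) == 0);
        for (std::size_t i = 0; i < capacity; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Called by computation threads. Returns false if the buffer is full.
    bool try_push(const Event& ev) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::ptrdiff_t>(seq) -
                        static_cast<std::ptrdiff_t>(pos);
            if (diff == 0) {
                // Slot is free: claim it by advancing the tail with a CAS.
                if (tail_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    s.ev = ev;
                    // Publish the slot to the communication thread.
                    s.seq.store(pos + 1, std::memory_order_release);
                    return true;
                }
                // CAS failed: pos now holds the current tail, so retry.
            } else if (diff < 0) {
                return false;  // Slot not yet drained: buffer is full.
            } else {
                // Another producer claimed this slot; reload the tail.
                pos = tail_.load(std::memory_order_relaxed);
            }
        }
    }

    // Called only by the single communication thread.
    bool try_pop(Event& out) {
        Slot& s = slots_[head_ & mask_];
        std::size_t seq = s.seq.load(std::memory_order_acquire);
        if (seq != head_ + 1) return false;  // Empty or not yet published.
        out = s.ev;
        // Mark the slot free for the producer one lap ahead.
        s.seq.store(head_ + slots_.size(), std::memory_order_release);
        ++head_;
        return true;
    }

private:
    struct Slot {
        std::atomic<std::size_t> seq{0};
        Event ev{};
    };
    std::vector<Slot> slots_;
    std::size_t mask_;
    std::atomic<std::size_t> tail_{0};  // Shared by all producers.
    std::size_t head_ = 0;              // Private to the consumer.
};
```

In a setup like the one the abstract describes, each computation thread would call `try_push` for outgoing remote events, while the communication thread would loop over `try_pop` and hand the events to the network layer. A `false` return from `try_push` signals backpressure, which is one place where the ratio of computation threads to communication threads, optimization (2), would matter.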


Published in

SIGSIM-PADS '21: Proceedings of the 2021 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
May 2021
181 pages
ISBN: 9781450382960
DOI: 10.1145/3437959

Copyright © 2021 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 398 of 779 submissions, 51%
