Runtime Techniques for Automatic Process Virtualization

ABSTRACT
Asynchronous many-task runtimes look promising for the next generation of high performance computing systems. However, these runtimes are usually based on new programming models, and porting existing applications to them requires extensive programmer effort. An alternative approach is to reimagine the execution model of widely used programming APIs, such as MPI, so that they can be executed more asynchronously. Virtualization is a powerful technique for executing a bulk synchronous parallel program in an asynchronous manner. Moreover, if the virtualized entities can be migrated between address spaces, the runtime can optimize execution with dynamic load balancing, fault tolerance, and other adaptive techniques.
Previous work on automating process virtualization has explored compiler approaches, source-to-source refactoring tools, and runtime methods. These approaches achieve virtualization with different tradeoffs in portability (across architectures, operating systems, compilers, and linkers), programmer effort required, and the ability to handle all kinds of global state and programming languages. We implement support for three related runtime methods, discuss their shortcomings and their applicability to user-level virtualized process migration, and compare their performance to that of existing approaches. One of our new methods achieves what we consider the best overall functionality in terms of portability, automation, support for migration, and runtime performance.