Addressing Global Data Dependencies in Heterogeneous Asynchronous Runtime Systems on GPUs
- Univ. of Utah, Salt Lake City, UT (United States)
Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying the all-to-all communication of data dependencies among the nodes. Intranodal challenges include gathering these data dependencies into usable data objects while avoiding data duplication. This paper addresses these challenges within the context of a large-scale, industrial coal boiler simulation using the Uintah asynchronous many-task runtime system on GPU architectures. We show a significant reduction in time spent analyzing data dependencies through refinements in our dependency search algorithm. Multiple task graphs are used to eliminate subsequent analysis when task graphs change in predictable and repeatable ways. A combined data store and task scheduler redesign reduces data dependency duplication, ensuring that problems fit within host and GPU memory. Furthermore, these modifications did not require any changes to application code or sweeping changes to the Uintah runtime system. We report results from running on the DOE Titan system on 119K CPU cores and 7.5K GPUs simultaneously. Our solutions can be generalized to other task dependency problems with global dependencies among thousands of nodes that must be processed efficiently at large scale.
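The idea of keeping multiple task graphs, so that dependency analysis is not repeated when the graph changes in a predictable and repeatable way, can be illustrated with a minimal sketch. This is not Uintah code or its API: the class `TaskGraphCache`, the functions `compile_graph` and `graph_kind`, the task names, and the assumption that a global-dependency task recurs every fixed number of timesteps are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]  # (producer task, consumer task)


class TaskGraphCache:
    """Compile each distinct task graph once and reuse it afterwards."""

    def __init__(self, compile_fn: Callable[[str], List[Edge]]) -> None:
        self._compile_fn = compile_fn
        self._graphs: Dict[str, List[Edge]] = {}
        self.compile_count = 0  # number of times the analysis actually ran

    def get(self, kind: str) -> List[Edge]:
        # Run the (expensive) dependency analysis only on first use.
        if kind not in self._graphs:
            self._graphs[kind] = self._compile_fn(kind)
            self.compile_count += 1
        return self._graphs[kind]


def compile_graph(kind: str) -> List[Edge]:
    """Hypothetical stand-in for dependency analysis; returns graph edges."""
    edges: List[Edge] = [("advect", "update")]
    if kind == "with_global":
        # Assumed extra task that needs data from every node.
        edges.append(("global_gather", "update"))
    return edges


def graph_kind(step: int, period: int) -> str:
    """Assumed pattern: the global task runs once every `period` steps."""
    return "with_global" if step % period == 0 else "local_only"


cache = TaskGraphCache(compile_graph)
kinds = [graph_kind(step, period=5) for step in range(20)]
for kind in kinds:
    graph = cache.get(kind)  # a real runtime would execute this graph here
```

In this sketch the analysis runs once per graph variant rather than once per timestep, so twenty timesteps trigger only two compilations. The trade-off is that every variant must be known and held in memory up front.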
- Research Organization:
- Univ. of Utah, Salt Lake City, UT (United States); Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA); USDOE Office of Science (SC)
- DOE Contract Number:
- NA0002375; AC05-00OR22725
- OSTI ID:
- 1582428
- Journal Information:
- Proceedings of the 3rd International IEEE Workshop on Extreme Scale Programming Models and Middleware; Conference: 3rd International IEEE Workshop on Extreme Scale Programming Models and Middleware (ESPM2'17), Denver, CO (United States), 12 Nov 2017
- Country of Publication:
- United States
- Language:
- English
Node failure resiliency for Uintah without checkpointing (journal, June 2019)
Similar Records
Automatic Halo Management for the Uintah GPU-Heterogeneous Asynchronous Many-Task Runtime
Radiation modeling using the Uintah heterogeneous CPU/GPU runtime system. In: XSEDE '12 Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond, Article No. 4