Abstract
We will present our industrial experience deploying software and heterogeneous hardware platforms to support end-to-end workflows in the power systems design engineering space. Such workflows include classical physics-based High Performance Computing (HPC) simulations, GPU-based ML training and validation, as well as pre- and post-processing on commodity CPU systems.
The software architecture is characterized by message-oriented middleware which normalizes distributed, heterogeneous compute assets into a single ecosystem. Services provide enterprise authentication, data management, version tracking, digital provenance, and asynchronous event triggering, fronted by a secure API, a Python SDK, and monitoring GUIs. This tooling enables a range of workflow classes, from simple unitary jobs through complex multi-modal workflows. The software development process was informed by, and makes use of, several national laboratory software packages whose impact and opportunities are also discussed.
Utilizing this architecture, automated workflow processes focused on complex and industrially relevant applications have been developed. These leverage the asynchronous triggering and job distribution capabilities of the architecture to greatly improve design capabilities. The physics-based workflows involve simple Python-based pre-processing, proprietary Linux-based physics solvers, and multiple distinct HPC steps, each of which required unique inputs and produced distinct outputs. Post-processing via proprietary Fortran and Python scripts is used to generate training data for machine learning (ML) algorithms. Physics model results are then provided to ML algorithms on GPU compute nodes to optimize the ML models against design criteria. Finally, the ML-optimized results are validated by running the identified designs back through the physics-based workflow.
Notes
- 1. Size matters – data size and locality must be addressed during workflow implementation and execution. Stated simply, the data must move to the compute, or the compute must be available where the data resides. A workflow system which provides abstractions over data location and site-specific application execution would be useful, as described in this paper, although we will not address the topic of “big data” specifically.
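The co-location principle in this note can be sketched as a simple placement policy. This is an illustrative sketch, not the paper's implementation; the `Dataset`, `ComputeSite`, and per-site ingress-cost figures are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    site: str          # site where the data currently resides
    size_gb: float

@dataclass
class ComputeSite:
    name: str
    # hypothetical ingress cost in seconds per GB transferred to this site
    ingress_cost_s_per_gb: float

def place_job(data: Dataset, sites: list[ComputeSite]) -> ComputeSite:
    """Prefer compute co-located with the data; otherwise move the
    data to the site with the cheapest estimated transfer cost."""
    for s in sites:
        if s.name == data.site:
            return s
    return min(sites, key=lambda s: data.size_gb * s.ingress_cost_s_per_gb)

sites = [ComputeSite("ge-hpc", 2.0), ComputeSite("nersc", 0.5)]
# data already at "nersc": the job follows the data
job_site = place_job(Dataset("mesh-v3", "nersc", 800.0), sites)
```

A real system would fold in queue wait times and policy constraints, but the two-way choice – move the data or move the compute – is the essential abstraction.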
- 2. This is similar to other categorizations of workflow types [13].
- 3. In our compute model, as we will see below, this can be implemented as one site or two – as a single site with two distinct runtime “compute types” within it sharing the same enterprise authentication scheme, or as two distinct sites which happen to have the same authentication scheme in common.
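The single-site, two-compute-type variant might be described as in the following sketch. The site names, scheduler fields, and `site.compute_type` addressing convention are illustrative assumptions, not the actual DT4D configuration schema:

```python
# Hypothetical site description: one site, two runtime "compute types"
# sharing a single enterprise authentication scheme.
SITES = {
    "ge-research": {
        "auth": "corporate-sso",
        "compute_types": {
            "hpc": {"scheduler": "slurm", "arch": "x86_64"},
            "gpu": {"scheduler": "slurm", "arch": "a100"},
        },
    }
}

def resolve(target: str) -> dict:
    """Resolve a 'site.compute_type' target string to its runtime description."""
    site, ctype = target.split(".")
    return SITES[site]["compute_types"][ctype]

runtime = resolve("ge-research.gpu")
```

The two-site variant would simply list two top-level entries whose `auth` values coincide.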
- 4. On one site the user’s identity might be “sbrown”, on another “Susan.Brown”, etc.
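Reconciling such per-site identities typically means mapping a canonical identity to each site's expected login name. A minimal sketch, assuming a hypothetical in-memory map (the canonical address and site names are invented for illustration):

```python
# Hypothetical identity map: canonical user -> per-site login names.
IDENTITY_MAP = {
    "susan.brown@example.com": {
        "ge-site": "sbrown",
        "lab-site": "Susan.Brown",
    }
}

def site_identity(canonical: str, site: str) -> str:
    """Resolve a canonical identity to the login name a given site expects."""
    try:
        return IDENTITY_MAP[canonical][site]
    except KeyError:
        raise KeyError(f"no identity mapping for {canonical!r} at site {site!r}")
```

In production this mapping would live behind the Auth subsystem, not in code, but the lookup shape is the same.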
- 5. While the national laboratory computing facilities are valuable resources, for industrial applications their utility is practically limited to non-proprietary workloads. A further impediment to their integrated use is the necessary national laboratory security perimeter, and while the Superfacility API alleviates some of those impediments, the SFAPI is not yet deployed across all national facilities. Workflows which span sites, including commercial sites like GE’s, have their own issues with security perimeters. As we will show later in this paper, the design of an inter-site workflow system must be realistic about the difficulty of bi-directional data flow – it is not trivial to open a communication channel into a corporate network, and a system which does not depend on such access is more likely to be adopted.
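One common pattern that avoids inbound access is outbound-only polling: an agent inside the corporate perimeter pulls work from an externally reachable broker, so no channel is ever opened into the corporate network. The sketch below stands in the broker with an in-process queue; the message fields are invented for illustration:

```python
import queue

# Stand-in for a message broker reachable from inside the corporate
# network via outbound connections only; no inbound channel is required.
broker = queue.Queue()

def submit_job(payload):
    """An external collaborator publishes a job request to the broker."""
    broker.put(payload)

def poll_once():
    """The agent inside the corporate perimeter polls outbound for work,
    returning one pending job request, or None if the broker is empty."""
    try:
        return broker.get_nowait()
    except queue.Empty:
        return None

submit_job({"job": "cfd-case-12", "inputs": ["mesh.h5"]})
claimed = poll_once()
```

Results flow back the same way: the agent publishes outbound rather than accepting connections, which is why this pattern tends to clear corporate security review.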
- 6. The SFAPI has a richer administrative API.
- 7. Intra-job (i.e., in-situ) workflows do not require all of these pillars – e.g., their authorization to run is validated before launch, their computing is pre-provisioned, etc.
- 8. The GE Spin component is not yet implemented. Conceptually it utilizes existing vendor APIs such as those of AWS, VMware, etc.
- 9. While the use of containers is ideal for this purpose, and is implementable within DT4D, practical limitations – system security requirements, commercial software license terms, etc. – sometimes necessitate the ability to understand and interface with statically configured systems and runtimes.
- 10. An examination of prior art included ADIOS2 [17], which also provides an MPI-based transport but implements far more functionality than our use cases demanded; we therefore erred on the side of simplicity.
- 11. A current side project involves the inC2 library, which permits a simulation application to transmit an informational job status message containing a declarative GUI definition – e.g., a set of simple form control parameters. The DT4D GUI can then render this application-specific form. Use cases include interactive simulation steering.
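Such a declarative-GUI status message might look like the sketch below. The message shape and field names are illustrative assumptions, not the actual inC2 wire format:

```python
import json

# Hypothetical shape of an informational status message carrying a
# declarative form definition for interactive steering.
status_msg = {
    "job_id": "sim-042",
    "state": "RUNNING",
    "gui": {
        "title": "Steering controls",
        "controls": [
            {"type": "slider", "name": "relaxation", "min": 0.1, "max": 1.0},
            {"type": "checkbox", "name": "write_checkpoint"},
        ],
    },
}

def render(form):
    """A GUI would map each declared control to a native widget; here
    we emit one descriptive line per control as a stand-in."""
    return [f"{c['type']}: {c['name']}" for c in form["controls"]]

wire = json.dumps(status_msg)              # what the simulation transmits
widgets = render(json.loads(wire)["gui"])  # what the monitoring GUI renders
```

The key point is that the simulation, not the GUI, owns the form definition, so new applications need no GUI-side changes.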
- 12. Future work includes defining a formal interface and resolving conflicts across collaborating Auth subsystems, for example, where collaborators from multiple labs necessitate identity federation. Future run optimization can be accomplished using Run Repo and hardware profile metadata. The implementation of the Spin component is also planned.
- 13. Future work includes adding a cross-site GUI which conceptually subsumes that of DT4D.
- 14. Examination of prior art included ALCF Balsam [19]. We also looked at LLNL’s Flux [20], and at RADICAL-Pilot [21], which seemed directed at two of the three workflow types we describe. LANL’s BEE addresses reproducibility by using application containers and extending the runtime model with pre-execution CI/CD [22]. Pegasus is centered on an API for describing a priori DAGs [23]. Many other projects approach the workflow space from different non-comprehensive points of view, and each has its own potential strengths and weaknesses [24]. The build vs. buy decision is multi-faceted.
- 15. It is possible, of course, to generate a graph representation of a Python script from its abstract syntax tree – a tactic we demonstrated to users to a somewhat unimpressed response; the raw code was the preferred medium.
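For the curious, the AST-to-graph tactic is straightforward with the standard library's `ast` module. This sketch extracts direct (caller, callee) edges from a toy script; it is a minimal illustration, not the tool shown to users:

```python
import ast

SCRIPT = """
def preprocess(): ...
def solve(): ...
def main():
    preprocess()
    solve()
"""

def call_edges(source):
    """Return (caller, callee) edges for direct name calls inside
    each function definition of the given Python source."""
    tree = ast.parse(source)
    edges = []
    for fn in ast.walk(tree):
        if isinstance(fn, ast.FunctionDef):
            for node in ast.walk(fn):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    edges.append((fn.name, node.func.id))
    return edges

edges = call_edges(SCRIPT)
```

A graph of these edges is easy to draw, but as the note observes, users preferred reading the code itself.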
- 16. The visualization and navigation of dynamic and graph-oriented workflow traces, including references to the data produced and consumed at each step, was demonstrated in the early phases of this project, but better understanding how to apply such visualizations to real use cases is a subject of near-term future work.
- 17. The team has utilized Spack [25] and CMake in the build system. GitLab CI is used for internal CI/CD.
- 18. NERSC 2021 Director Reserve Allocation, project m3952, “MxN Application Readiness”, A. Gallo PI.
References
Arthur, R.: Provenance for decision-making. In: Medium (2020). https://richardarthur.medium.com/provenance-for-decision-making-bf2c89d76ec2. Accessed 6 June 2022
Digital Thread for Design | GE Research. https://www.ge.com/research/technology-domains/digital-technologies/digital-thread-design. Accessed 6 June 2022
Feeling The Burn: Inside The Boot Camp For Elite Gas Turbines | GE News. https://www.ge.com/news/reports/feeling-the-burn-inside-the-boot-camp-for-elite-gas-turbines. Accessed 6 June 2022
High Performance Computing | GE Research. https://www.ge.com/research/technology-domains/digital-technologies/high-performance-computing. Accessed 6 June 2022
Hatakeyama, J., Farr, D., Seal, D.J.: Accelerating the MBE ecosystem through cultural transformation
GE Research Uses Summit Supercomputer for Groundbreaking Study on Wind Power | GE News. https://www.ge.com/news/press-releases/ge-research-uses-summit-supercomputer-groundbreaking-study-wind-power. Accessed 6 June 2022
Ang, J., Hoang, T., Kelly, S., et al.: Advanced simulation and computing co-design strategy (2016)
Compute Systems. In: Oak Ridge Leadership Computing Facility. https://www.olcf.ornl.gov/olcf-resources/compute-systems/. Accessed 6 June 2022
Coughlin, T.: Compute Cambrian Explosion. In: Forbes. https://www.forbes.com/sites/tomcoughlin/2019/04/26/compute-cambrian-explosion/. Accessed 6 June 2022
What do you mean by “Event-Driven”? In: martinfowler.com. https://martinfowler.com/articles/201701-event-driven.html. Accessed 6 June 2022
Arthur, R.: Machine-augmented Mindfulness. In: Medium (2020). https://richardarthur.medium.com/machine-augmented-mindfulness-e844f9c54985. Accessed 6 June 2022
Wilkinson, M.D., et al.: The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
Deelman, E., Peterka, T., Altintas, I., et al.: The future of scientific workflows. Int. J. High Perform. Comput. Appl. 32, 159–175 (2018)
NERSC SuperFacility API - Swagger UI. https://api.nersc.gov/api/v1.2/#/status/read_planned_outages_status_outages_planned__name__get. Accessed 6 June 2022
Slurm Workload Manager - squeue. https://slurm.schedmd.com/squeue.html. Accessed 6 June 2022
Multiple Program Multiple Data programming with MPI. CFD on the GO (2022)
Godoy, W.F., Podhorszki, N., Wang, R., et al.: ADIOS 2: the adaptable input output system. A framework for high-performance data management. SoftwareX 12, 100561 (2020). https://doi.org/10.1016/j.softx.2020.100561
Gallo, A.: lwfm (2022)
Salim, M.A., Uram, T.D., Childers, J.T., et al.: Balsam: automated scheduling and execution of dynamic, data-intensive HPC workflows. arXiv (2019)
Ahn, D.H., Bass, N., Chu, A., et al.: Flux: overcoming scheduling challenges for exascale workflows, vol. 10 (2020)
Merzky, A., Turilli, M., Titov, M., et al.: Design and performance characterization of RADICAL-pilot on leadership-class platforms (2021)
Chen, J., Guan, Q., Zhang, Z., et al.: BeeFlow: a workflow management system for in situ processing across HPC and cloud systems. In: 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp 1029–1038 (2018)
This research used the Pegasus Workflow Management Software funded by the National Science Foundation under grant #1664162
Arthur, R.: Co-Design Web. In: Medium (2021). https://richardarthur.medium.com/co-design-web-6f37664ac1e1. Accessed 6 June 2022
Gamblin, T., LeGendre, M., Collette, M.R., et al.: The Spack package manager: bringing order to HPC software chaos. In: SC 2015: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)
Han, J.-C., Wright, L.: Enhanced internal cooling of turbine blades and vanes. In: Gas Turbine Handbook. Department of Energy - National Energy Technology Laboratory, p. 34 (2006)
Acharya, S., Kanani, Y.: Chapter three - advances in film cooling heat transfer. In: Sparrow, E.M., Abraham, J.P., Gorman, J.M. (eds.) Advances in Heat Transfer, pp. 91–156. Elsevier, Amsterdam (2017)
Tallman, J.A., Osusky, M., Magina, N., Sewall, E.: An assessment of machine learning techniques for predicting turbine airfoil component temperatures, using FEA simulations for training data. In: Volume 5A: Heat Transfer. American Society of Mechanical Engineers, Phoenix, Arizona, USA, p. V05AT20A002 (2019)
How Apple, Google, and Microsoft will kill passwords and phishing in one stroke | Ars Technica. https://arstechnica.com/information-technology/2022/05/how-apple-google-and-microsoft-will-kill-passwords-and-phishing-in-1-stroke/. Accessed 6 June 2022
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Gallo, A., Claydon, I., Tucker, E., Arthur, R. (2022). Industrial Experience Deploying Heterogeneous Platforms for Use in Multi-modal Power Systems Design Workflows. In: Doug, K., Al, G., Pophale, S., Liu, H., Parete-Koon, S. (eds) Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. SMC 2022. Communications in Computer and Information Science, vol 1690. Springer, Cham. https://doi.org/10.1007/978-3-031-23606-8_16