Abstract

We present our industrial experience deploying software and heterogeneous hardware platforms to support end-to-end workflows in the power systems design engineering space. Such workflows include classical physics-based High Performance Computing (HPC) simulations, GPU-based ML training and validation, as well as pre- and post-processing on commodity CPU systems.

The software architecture is characterized by message-oriented middleware which normalizes distributed, heterogeneous compute assets into a single ecosystem. Services provide enterprise authentication, data management, version tracking, digital provenance, and asynchronous event triggering, fronted by a secure API, a Python SDK, and monitoring GUIs. This tooling enables various classes of workflows, from simple, unitary jobs through complex multi-modal workflows. The software development process was informed by and uses several national laboratory software packages, whose impact and opportunities will also be discussed.

Utilizing this architecture, automated workflow processes focused on complex and industrially relevant applications have been developed. These leverage the asynchronous triggering and job distribution capabilities of the architecture to greatly improve design capability. The physics-based workflows involve simple Python-based pre-processing, proprietary Linux-based physics solvers, and multiple distinct HPC steps, each of which requires unique inputs and produces distinct outputs. Post-processing via proprietary Fortran and Python scripts is used to generate training data for machine learning algorithms. Physics model results are then provided to machine learning (ML) algorithms on GPU compute nodes to optimize the ML models against design criteria. Finally, the ML-optimized results are validated by running the identified designs through the physics-based workflow.
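
As a concrete illustration of the asynchronous, event-triggered chaining described above (a physics step completing and thereby triggering a downstream ML step), the following minimal Python sketch is offered. All names in it (EventBus, run_physics_step, run_ml_training) are hypothetical stand-ins for illustration only; they are not the DT4D API.

    # Toy event bus and two workflow steps, illustrating the pattern only.
    from collections import defaultdict
    from typing import Callable, Dict, List

    class EventBus:
        """Minimal stand-in for the message-oriented middleware."""
        def __init__(self) -> None:
            self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

        def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
            self._subscribers[topic].append(handler)

        def publish(self, topic: str, payload: dict) -> None:
            for handler in self._subscribers[topic]:
                handler(payload)

    bus = EventBus()

    def run_physics_step(design: dict) -> None:
        # Stand-in for the HPC physics solve; publishes a completion event
        # rather than blocking its caller on the downstream steps.
        results = {"design": design, "metal_temperatures_K": [900.0, 912.5, 898.3]}
        bus.publish("physics.completed", results)

    def run_ml_training(results: dict) -> None:
        # Stand-in for GPU-based ML training triggered by the completion event.
        print("training surrogate on", len(results["metal_temperatures_K"]), "samples")

    # Wire the chain: ML training fires when the physics step completes.
    bus.subscribe("physics.completed", run_ml_training)
    run_physics_step({"component": "demo-airfoil"})

In the actual system the publish/subscribe broker, job distribution, and data movement are provided by the middleware and SDK described in the paper; the sketch only conveys the control-flow idea.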

Notes

  1. Size matters – data size and locality must be addressed during workflow implementation and execution. Stated simply, the data must move to the compute, or the compute must be available where the data resides. A workflow system which provides abstractions over data location and site-specific application execution would be useful, as described in this paper, although we will not address the topic of “big data” specifically.

  2. This is similar to other categorizations of workflow types [13].

  3. In our compute model, as we will see below, this can be implemented as one site or two – as a single site with two distinct runtime “compute types” within it sharing the same enterprise authentication scheme, or as two distinct sites which happen to have the same authentication scheme in common.

  4. On one site the user’s identity might be “sbrown”, on another “Susan.Brown”, etc.

  5. While the national laboratory computing facilities are valuable resources, for industrial applications their utility is practically limited to non-proprietary workloads. A further impediment to their integrated use is the necessary national laboratory security perimeter; while the Superfacility API alleviates some of those impediments, the SFAPI is not yet deployed across all national facilities. Workflows which span sites, including commercial sites like GE’s, have their own issues with security perimeters. As we will show later in this paper, the design of an inter-site workflow system needs to be realistic about the lack of ease of bi-directional data flow – it is not a trivial matter to open a communication channel into a corporate network, and a system which does not depend on such access likely has a higher probability of adoption.

  6. SFAPI has a richer administrative API.

  7. Intra-job, i.e., in-situ, workflows do not require all these pillars – e.g., their authorization to run is validated before launch, their compute is pre-provisioned, etc.

  8. The GE Spin component is not yet implemented. Conceptually it utilizes existing vendor APIs such as those of AWS, VMware, etc.

  9. While the use of containers is ideal for this purpose, and is implementable within DT4D, practical limitations in system security requirements, commercial software license terms, etc. necessitate the ability to sometimes understand and interface with statically configured systems and runtimes.

  10. An examination of prior art included ADIOS2 [17], which also provides an MPI-based transport but implements far more functionality than our use cases demanded; we therefore erred on the side of simplicity.

  11. A current side-project involves the inC2 library permitting a simulation application to transmit an informational job status message which contains a declarative GUI definition – e.g., a set of simple form control parameters. The DT4D GUI can then render this application-specific form. Use cases include interactive simulation steering. A minimal sketch of such a message appears after these notes.

  12. Future work includes defining a formal interface and resolving conflicts across collaborating Auth subsystems, for example, in a situation where collaboration across multiple labs necessitates identity federation. Future run optimization can be accomplished using Run Repo and hardware profile metadata. The implementation of the Spin component is also planned.

  13. Future work includes adding a cross-site GUI which conceptually subsumes that of DT4D.

  14. Examination of prior art included ALCF Balsam [19]. We also looked at LLNL’s Flux [20], and at RADICAL-Pilot [21], which seemed directed at two of the three workflow types we describe. LANL’s BEE addresses reproducibility by using application containers and extending the runtime model with pre-execution CI/CD [22]. Pegasus is centered on an API for describing a priori DAGs [23]. Many other projects approach the workflow space from different non-comprehensive points of view, and each has its own potential strengths and weaknesses [24]. The build vs. buy decision is multi-faceted.

  15. It is of course possible to generate a graph representation of a Python script from its abstract syntax tree – a tactic we demonstrated to users, to a somewhat lukewarm response; the raw code was the preferred medium. A minimal sketch of this tactic appears after these notes.

  16. The visualization and navigation of dynamic and graph-oriented workflow traces, including references to the data produced and consumed at each step, was demonstrated in the early phases of this project, but it is a subject of near-term future work to better understand how to apply such visualizations to real use cases.

  17. The team has utilized Spack [25] and CMake in the build system. GitLab CI is used for internal CI/CD.

  18. NERSC 2021 Director Reserve Allocation, project m3952, “MxN Application Readiness”, A. Gallo PI.
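
Regarding Note 11, the following sketch illustrates the general shape such a status message might take: a job status payload carrying a declarative form definition which a GUI could render. The field names and structure are hypothetical and do not represent the actual inC2/DT4D message schema.

    import json

    # Hypothetical job status message carrying a declarative GUI (form) definition.
    status_message = {
        "jobId": "sim-042",
        "status": "RUNNING",
        "info": {
            "form": {
                "title": "Steering parameters",
                "controls": [
                    {"type": "number", "name": "inlet_temperature_K", "value": 1500.0},
                    {"type": "number", "name": "relaxation_factor", "value": 0.7},
                ],
            }
        },
    }

    print(json.dumps(status_message, indent=2))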
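
Regarding Note 15, a graph representation can be derived from a script’s abstract syntax tree using only the Python standard library. The sketch below is a simplified illustration (each function mapped to the functions it calls); it is not the tooling referenced in the note.

    import ast
    import textwrap

    # A tiny stand-in workflow script to analyze.
    SCRIPT = textwrap.dedent("""
        def preprocess():
            pass

        def solve():
            preprocess()

        def main():
            solve()
    """)

    def call_graph(source: str) -> dict:
        """Map each function to the names of the functions it calls."""
        tree = ast.parse(source)
        graph = {}
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                graph[node.name] = [
                    child.func.id
                    for child in ast.walk(node)
                    if isinstance(child, ast.Call) and isinstance(child.func, ast.Name)
                ]
        return graph

    print(call_graph(SCRIPT))
    # {'preprocess': [], 'solve': ['preprocess'], 'main': ['solve']}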

References

  1. Arthur, R.: Provenance for decision-making. In: Medium (2020). https://richardarthur.medium.com/provenance-for-decision-making-bf2c89d76ec2. Accessed 6 June 2022

  2. Digital Thread for Design | GE Research. https://www.ge.com/research/technology-domains/digital-technologies/digital-thread-design. Accessed 6 June 2022

  3. Feeling The Burn: Inside The Boot Camp For Elite Gas Turbines | GE News. https://www.ge.com/news/reports/feeling-the-burn-inside-the-boot-camp-for-elite-gas-turbines. Accessed 6 June 2022

  4. High Performance Computing | GE Research. https://www.ge.com/research/technology-domains/digital-technologies/high-performance-computing. Accessed 6 June 2022

  5. Hatakeyama, J., Farr, D., Seal, D.J.: Accelerating the MBE ecosystem through cultural transformation

  6. GE Research Uses Summit Supercomputer for Groundbreaking Study on Wind Power | GE News. https://www.ge.com/news/press-releases/ge-research-uses-summit-supercomputer-groundbreaking-study-wind-power. Accessed 6 June 2022

  7. Ang, J., Hoang, T., Kelly, S., et al.: Advanced simulation and computing co-design strategy (2016)

  8. Compute Systems. In: Oak Ridge Leadership Computing Facility. https://www.olcf.ornl.gov/olcf-resources/compute-systems/. Accessed 6 June 2022

  9. Coughlin, T.: Compute Cambrian Explosion. In: Forbes. https://www.forbes.com/sites/tomcoughlin/2019/04/26/compute-cambrian-explosion/. Accessed 6 June 2022

  10. What do you mean by “Event-Driven”? In: martinfowler.com. https://martinfowler.com/articles/201701-event-driven.html. Accessed 6 June 2022

  11. Arthur, R.: Machine-augmented Mindfulness. In: Medium (2020). https://richardarthur.medium.com/machine-augmented-mindfulness-e844f9c54985. Accessed 6 June 2022

  12. Wilkinson, M.D., et al.: The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

  13. Deelman, E., Peterka, T., Altintas, I., et al.: The future of scientific workflows. Int. J. High Perform. Comput. Appl. 32, 159–175 (2018)

  14. NERSC SuperFacility API - Swagger UI. https://api.nersc.gov/api/v1.2/#/status/read_planned_outages_status_outages_planned__name__get. Accessed 6 June 2022

  15. Slurm Workload Manager - squeue. https://slurm.schedmd.com/squeue.html. Accessed 6 June 2022

  16. Multiple Program Multiple Data programming with MPI. CFD on the GO (2022)

  17. Godoy, W.F., Podhorszki, N., Wang, R., et al.: ADIOS 2: the adaptable input output system. A framework for high-performance data management. SoftwareX 12, 100561 (2020). https://doi.org/10.1016/j.softx.2020.100561

  18. Gallo, A.: lwfm (2022)

  19. Salim, M.A., Uram, T.D., Childers, J.T., et al.: Balsam: automated scheduling and execution of dynamic, data-intensive HPC workflows. arXiv (2019)

  20. Ahn, D.H., Bass, N., Chu, A., et al.: Flux: overcoming scheduling challenges for exascale workflows, vol. 10 (2020)

  21. Merzky, A., Turilli, M., Titov, M., et al.: Design and performance characterization of RADICAL-pilot on leadership-class platforms (2021)

  22. Chen, J., Guan, Q., Zhang, Z., et al.: BeeFlow: a workflow management system for in situ processing across HPC and cloud systems. In: 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp. 1029–1038 (2018)

  23. This research used the Pegasus Workflow Management Software funded by the National Science Foundation under grant #1664162

  24. Arthur, R.: Co-Design Web. In: Medium (2021). https://richardarthur.medium.com/co-design-web-6f37664ac1e1. Accessed 6 June 2022

  25. Gamblin, T., LeGendre, M., Collette, M.R., et al.: The Spack package manager: bringing order to HPC software chaos. In: SC 2015: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)

  26. Han, J.-C., Wright, L.: Enhanced internal cooling of turbine blades and vanes. In: Gas Turbine Handbook. Department of Energy - National Energy Technology Laboratory, p. 34 (2006)

  27. Acharya, S., Kanani, Y.: Chapter three - advances in film cooling heat transfer. In: Sparrow, E.M., Abraham, J.P., Gorman, J.M. (eds.) Advances in Heat Transfer, pp. 91–156. Elsevier, Amsterdam (2017)

  28. Tallman, J.A., Osusky, M., Magina, N., Sewall, E.: An assessment of machine learning techniques for predicting turbine airfoil component temperatures, using FEA simulations for training data. In: Volume 5A: Heat Transfer. American Society of Mechanical Engineers, Phoenix, Arizona, USA, p. V05AT20A002 (2019)

  29. How Apple, Google, and Microsoft will kill passwords and phishing in one stroke | Ars Technica. https://arstechnica.com/information-technology/2022/05/how-apple-google-and-microsoft-will-kill-passwords-and-phishing-in-1-stroke/. Accessed 6 June 2022
