Abstract

We present our industrial experience deploying software and heterogeneous hardware platforms to support end-to-end workflows in the power systems design engineering space. Such workflows include classical physics-based High Performance Computing (HPC) simulations, GPU-based ML training and validation, as well as pre- and post-processing on commodity CPU systems.

The software architecture is characterized by message-oriented middleware which normalizes distributed, heterogeneous compute assets into a single ecosystem. Services provide enterprise authentication, data management, version tracking, digital provenance, and asynchronous event triggering, fronted by a secure API, a Python SDK, and monitoring GUIs. This tooling enables various classes of workflows, from simple, unitary jobs through complex multi-modal workflows. The software development process was informed by and uses several national laboratory software packages, whose impact and opportunities will also be discussed.

Utilizing this architecture, automated workflow processes focused on complex and industrially relevant applications have been developed. These leverage the asynchronous triggering and job distribution capabilities of the architecture to greatly improve design capability. The physics-based workflows involve simple Python-based pre-processing, proprietary Linux-based physics solvers, and multiple distinct HPC steps, each of which requires unique inputs and produces distinct outputs. Post-processing via proprietary Fortran and Python scripts is used to generate training data for machine learning algorithms. Physics model results are then provided to machine learning (ML) algorithms on GPU compute nodes to optimize the ML models against design criteria. Finally, the ML-optimized results are validated by running the identified designs through the physics-based workflow.
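
As a concrete illustration of the asynchronous, event-triggered chaining described above (a physics step completing and thereby triggering a downstream ML step), the following minimal Python sketch is offered. All names in it (EventBus, run_physics_step, run_ml_training) are hypothetical stand-ins for illustration only; they are not the DT4D API.

    # Toy event bus and two workflow steps, illustrating the pattern only.
    from collections import defaultdict
    from typing import Callable, Dict, List

    class EventBus:
        """Minimal stand-in for the message-oriented middleware."""
        def __init__(self) -> None:
            self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

        def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
            self._subscribers[topic].append(handler)

        def publish(self, topic: str, payload: dict) -> None:
            for handler in self._subscribers[topic]:
                handler(payload)

    bus = EventBus()

    def run_physics_step(design: dict) -> None:
        # Stand-in for the HPC physics solve; publishes a completion event
        # rather than blocking its caller on the downstream steps.
        results = {"design": design, "metal_temperatures_K": [900.0, 912.5, 898.3]}
        bus.publish("physics.completed", results)

    def run_ml_training(results: dict) -> None:
        # Stand-in for GPU-based ML training triggered by the completion event.
        print("training surrogate on", len(results["metal_temperatures_K"]), "samples")

    # Wire the chain: ML training fires when the physics step completes.
    bus.subscribe("physics.completed", run_ml_training)
    run_physics_step({"component": "demo-airfoil"})

In the actual system the publish/subscribe broker, job distribution, and data movement are provided by the middleware and SDK described in the paper; the sketch only conveys the control-flow idea.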

Notes

  1. Size matters – data size and locality must be addressed during workflow implementation and execution. Stated simply, the data must move to the compute, or the compute must be available where the data resides. A workflow system which provides abstractions over data location and site-specific application execution would be useful, as described in this paper, although we will not address the topic of “big data” specifically.

  2. This is similar to other categorizations of workflow types [13].

  3. In our compute model, as we will see below, this can be implemented as one site or two – as a single site with two distinct runtime “compute types” within it sharing the same enterprise authentication scheme, or as two distinct sites which happen to have the same authentication scheme in common.

  4. On one site the user’s identity might be “sbrown”, on another “Susan.Brown”, etc.

  5. While the national laboratory computing facilities are valuable resources, for industrial applications their utility is practically limited to non-proprietary workloads. A further impediment to their integrated use is the necessary national laboratory security perimeter; while the Superfacility API alleviates some of those impediments, the SFAPI is not yet deployed across all national facilities. Workflows which span sites, including commercial sites like GE’s, have their own issues with security perimeters. As we will show later in this paper, the design of an inter-site workflow system needs to be realistic about the lack of ease of bi-directional data flow – it is not a trivial matter to open a communication channel into a corporate network, and a system which does not depend on such access likely has a higher probability of adoption.

  6. SFAPI has a richer administrative API.

  7. Intra-job, i.e., in-situ, workflows do not require all these pillars – e.g., their authorization to run is validated before launch, their compute is pre-provisioned, etc.

  8. The GE Spin component is not yet implemented. Conceptually it utilizes existing vendor APIs such as those of AWS, VMware, etc.

  9. While the use of containers is ideal for this purpose, and is implementable within DT4D, practical limitations in system security requirements, commercial software license terms, etc. necessitate the ability to sometimes understand and interface with statically configured systems and runtimes.

  10. An examination of prior art included ADIOS2 [17], which also provides an MPI-based transport but implements far more functionality than our use cases demanded; we therefore erred on the side of simplicity.

  11. A current side-project involves the inC2 library permitting a simulation application to transmit an informational job status message which contains a declarative GUI definition – e.g., a set of simple form control parameters. The DT4D GUI can then render this application-specific form. Use cases include interactive simulation steering. A minimal sketch of such a message appears after these notes.

  12. Future work includes defining a formal interface and resolving conflicts across collaborating Auth subsystems, for example, in a situation where collaboration across multiple labs necessitates identity federation. Future run optimization can be accomplished using Run Repo and hardware profile metadata. The implementation of the Spin component is also planned.

  13. Future work includes adding a cross-site GUI which conceptually subsumes that of DT4D.

  14. Examination of prior art included ALCF Balsam [19]. We also looked at LLNL’s Flux [20], and at RADICAL-Pilot [21], which seemed directed at two of the three workflow types we describe. LANL’s BEE addresses reproducibility by using application containers and extending the runtime model with pre-execution CI/CD [22]. Pegasus is centered on an API for describing a priori DAGs [23]. Many other projects approach the workflow space from different non-comprehensive points of view, and each has its own potential strengths and weaknesses [24]. The build vs. buy decision is multi-faceted.

  15. It is of course possible to generate a graph representation of a Python script from its abstract syntax tree – a tactic we demonstrated to users, to a somewhat lukewarm response; the raw code was the preferred medium. A minimal sketch of this tactic appears after these notes.

  16. The visualization and navigation of dynamic and graph-oriented workflow traces, including references to the data produced and consumed at each step, was demonstrated in the early phases of this project, but it is a subject of near-term future work to better understand how to apply such visualizations to real use cases.

  17. The team has utilized Spack [25] and CMake in the build system. GitLab CI is used for internal CI/CD.

  18. NERSC 2021 Director Reserve Allocation, project m3952, “MxN Application Readiness”, A. Gallo PI.
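
Regarding Note 11, the following sketch illustrates the general shape such a status message might take: a job status payload carrying a declarative form definition which a GUI could render. The field names and structure are hypothetical and do not represent the actual inC2/DT4D message schema.

    import json

    # Hypothetical job status message carrying a declarative GUI (form) definition.
    status_message = {
        "jobId": "sim-042",
        "status": "RUNNING",
        "info": {
            "form": {
                "title": "Steering parameters",
                "controls": [
                    {"type": "number", "name": "inlet_temperature_K", "value": 1500.0},
                    {"type": "number", "name": "relaxation_factor", "value": 0.7},
                ],
            }
        },
    }

    print(json.dumps(status_message, indent=2))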
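
Regarding Note 15, a graph representation can be derived from a script’s abstract syntax tree using only the Python standard library. The sketch below is a simplified illustration (each function mapped to the functions it calls); it is not the tooling referenced in the note.

    import ast
    import textwrap

    # A tiny stand-in workflow script to analyze.
    SCRIPT = textwrap.dedent("""
        def preprocess():
            pass

        def solve():
            preprocess()

        def main():
            solve()
    """)

    def call_graph(source: str) -> dict:
        """Map each function to the names of the functions it calls."""
        tree = ast.parse(source)
        graph = {}
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                graph[node.name] = [
                    child.func.id
                    for child in ast.walk(node)
                    if isinstance(child, ast.Call) and isinstance(child.func, ast.Name)
                ]
        return graph

    print(call_graph(SCRIPT))
    # {'preprocess': [], 'solve': ['preprocess'], 'main': ['solve']}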

References

  1. Arthur, R.: Provenance for decision-making. In: Medium (2020). https://richardarthur.medium.com/provenance-for-decision-making-bf2c89d76ec2. Accessed 6 June 2022

  2. Digital Thread for Design | GE Research. https://www.ge.com/research/technology-domains/digital-technologies/digital-thread-design. Accessed 6 June 2022

  3. Feeling The Burn: Inside The Boot Camp For Elite Gas Turbines | GE News. https://www.ge.com/news/reports/feeling-the-burn-inside-the-boot-camp-for-elite-gas-turbines. Accessed 6 June 2022

  4. High Performance Computing | GE Research. https://www.ge.com/research/technology-domains/digital-technologies/high-performance-computing. Accessed 6 June 2022

  5. Hatakeyama, J., Farr, D., Seal, D.J.: Accelerating the MBE ecosystem through cultural transformation

  6. GE Research Uses Summit Supercomputer for Groundbreaking Study on Wind Power | GE News. https://www.ge.com/news/press-releases/ge-research-uses-summit-supercomputer-groundbreaking-study-wind-power. Accessed 6 June 2022

  7. Ang, J., Hoang, T., Kelly, S., et al.: Advanced simulation and computing co-design strategy (2016)

  8. Compute Systems. In: Oak Ridge Leadership Computing Facility. https://www.olcf.ornl.gov/olcf-resources/compute-systems/. Accessed 6 June 2022

  9. Coughlin, T.: Compute Cambrian Explosion. In: Forbes. https://www.forbes.com/sites/tomcoughlin/2019/04/26/compute-cambrian-explosion/. Accessed 6 June 2022

  10. What do you mean by “Event-Driven”? In: martinfowler.com. https://martinfowler.com/articles/201701-event-driven.html. Accessed 6 June 2022

  11. Arthur, R.: Machine-augmented Mindfulness. In: Medium (2020). https://richardarthur.medium.com/machine-augmented-mindfulness-e844f9c54985. Accessed 6 June 2022

  12. Wilkinson, M.D., et al.: The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18

  13. Deelman, E., Peterka, T., Altintas, I., et al.: The future of scientific workflows. Int. J. High Perform. Comput. Appl. 32, 159–175 (2018)

  14. NERSC SuperFacility API - Swagger UI. https://api.nersc.gov/api/v1.2/#/status/read_planned_outages_status_outages_planned__name__get. Accessed 6 June 2022

  15. Slurm Workload Manager - squeue. https://slurm.schedmd.com/squeue.html. Accessed 6 June 2022

  16. Multiple Program Multiple Data programming with MPI. CFD on the GO (2022)

  17. Godoy, W.F., Podhorszki, N., Wang, R., et al.: ADIOS 2: the adaptable input output system. A framework for high-performance data management. SoftwareX 12, 100561 (2020). https://doi.org/10.1016/j.softx.2020.100561

  18. Gallo, A.: lwfm (2022)

  19. Salim, M.A., Uram, T.D., Childers, J.T., et al.: Balsam: automated scheduling and execution of dynamic, data-intensive HPC workflows. arXiv (2019)

  20. Ahn, D.H., Bass, N., Chu, A., et al.: Flux: overcoming scheduling challenges for exascale workflows, vol. 10 (2020)

  21. Merzky, A., Turilli, M., Titov, M., et al.: Design and performance characterization of RADICAL-pilot on leadership-class platforms (2021)

  22. Chen, J., Guan, Q., Zhang, Z., et al.: BeeFlow: a workflow management system for in situ processing across HPC and cloud systems. In: 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp. 1029–1038 (2018)

  23. This research used the Pegasus Workflow Management Software funded by the National Science Foundation under grant #1664162

  24. Arthur, R.: Co-Design Web. In: Medium (2021). https://richardarthur.medium.com/co-design-web-6f37664ac1e1. Accessed 6 June 2022

  25. Gamblin, T., LeGendre, M., Collette, M.R., et al.: The Spack package manager: bringing order to HPC software chaos. In: SC 2015: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12 (2015)

  26. Han, J.-C., Wright, L.: Enhanced internal cooling of turbine blades and vanes. In: Gas Turbine Handbook. Department of Energy - National Energy Technology Laboratory, p. 34 (2006)

  27. Acharya, S., Kanani, Y.: Chapter three - advances in film cooling heat transfer. In: Sparrow, E.M., Abraham, J.P., Gorman, J.M. (eds.) Advances in Heat Transfer, pp. 91–156. Elsevier, Amsterdam (2017)

  28. Tallman, J.A., Osusky, M., Magina, N., Sewall, E.: An assessment of machine learning techniques for predicting turbine airfoil component temperatures, using FEA simulations for training data. In: Volume 5A: Heat Transfer. American Society of Mechanical Engineers, Phoenix, Arizona, USA, p. V05AT20A002 (2019)

  29. How Apple, Google, and Microsoft will kill passwords and phishing in one stroke | Ars Technica. https://arstechnica.com/information-technology/2022/05/how-apple-google-and-microsoft-will-kill-passwords-and-phishing-in-1-stroke/. Accessed 6 June 2022
