ABSTRACT
In an in-situ workflow, multiple components such as simulation and analysis applications are coupled via streaming data transfers. The multiplicity of possible configurations necessitates an auto-tuner for workflow optimization. Existing auto-tuning approaches are computationally expensive because many configurations must be sampled, by running the whole workflow repeatedly, in order to train the auto-tuner's surrogate model or otherwise explore the configuration space. To reduce these costs, we instead combine the performance models of the component applications by exploiting the analytical structure of the workflow, and use the combined model to selectively generate test configurations whose measurements guide the training of a machine-learning workflow surrogate model. Because training can focus on well-performing configurations, the resulting surrogate model achieves high prediction accuracy for good configurations despite being trained on fewer configurations in total. Experiments with real applications demonstrate that, for a fixed compute-time budget, our approach identifies significantly better configurations than other approaches.
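The core idea can be illustrated with a minimal sketch. The component models (`sim_time`, `analysis_time`), the cost constants, and the configuration grid below are all hypothetical placeholders, not the paper's actual models; the sketch only shows how analytical models of concurrently running components might be combined, per the workflow structure, to rank candidate configurations so that only the most promising ones are measured to train a surrogate.

```python
# Hypothetical analytical models of each component's runtime as a
# function of its core count (illustrative cost constants only).
def sim_time(cores):
    return 1000.0 / cores + 0.02 * cores      # compute term + overhead term

def analysis_time(cores):
    return 400.0 / cores + 0.01 * cores

def transfer_time(stride):
    return 5.0 / stride                       # less frequent output -> cheaper coupling

def workflow_time(sim_cores, ana_cores, stride):
    # In an in-situ workflow the components run concurrently and exchange
    # data by streaming, so the slower stage dominates the pipeline; the
    # coupling (data-transfer) cost is then added on top.
    return max(sim_time(sim_cores), analysis_time(ana_cores)) + transfer_time(stride)

# Enumerate candidate configurations and rank them with the combined model.
candidates = [(s, a, t) for s in (32, 64, 128, 256)
                        for a in (8, 16, 32, 64)
                        for t in (1, 5, 10)]
ranked = sorted(candidates, key=lambda c: workflow_time(*c))

# Only the most promising fraction would actually be run and measured;
# those measurements then train the workflow surrogate model, which can
# therefore concentrate its accuracy on well-performing configurations.
to_measure = ranked[:len(ranked) // 4]
```

The combination rule (`max` of overlapped stages plus transfer cost) is one plausible instance of exploiting workflow structure; the actual combination depends on how the components overlap and synchronize in a given workflow.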