OSTI.GOV: U.S. Department of Energy, Office of Scientific and Technical Information

Title: Adaptive Learning for Concept Drift in Application Performance Modeling

Conference

Supervised learning is a promising approach for modeling the performance of applications running on large HPC systems. A key assumption in supervised learning is that the training and testing data are obtained under the same conditions. In production HPC systems, however, this assumption might not hold, because the state of the platform can change over time as a result of hardware degradation, hardware replacement, software upgrades, and configuration updates. These changes can alter the data distribution in a way that reduces the accuracy of the predictive performance models and renders them less useful; this phenomenon is referred to as concept drift. Ignoring concept drift can lead to suboptimal resource usage and decreased efficiency when those performance models are deployed for tuning and job scheduling in production systems. To address this issue, we propose a concept-drift-aware predictive modeling approach that comprises two components: (1) an online Bayesian changepoint detection method that automatically identifies, in near-real time, the location of events that lead to concept drift and (2) a moment-matching transformation, inspired by transfer learning, that transforms the training data collected before the drift so that it remains useful for retraining. We use application input/output performance data collected on Cori, a production supercomputing system at the National Energy Research Scientific Computing Center, to demonstrate the effectiveness of our approach. The results show that concept-drift-aware models obtain a significant improvement in accuracy; the median absolute error of the best-performing Gaussian process regression improved by 58.8% when the proposed approaches were used.
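The abstract does not give the exact form of the moment-matching transformation, so the following is only a minimal sketch of one common reading: pre-drift measurements are shifted and rescaled so that their first two moments (mean and standard deviation) match those of the post-drift measurements. The function name `moment_match` and the sample values are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def moment_match(source, target):
    """Shift and scale `source` so its mean and population standard
    deviation equal those of `target`.

    Assumption: drift is approximated as an affine change in one variable.
    This is an illustrative simplification, not the paper's stated method.
    """
    mu_s, sd_s = mean(source), pstdev(source)
    mu_t, sd_t = mean(target), pstdev(target)
    if sd_s == 0:
        # A constant source carries no spread to rescale; map it to the target mean.
        return [mu_t for _ in source]
    scale = sd_t / sd_s
    return [mu_t + scale * (x - mu_s) for x in source]


# Hypothetical I/O throughput samples before and after a detected changepoint.
pre_drift = [10.0, 12.0, 14.0, 16.0]
post_drift = [20.0, 24.0, 28.0]

adapted = moment_match(pre_drift, post_drift)
print(adapted)
```

Under this reading, the adapted pre-drift samples could be pooled with the few post-drift samples to retrain the performance model, rather than discarding all data collected before the changepoint.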

Research Organization:
Argonne National Laboratory (ANL), Argonne, IL (United States)
Sponsoring Organization:
USDOE Office of Science - Office of Advanced Scientific Computing Research (ASCR) - Scientific Discovery through Advanced Computing (SciDAC)
DOE Contract Number:
AC02-06CH11357
OSTI ID:
1574301
Resource Relation:
Conference: 48th International Conference on Parallel Processing, 08/05/19 - 08/08/19, Kyoto, JP
Country of Publication:
United States
Language:
English
