DOI: 10.1145/1551609.1551632
research-article

Trace-based evaluation of job runtime and queue wait time predictions in grids

Published: 11 June 2009

Abstract

Large-scale distributed computing systems such as grids are serving a growing number of scientists. These environments bring not only the advantages of an economy of scale, but also the challenges of resource and workload heterogeneity. A consequence of these two forms of heterogeneity is that job runtimes and queue wait times are highly variable, which generally reduces system performance and makes grids difficult for the average scientist to use. The prediction of job runtimes and queue wait times has been widely studied for parallel environments. However, there is no detailed investigation of how the proposed prediction methods perform in grids, whose resource structure and workload characteristics are very different from those of parallel systems. In this paper, we assess the performance and benefit of predicting job runtimes and queue wait times in grids based on traces gathered from various research and production grid environments. First, we evaluate the performance of simple yet widely used time series prediction methods and the effect of applying them to different types of job classes (e.g., all jobs submitted by a single user or to a single site). Then, we investigate the performance of two kinds of queue wait time prediction methods for grids. Last, we investigate whether prediction-based grid-level scheduling policies can perform better than policies that do not use predictions.
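
To make the first evaluation step concrete, the following is a minimal sketch of how simple time series predictors could be applied per job class (per user or per site) over a trace. The abstract does not name the specific methods, so the three predictors here (last value, running mean, sliding median), the record fields `user`, `site`, and `runtime`, and all function names are illustrative assumptions rather than the paper's actual setup. As a simplification, the sketch also assumes that each job's runtime is known before the next job of the same class is submitted.

```python
from collections import defaultdict
from statistics import mean, median


def last_value(history):
    """Predict the most recently observed runtime."""
    return history[-1]


def running_mean(history):
    """Predict the mean of all runtimes observed so far."""
    return mean(history)


def sliding_median(history, window=3):
    """Predict the median of the last `window` observed runtimes."""
    return median(history[-window:])


def predict_per_class(jobs, key, predictor):
    """Replay a trace and predict each job's runtime from its class history.

    jobs:      records in submission order, each a dict with the
               (assumed) fields 'user', 'site', and 'runtime'.
    key:       field that defines the job class, e.g. 'user' or 'site'.
    predictor: function mapping a non-empty runtime history to a prediction.

    Returns (actual, predicted) pairs; the first job of each class is
    skipped because it has no history to predict from.
    """
    history = defaultdict(list)
    pairs = []
    for job in jobs:
        past = history[job[key]]
        if past:
            pairs.append((job["runtime"], predictor(past)))
        past.append(job["runtime"])
    return pairs


def mean_absolute_error(pairs):
    """Average absolute difference between actual and predicted runtimes."""
    return mean(abs(actual - predicted) for actual, predicted in pairs)
```

Calling `predict_per_class(trace, "user", last_value)` and `predict_per_class(trace, "site", last_value)` on the same trace, and comparing the resulting errors, gives one way to see how the choice of job class affects accuracy.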



Published In

HPDC '09: Proceedings of the 18th ACM international symposium on High performance distributed computing
June 2009
237 pages
ISBN:9781605585871
DOI:10.1145/1551609

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. grid scheduling
  2. performance evaluation
  3. predictions
  4. time series prediction methods
  5. trace-based simulation

Qualifiers

  • Research-article

Conference

HPDC '09

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%


Cited By

  • (2024) Tight Bounds for Dynamic Bin Packing with Predictions. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 8(3):1-28. DOI: 10.1145/3700437. Online publication date: 10-Dec-2024
  • (2024) A Hierarchical Deep Learning Approach for Predicting Job Queue Times in HPC Systems. SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 621-628. DOI: 10.1109/SCW63240.2024.00086. Online publication date: 17-Nov-2024
  • (2023) Runtime Variation in Big Data Analytics. Proceedings of the ACM on Management of Data, 1(1):1-20. DOI: 10.1145/3588921. Online publication date: 30-May-2023
  • (2023) Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-13. DOI: 10.1145/3581784.3607042. Online publication date: 12-Nov-2023
  • (2023) Mastering HPC Runtime Prediction: From Observing Patterns to a Methodological Approach. Practice and Experience in Advanced Research Computing 2023: Computing for the Common Good, pages 75-85. DOI: 10.1145/3569951.3593598. Online publication date: 23-Jul-2023
  • (2022) Dynamic Bin Packing with Predictions. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 6(3):1-24. DOI: 10.1145/3570605. Online publication date: 8-Dec-2022
  • (2022) Queue congestion prediction for large-scale high performance computing systems using a hidden Markov model. The Journal of Supercomputing, 78(10):12202-12223. DOI: 10.1007/s11227-022-04356-z. Online publication date: 28-Feb-2022
  • (2021) MXDAG. Proceedings of the 20th ACM Workshop on Hot Topics in Networks, pages 221-228. DOI: 10.1145/3484266.3487384. Online publication date: 10-Nov-2021
  • (2021) Forecasting System of Computational Time of DFT/TDDFT Calculations under the Multiverse Ansatz via Machine Learning and Cheminformatics. ACS Omega, 6(3):2001-2024. DOI: 10.1021/acsomega.0c04981. Online publication date: 14-Jan-2021
  • (2020) The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems. 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 158-171. DOI: 10.1109/DSN48063.2020.00034. Online publication date: Jun-2020
