Performance Implications of Failures in Large-Scale Cluster Scheduling

Zhang, Yanyong; Squillante, Mark S.; Sivasubramaniam, Anand; Sahoo, Ramendra K.

doi:10.1007/11407522_13

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3277)

Included in the following conference series:

Workshop on Job Scheduling Strategies for Parallel Processing

Abstract

As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those systems anticipating and accommodating the occurrence of failures. Failures become a commonplace feature of such large-scale systems, and one cannot continue to treat them as an exception. Despite the current and increasing importance of failures in these systems, our understanding of the performance impact of these critical issues on parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters and then we exploit this framework to conduct a detailed performance analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that such failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs seems to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system performance, with the former providing the greatest benefits.
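The two metrics named in the abstract, mean job response time and mean job slowdown, can be illustrated with a minimal sketch. It assumes the definitions common in the parallel job scheduling literature (response time = completion time minus arrival time; slowdown = response time divided by the job's failure-free execution time); the abstract itself does not state the exact definitions the paper uses. The `Job` record and its field names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are assumptions for illustration."""
    arrival: float   # time the job entered the system
    runtime: float   # execution time the job needs when no failure occurs
    finish: float    # time the job actually completed (including any rework)


def response_time(job: Job) -> float:
    # Assumed definition: total time in the system, from arrival to completion.
    return job.finish - job.arrival


def slowdown(job: Job) -> float:
    # Assumed definition: response time normalized by failure-free runtime.
    return response_time(job) / job.runtime


def mean_response_time(jobs: Sequence[Job]) -> float:
    return sum(response_time(j) for j in jobs) / len(jobs)


def mean_slowdown(jobs: Sequence[Job]) -> float:
    return sum(slowdown(j) for j in jobs) / len(jobs)


# Two jobs with the same demand: one runs undisturbed, the other loses
# work to a failure and finishes much later.
jobs = [
    Job(arrival=0.0, runtime=10.0, finish=10.0),
    Job(arrival=0.0, runtime=10.0, finish=40.0),
]
print(mean_response_time(jobs))  # 25.0
print(mean_slowdown(jobs))       # 2.5
```

In this toy case a single failure-delayed job raises the mean slowdown from 1.0 to 2.5, which is the kind of effect the abstract attributes to scheduling policies that ignore failures.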

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ, 08854, USA
Yanyong Zhang
Mathematical Sciences Department, IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY, 10598-0218, USA
Mark S. Squillante
Department of Computer Science and Engineering, Pennsylvania State University, 316 Pond Laboratory, University Park, PA, 16802-6106, USA
Anand Sivasubramaniam
Exploratory Server Systems Department, IBM T.J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY, 10598-0218, USA
Ramendra K. Sahoo

Editor information

Editors and Affiliations

Department of Computer Science, The Hebrew University of Jerusalem
Dror G. Feitelson
Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
Larry Rudolph
No Affiliations
Uwe Schwiegelshohn

About this paper

Cite this paper

Zhang, Y., Squillante, M.S., Sivasubramaniam, A., Sahoo, R.K. (2005). Performance Implications of Failures in Large-Scale Cluster Scheduling. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2004. Lecture Notes in Computer Science, vol 3277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11407522_13

DOI: https://doi.org/10.1007/11407522_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25330-3
Online ISBN: 978-3-540-31795-1
eBook Packages: Computer Science, Computer Science (R0)
