Performance Implications of Failures in Large-Scale Cluster Scheduling

  • Conference paper
Job Scheduling Strategies for Parallel Processing (JSSPP 2004)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3277)

Abstract

As large-scale parallel systems, many of them employing hundreds of computing engines, take on mission-critical roles, it is crucial to design them to anticipate and accommodate failures. Failures are commonplace in such systems and can no longer be treated as exceptions. Despite their growing importance, our understanding of how failures affect performance in parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters, and we use this framework to conduct a detailed analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs seems to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system performance, with the former providing the greatest benefits.
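The two metrics named in the abstract, mean job response time and mean job slowdown, can be illustrated with a minimal sketch. The abstract does not define them, so this assumes the definitions common in the job scheduling literature: response time is completion time minus arrival time, and slowdown is response time divided by the job's failure-free run time. The paper itself may use a variant (for example, a bounded slowdown). The `Job` fields and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    arrival: float      # time the job entered the queue
    completion: float   # time the job finished, including any work redone after failures
    runtime: float      # failure-free execution time (assumed > 0)


def response_time(job: Job) -> float:
    # Assumed definition: total time the job spent in the system.
    return job.completion - job.arrival


def slowdown(job: Job) -> float:
    # Assumed definition: response time normalized by failure-free run time.
    return response_time(job) / job.runtime


def summarize(jobs: list[Job]) -> tuple[float, float]:
    # Returns (mean response time, mean slowdown) over all jobs.
    return (
        mean(response_time(j) for j in jobs),
        mean(slowdown(j) for j in jobs),
    )


# Two jobs with the same 10-unit run time; the second is delayed
# (e.g. by queueing or by re-execution after a failure).
jobs = [Job(0.0, 10.0, 10.0), Job(0.0, 30.0, 10.0)]
mean_rt, mean_sd = summarize(jobs)
```

Under these assumed definitions, a failure that forces a job to restart lengthens its completion time without changing its failure-free run time, so it raises both metrics.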

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, Y., Squillante, M.S., Sivasubramaniam, A., Sahoo, R.K. (2005). Performance Implications of Failures in Large-Scale Cluster Scheduling. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2004. Lecture Notes in Computer Science, vol 3277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11407522_13

  • DOI: https://doi.org/10.1007/11407522_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25330-3

  • Online ISBN: 978-3-540-31795-1

  • eBook Packages: Computer Science, Computer Science (R0)
