Abstract
As computing evolves toward large-scale parallel systems, many of which employ hundreds of computing engines in mission-critical roles, it is crucial to design those systems to anticipate and accommodate failures. Failures are commonplace in such large-scale systems and can no longer be treated as exceptions. Despite the current and increasing importance of failures in these systems, our understanding of their performance impact on parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters, and we then use this framework to analyze in detail the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that such failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs seems to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system performance, with the former providing the greatest benefits.
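The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions (response time is completion time minus arrival time; slowdown is response time divided by the job's failure-free run time); the paper itself may use a variant such as a bounded slowdown. The `Job` class, its field names, and the sample values are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative assumptions."""
    arrival: float  # time the job entered the queue
    finish: float   # time the job completed, including any restarts after failures
    runtime: float  # failure-free execution time of the job


def response_time(job: Job) -> float:
    # Total time in the system: queueing, execution, and any work redone after failures.
    return job.finish - job.arrival


def slowdown(job: Job) -> float:
    # Response time normalized by the failure-free run time (assumed definition).
    return response_time(job) / job.runtime


def summarize(jobs: list[Job]) -> tuple[float, float]:
    # Returns (mean job response time, mean job slowdown).
    return mean(response_time(j) for j in jobs), mean(slowdown(j) for j in jobs)


# Two made-up jobs with the same run time: the second finishes much later,
# as it would if a failure forced it to restart.
jobs = [Job(arrival=0.0, finish=10.0, runtime=10.0),
        Job(arrival=0.0, finish=30.0, runtime=10.0)]
mean_rt, mean_sd = summarize(jobs)
```

Under these assumed definitions, any work a job loses to a failure lengthens its response time and therefore raises its slowdown, which is why both metrics are sensitive to scheduling policies that ignore failures.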
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, Y., Squillante, M.S., Sivasubramaniam, A., Sahoo, R.K. (2005). Performance Implications of Failures in Large-Scale Cluster Scheduling. In: Feitelson, D.G., Rudolph, L., Schwiegelshohn, U. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2004. Lecture Notes in Computer Science, vol 3277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11407522_13
DOI: https://doi.org/10.1007/11407522_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25330-3
Online ISBN: 978-3-540-31795-1
eBook Packages: Computer Science (R0)