SPSRG: a prediction approach for correlated failures in distributed computing systems

Cluster Computing

Abstract

Failure instances in distributed computing systems (DCSs) exhibit temporal and spatial correlations: a single failure instance can trigger a set of further failure instances, either simultaneously or in succession within a short time interval. In this work, we propose a correlated failure prediction approach (CFPA) to predict correlated failures of computing elements (CEs) in DCSs. The approach models correlated-failure patterns using the concept of probabilistic shared risk groups and predicts correlated failures by running an association rule mining approach in parallel. We conduct extensive experiments to evaluate the feasibility and effectiveness of CFPA using both failure traces from Los Alamos National Lab and simulated datasets. The experimental results show that the proposed approach outperforms other approaches in both failure prediction performance and execution time, and can potentially provide better prediction performance in a larger system.
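The idea of mining association rules over co-occurring failures can be illustrated with a minimal sketch. This is not the CFPA algorithm itself: it is a serial, pairwise simplification that leaves out the probabilistic shared risk groups and the parallel execution described in the abstract. The function name `mine_pair_rules`, the thresholds, and the toy CE identifiers are all hypothetical and chosen only for illustration. Each "window" is assumed to be the set of CEs that failed within one short time interval.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(windows, min_support=0.3, min_confidence=0.6):
    """Return rules (x, y, support, confidence) read as 'x fails => y fails'.

    Hypothetical helper: `windows` is a list of sets, each holding the CEs
    that failed together within one short time interval.
    """
    n = len(windows)
    item_count = Counter()   # number of windows in which each CE failed
    pair_count = Counter()   # number of windows in which both CEs failed
    for w in windows:
        items = sorted(set(w))
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        # Test the rule in both directions: a => b and b => a.
        for x, y in ((a, b), (b, a)):
            confidence = c / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Toy failure windows (made-up data, not taken from the LANL traces).
windows = [
    {"ce1", "ce2"},
    {"ce1", "ce2", "ce3"},
    {"ce3"},
    {"ce1", "ce2"},
    {"ce2", "ce4"},
]
rules = mine_pair_rules(windows)
for x, y, support, confidence in rules:
    print(f"{x} => {y}  support={support:.2f}  confidence={confidence:.2f}")
```

On this toy input, only the pair (ce1, ce2) clears the support threshold, so the sketch yields rules in both directions between those two CEs. A rule such as "ce1 => ce2" would then be read as: when ce1 fails, a failure of ce2 in the same interval is likely.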

Notes

  1. The operator “\(\backslash\)” denotes set minus, as in “\(X \backslash Y\)”, which means that the elements of \(Y\) are excluded from \(X\).

  2. Each \(I_{k}\) corresponds to a certain CE in a DCS.

Acknowledgments

The authors would like to thank the anonymous reviewers for their invaluable suggestions, which have been incorporated to improve the quality of the paper. This work was supported by the National Natural Science Foundation of China (No. 61372108).

Author information

Corresponding author

Correspondence to Weiwei Zheng.

Appendix: Notations

Table 6 summarizes the notations we used.

Table 6 Summary of notation used throughout this paper

Cite this article

Zheng, W., Wang, Z., Huang, H. et al. SPSRG: a prediction approach for correlated failures in distributed computing systems. Cluster Comput 19, 1703–1721 (2016). https://doi.org/10.1007/s10586-016-0633-2

  • DOI: https://doi.org/10.1007/s10586-016-0633-2
