SPSRG: a prediction approach for correlated failures in distributed computing systems

Zheng, Weiwei; Wang, Zhili; Huang, Haoqiu; Meng, Luoming; Qiu, Xuesong

doi:10.1007/s10586-016-0633-2

Published: 13 September 2016

Cluster Computing, volume 19, pages 1703–1721 (2016)


Abstract

Failure instances in distributed computing systems (DCSs) exhibit temporal and spatial correlations: a single failure instance can trigger a set of further failure instances, either simultaneously or successively within a short time interval. In this work, we propose a correlated failure prediction approach (CFPA) to predict correlated failures of computing elements (CEs) in DCSs. The approach models correlated-failure patterns using the concept of probabilistic shared risk groups and predicts correlated failures by running an association rule mining approach in parallel. We conduct extensive experiments to evaluate the feasibility and effectiveness of CFPA using both failure traces from Los Alamos National Lab and simulated datasets. The experimental results show that the proposed approach outperforms other approaches in both failure prediction performance and execution time, and can potentially provide better prediction performance in a larger system.
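
As a rough illustration of the association-rule idea mentioned in the abstract, the sketch below groups failure events into fixed time windows and derives pairwise rules of the form "if element a fails, element b fails in the same window", scored by support and confidence. It is not the authors' algorithm: the event format, the window width, the thresholds, and the function names `windows` and `pairwise_rules` are all assumptions made for illustration, and the sketch runs serially rather than in parallel.

```python
from collections import Counter
from itertools import combinations


def windows(events, width):
    """Group (timestamp, element) failure events into fixed-width time windows.

    Returns a list of sets; each set holds the elements that failed in one window.
    """
    buckets = {}
    for timestamp, element in events:
        buckets.setdefault(int(timestamp // width), set()).add(element)
    return [buckets[key] for key in sorted(buckets)]


def pairwise_rules(transactions, min_support, min_confidence):
    """Mine rules 'lhs fails -> rhs fails' between single elements.

    support(a, b)      = fraction of windows containing both a and b
    confidence(a -> b) = (windows with a and b) / (windows with a)
    """
    n = len(transactions)
    single = Counter()
    pair = Counter()
    for items in transactions:
        single.update(items)
        pair.update(combinations(sorted(items), 2))

    rules = {}
    for (a, b), count in pair.items():
        support = count / n
        if support < min_support:
            continue
        # Check the rule in both directions; confidence is not symmetric.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / single[lhs]
            if confidence >= min_confidence:
                rules[(lhs, rhs)] = (support, confidence)
    return rules


# Hypothetical failure trace: (time in seconds, computing element id).
trace = [
    (0, "ce1"), (5, "ce2"),
    (100, "ce1"), (110, "ce2"), (115, "ce3"),
    (200, "ce3"),
    (300, "ce1"), (310, "ce2"),
]
transactions = windows(trace, width=60)
rules = pairwise_rules(transactions, min_support=0.5, min_confidence=0.9)
```

With this toy trace, only the pair ce1 and ce2 clears both thresholds, so the sketch yields the two rules ce1 -> ce2 and ce2 -> ce1, each with support 0.75 and confidence 1.0.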

Notes

The operator “\(\backslash\)” denotes set difference, as in “\(X \backslash Y\)”, which means that the elements of \(Y\) are excluded from \(X\).
Each \(I_{k}\) corresponds to a certain computing element (CE) in a DCS.

Acknowledgments

The authors would like to thank the anonymous reviewers for their invaluable suggestions, which have been incorporated to improve the quality of the paper. This work was supported by the National Natural Science Foundation of China (No. 61372108).

Author information

Authors and Affiliations

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Weiwei Zheng, Zhili Wang, Haoqiu Huang, Luoming Meng & Xuesong Qiu

Corresponding author

Correspondence to Weiwei Zheng.

Appendix: Notations

Table 6 summarizes the notations we used.

Table 6 Summary of notation used throughout this paper

About this article

Cite this article

Zheng, W., Wang, Z., Huang, H. et al. SPSRG: a prediction approach for correlated failures in distributed computing systems. Cluster Comput 19, 1703–1721 (2016). https://doi.org/10.1007/s10586-016-0633-2

Received: 03 January 2016
Revised: 17 August 2016
Accepted: 29 August 2016
Published: 13 September 2016
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10586-016-0633-2
