Improving the Dependability of Grids via Short-Term Failure Predictions

  • Conference paper
Grids, P2P and Services Computing

Abstract

Computational Grids such as EGEE offer sufficient capacity for even the most challenging large-scale computational experiments and have thus become an indispensable tool for researchers in various fields. However, the utility of these infrastructures is severely hampered by their notoriously low reliability: a recent nine-month study found that only 48% of jobs submitted in South-Eastern Europe completed successfully. We attack this problem by means of proactive failure detection. Specifically, we predict site failures on a short-term time scale by deploying machine learning algorithms to discover relationships between site performance variables and subsequent failures. Resource Brokers can use such predictions to decide where to submit new jobs, and operators can use them to take preventive measures. Our experimental evaluation on a 30-day trace from 197 EGEE queues shows that prediction accuracy depends strongly on the selected queue, the type of failure, the preprocessing, and the choice of input variables.
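
As a rough illustration of the idea of short-term failure prediction, the sketch below turns a single queue's trace into windowed training examples and fits a nearest-centroid classifier. It is a minimal sketch under stated assumptions, not the method evaluated in the paper: the function names, the single numeric performance variable, the window and horizon sizes, the toy trace, and the choice of classifier are all hypothetical.

```python
from statistics import mean


def make_examples(values, failed, window, horizon):
    """Turn one queue's trace into (features, label) pairs.

    values  -- a performance variable sampled at regular intervals (hypothetical)
    failed  -- parallel list of booleans: was a failure observed at that sample?
    window  -- number of past samples used as input features
    horizon -- number of upcoming samples that count as "short-term"
    """
    examples = []
    for t in range(window, len(values) - horizon + 1):
        features = values[t - window:t]          # what we know at time t
        label = any(failed[t:t + horizon])       # does a failure follow soon?
        examples.append((features, label))
    return examples


def train_centroids(examples):
    """Compute the mean feature vector of each class (nearest-centroid model)."""
    centroids = {}
    for label in (False, True):
        rows = [f for f, y in examples if y == label]
        if rows:
            centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist2(centroids[label]))


# Toy trace (invented for illustration): failures follow two high readings.
trace_values = [1, 1, 1, 9, 9, 1, 1, 1, 9, 9, 1, 1]
trace_failed = [False] * 5 + [True] + [False] * 4 + [True, False]

model = train_centroids(make_examples(trace_values, trace_failed, window=2, horizon=1))
print(predict(model, [8, 9]))  # True: the window resembles earlier pre-failure windows
print(predict(model, [1, 1]))  # False
```

A Resource Broker could, in principle, consult such a per-queue prediction before choosing where to submit a job. A real deployment would need many variables per site, careful preprocessing, and per-queue evaluation, since the abstract reports that accuracy varies strongly with exactly these choices.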

Author information

Corresponding author

Correspondence to Artur Andrzejak.

Copyright information

© 2010 Springer US

About this paper

Cite this paper

Andrzejak, A., Zeinalipour-Yazti, D., Dikaiakos, M.D. (2010). Improving the Dependability of Grids via Short-Term Failure Predictions. In: Desprez, F., Getov, V., Priol, T., Yahyapour, R. (eds) Grids, P2P and Services Computing. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6794-7_3

  • DOI: https://doi.org/10.1007/978-1-4419-6794-7_3

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-6793-0

  • Online ISBN: 978-1-4419-6794-7

  • eBook Packages: Computer Science, Computer Science (R0)
