Studying and using failure data from large-scale internet services

Authors: David Oppenheimer, David A. Patterson

EW 10: Proceedings of the 10th workshop on ACM SIGOPS European workshop

Pages 255 - 258

https://doi.org/10.1145/1133373.1133427

Published: 01 July 2002

Abstract

Large-scale Internet services are the newest and arguably the most commercially important class of systems requiring 24x7 availability. As a result, very little information has been published about their causes of failure. In an attempt to address this deficiency, we have analyzed detailed failure reports from three large-scale Internet services. Our goals are to (1) identify the major factors contributing to user-visible failures, (2) evaluate the (potential) effectiveness of various techniques for preventing and mitigating service failure, and (3) build a fault model for service-level dependability and recovery benchmarks. Our initial results indicate that operator error and network problems are the leading contributors to user-visible failures, that failures in custom-written front-end software are significant, and that online testing and more thoroughly exposing and handling component failures would reduce failure rates in at least one service.

Published In

EW 10: Proceedings of the 10th workshop on ACM SIGOPS European workshop

July 2002

258 pages

ISBN: 9781450378062

DOI: 10.1145/1133373

General Chair: Gilles Muller

Program Chair: Eric Jul

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

Overall acceptance rate: 37 of 37 submissions (100%)
