DOI: 10.1145/1133373.1133427
Article

Studying and using failure data from large-scale internet services

Published: 01 July 2002

Abstract

Large-scale Internet services are the newest and arguably the most commercially important class of systems requiring 24x7 availability. As a result, very little information has been published about their causes of failure. In an attempt to address this deficiency, we have analyzed detailed failure reports from three large-scale Internet services. Our goals are to (1) identify the major factors contributing to user-visible failures, (2) evaluate the (potential) effectiveness of various techniques for preventing and mitigating service failure, and (3) build a fault model for service-level dependability and recovery benchmarks. Our initial results indicate that operator error and network problems are the leading contributors to user-visible failures, that failures in custom-written front-end software are significant, and that online testing and more thoroughly exposing and handling component failures would reduce failure rates in at least one service.

Published In

EW 10: Proceedings of the 10th workshop on ACM SIGOPS European workshop
July 2002
258 pages
ISBN: 9781450378062
DOI: 10.1145/1133373
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 37 of 37 submissions, 100%

Bibliometrics

  • Downloads (last 12 months): 3
  • Downloads (last 6 weeks): 0

Reflects downloads up to 15 Feb 2025.

Cited By

  • (2016) Measuring the Resiliency of Extreme-Scale Computing Environments. Principles of Performance and Reliability Modeling and Evaluation, pp. 609-655. DOI: 10.1007/978-3-319-30599-8_24. Online publication date: 2-Apr-2016.
  • (2014) Reliability-Based Design Optimization for Cloud Migration. IEEE Transactions on Services Computing 7:2, pp. 223-236. DOI: 10.1109/TSC.2013.38. Online publication date: Apr-2014.
  • (2014) Lessons Learned from the Analysis of System Failures at Petascale. Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 610-621. DOI: 10.1109/DSN.2014.62. Online publication date: 23-Jun-2014.
  • (2013) Log-Based Failure Analysis of Complex Systems: Methodology and Relevant Applications. Innovative Technologies for Dependable OTS-Based Critical Systems, pp. 203-215. DOI: 10.1007/978-88-470-2772-5_15. Online publication date: 24-Jan-2013.
  • (2011) Integrated management of network and security devices in IT infrastructures. Proceedings of the 7th International Conference on Network and Services Management, pp. 375-379. DOI: 10.5555/2147671.2147738. Online publication date: 24-Oct-2011.
  • (2010) Service engineering. Service research challenges and solutions for the future internet, pp. 271-337. DOI: 10.5555/1985668.1985676. Online publication date: 1-Jan-2010.
  • (2010) A survey of system configuration tools. Proceedings of the 24th international conference on Large installation system administration, pp. 1-8. DOI: 10.5555/1924976.1924977. Online publication date: 7-Nov-2010.
  • (2010) Service Engineering. Service Research Challenges and Solutions for the Future Internet, pp. 271-337. DOI: 10.1007/978-3-642-17599-2_8. Online publication date: 2010.
  • (2002) Toward recovery-oriented computing. Proceedings of the 28th international conference on Very Large Data Bases, pp. 873-876. DOI: 10.5555/1287369.1287443. Online publication date: 20-Aug-2002.
