Barricade: defending systems against operator mistakes

Authors:

Fabio Oliveira,

Ricardo Bianchini,

Richard P. Martin,

Thu D. Nguyen

EuroSys '10: Proceedings of the 5th European conference on Computer systems

Pages 83 - 96

https://doi.org/10.1145/1755913.1755924

Published: 13 April 2010

Abstract

In this paper, we propose a management framework for protecting large computer systems against operator mistakes. By detecting and confining mistakes to isolated portions of the managed system, our framework facilitates correct operation even by inexperienced operators. We built a prototype management system called Barricade based on our framework. We evaluate Barricade by deploying it for two different systems, a prototype Internet service and an enterprise computer infrastructure, and conducting experiments with 20 volunteer operators. Our results are very promising. For example, we show that Barricade can detect and contain 39 out of the 43 mistakes that we observed in 49 live operator experiments performed with our Internet service.

Index Terms

Barricade: defending systems against operator mistakes
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. File systems management
      2. System management

Published In

EuroSys '10: Proceedings of the 5th European conference on Computer systems

April 2010

388 pages

ISBN:9781605585772

DOI:10.1145/1755913

General Chair: Christine Morin (INRIA Rennes, France)

Program Chair: Gilles Muller (INRIA/LIP6, France)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

EuroSys '10

Sponsor: SIGOPS

EuroSys '10: Fifth EuroSys Conference 2010

April 13 - 16, 2010

Paris, France

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
