DOI: 10.1145/1755913.1755924

Barricade: defending systems against operator mistakes

Published: 13 April 2010

Abstract

In this paper, we propose a management framework for protecting large computer systems against operator mistakes. By detecting and confining mistakes to isolated portions of the managed system, our framework facilitates correct operation even by inexperienced operators. We built a prototype management system called Barricade based on our framework. We evaluate Barricade by deploying it for two different systems, a prototype Internet service and an enterprise computer infrastructure, and conducting experiments with 20 volunteer operators. Our results are very promising. For example, we show that Barricade can detect and contain 39 out of the 43 mistakes that we observed in 49 live operator experiments performed with our Internet service.

Published In

EuroSys '10: Proceedings of the 5th European conference on Computer systems
April 2010
388 pages
ISBN: 9781605585772
DOI: 10.1145/1755913

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. manageability
  2. operator mistakes

Qualifiers

  • Research-article

Conference

EuroSys '10
EuroSys '10: Fifth EuroSys Conference 2010
April 13 - 16, 2010
Paris, France

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 0

Download counts reflect activity up to 20 Feb 2025.

Cited By

  • (2023) Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. Proceedings of the 29th Symposium on Operating Systems Principles, pp. 96-112. DOI: 10.1145/3600006.3613161. Online publication date: 23-Oct-2023
  • (2021) Recovery-Oriented Computing. From Traditional Fault Tolerance to Blockchain, pp. 63-101. DOI: 10.1002/9781119682127.ch3. Online publication date: 18-Jun-2021
  • (2015) Systems Approaches to Tackling Configuration Errors. ACM Computing Surveys 47:4, pp. 1-41. DOI: 10.1145/2791577. Online publication date: 21-Jul-2015
  • (2015) Learning from Before and After Recovery to Detect Latent Misconfiguration. Proceedings of the 2015 IEEE 39th Annual Computer Software and Applications Conference - Volume 03, pp. 141-148. DOI: 10.1109/COMPSAC.2015.222. Online publication date: 1-Jul-2015
  • (2014) A cost effective and preventive approach to avoid integration faults caused by mistakes in distribution of software components. Advances in Software Engineering 2014, pp. 7-7. DOI: 10.1155/2014/439462. Online publication date: 1-Jan-2014
  • (2014) An Iterative Approach to Trustable Systems Management Automation and Fault Handling. Journal of Network and Systems Management 22:3, pp. 366-395. DOI: 10.1007/s10922-013-9295-z. Online publication date: 1-Jul-2014
  • (2014) Recovery-Oriented Computing. Building Dependable Distributed Systems, pp. 57-95. DOI: 10.1002/9781118912744.ch3. Online publication date: Mar-2014
  • (2013) Characterizing configuration problems in Java EE application servers: An empirical study with GlassFish and JBoss. 2013 IEEE 24th International Symposium on Software Reliability Engineering (ISSRE), pp. 198-207. DOI: 10.1109/ISSRE.2013.6698919. Online publication date: Nov-2013
  • (2013) Detecting Transient Bottlenecks in n-Tier Applications through Fine-Grained Analysis. Proceedings of the 2013 IEEE 33rd International Conference on Distributed Computing Systems, pp. 31-40. DOI: 10.1109/ICDCS.2013.17. Online publication date: 8-Jul-2013
  • (2012) iTrack: Correlating user activity with system data. 2012 IEEE Network Operations and Management Symposium, pp. 1068-1074. DOI: 10.1109/NOMS.2012.6212031. Online publication date: Apr-2012
