DOI: 10.1145/1075405.1075415
Article

Combining statistical monitoring and predictable recovery for self-management

Published: 31 October 2004

Abstract

Complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications present the opportunity for a new approach to keeping them running in the face of many common recoverable failures. The core of the approach is anomaly detection and localization based on statistical machine learning techniques. Unlike previous approaches, we propose anomaly detection and pattern mining not only for operational statistics such as mean response time, but also for structural behaviors of the system: what parts of the system, in what combinations, are being exercised in response to different kinds of external stimuli. In addition, rather than building baseline models a priori, we extract them by observing the behavior of the system over a short period of time during normal operation. We explain the necessary underlying assumptions and why they can be realized by systems research, report on some early successes using the approach, describe benefits of the approach that make it competitive as a path toward self-managing systems, and outline some research challenges. Our hope is that this approach will enable "new science" in the design of self-managing systems by allowing the rapid and widespread application of statistical learning theory (SLT) techniques to problems of system dependability.
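
As a rough illustration of the idea in the abstract, the sketch below learns a baseline of structural behavior from traces observed during normal operation and then flags requests whose component combination was rare or unseen. It is not the authors' implementation: the trace format, the function names `learn_baseline` and `is_anomalous`, and the `min_share` threshold are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def learn_baseline(normal_traces):
    """Count how often each component combination served each request type.

    normal_traces: iterable of (request_type, components) pairs observed
    during normal operation (a hypothetical trace format).
    """
    baseline = defaultdict(Counter)
    for request_type, components in normal_traces:
        baseline[request_type][frozenset(components)] += 1
    return baseline


def is_anomalous(baseline, request_type, components, min_share=0.05):
    """Flag a trace whose component combination was rare or unseen."""
    seen = baseline.get(request_type)
    if not seen:
        return True  # request type never observed during normal operation
    total = sum(seen.values())
    share = seen[frozenset(components)] / total
    return share < min_share  # min_share is an illustrative threshold


# Baseline extracted from a short window of normal operation.
normal = [("checkout", ["web", "cart", "db"])] * 20
baseline = learn_baseline(normal)

usual = is_anomalous(baseline, "checkout", ["web", "cart", "db"])
skipped_cart = is_anomalous(baseline, "checkout", ["web", "db"])
```

Here `usual` is `False` because the combination matches the baseline, while `skipped_cart` is `True` because no normal checkout request bypassed the cart component. A real detector would likely use a statistical test rather than a fixed share threshold.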

      Published In

      WOSS '04: Proceedings of the 1st ACM SIGSOFT workshop on Self-managed systems
      October 2004
      119 pages
      ISBN:1581139896
      DOI:10.1145/1075405
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Conference

      WOSS04: Workshop on Self-Healing Systems (co-located with ACM SIGSOFT 2004)
      October 31 - November 1, 2004
      Newport Beach, California

      Cited By

      • (2020) Toward ML-centric cloud platforms. Communications of the ACM 63:2, pp. 50-59. DOI: 10.1145/3364684. Online publication date: 22-Jan-2020
      • (2019) Fast. Efficient Performance Predictions for Big Data Applications. 2019 IEEE 22nd International Symposium on Real-Time Distributed Computing (ISORC), pp. 126-133. DOI: 10.1109/ISORC.2019.00034. Online publication date: May-2019
      • (2018) SaaS software performance issue identification using HMRF‐MAP framework. Software: Practice and Experience 48:11, pp. 2000-2018. DOI: 10.1002/spe.2607. Online publication date: 18-Jul-2018
      • (2015) Node anomaly detection for homogeneous distributed environments. Expert Systems with Applications: An International Journal 42:20, pp. 7012-7025. DOI: 10.1016/j.eswa.2015.04.037. Online publication date: 15-Nov-2015
      • (2012) PREPARE. Proceedings of the 2012 IEEE 32nd International Conference on Distributed Computing Systems, pp. 285-294. DOI: 10.1109/ICDCS.2012.65. Online publication date: 18-Jun-2012
      • (2011) Multiple linear regression based parameter influencer model of a Self-healing network. 2011 Third International Workshop on Security and Communication Networks (IWSCN), pp. 45-51. DOI: 10.1109/IWSCN.2011.6827716. Online publication date: May-2011
      • (2011) A P2P-Based self-healing service for network maintenance. 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops, pp. 313-320. DOI: 10.1109/INM.2011.5990706. Online publication date: May-2011
      • (2011) A parameter-influencer based model of a Self-healing network. 2011 Third International Conference on Communication Systems and Networks (COMSNETS 2011), pp. 1-8. DOI: 10.1109/COMSNETS.2011.5716506. Online publication date: Jan-2011
      • (2010) Towards pro-active adaptation with confidence. Proceedings of the 2010 ICSE Workshop on Software Engineering for Adaptive and Self-Managing Systems, pp. 20-28. DOI: 10.1145/1808984.1808987. Online publication date: 3-May-2010
      • (2010) Enabling technologies for self-aware adaptive systems. 2010 NASA/ESA Conference on Adaptive Hardware and Systems, pp. 149-156. DOI: 10.1109/AHS.2010.5546266. Online publication date: Jun-2010