ABSTRACT
The Software-as-a-Service (SaaS) paradigm and the corresponding service-oriented technologies have simplified the development of larger, more complex software systems that routinely span administrative and organisational boundaries. These systems inhabit a complex operating environment with numerous threats to the dependability of service compositions. These threats include many system-level failures whose causes are difficult and time-consuming to determine. Vulnerabilities to these failures are difficult to detect before an application is deployed into production, and applications are currently not well equipped to handle them effectively. The result is lengthy downtime of production systems and hence low availability. The goal of this PhD is to increase the availability of such systems by eliminating as many failures as possible before deployment and by helping administrators diagnose the causes of those that remain more efficiently. We propose a novel monitoring technique and apply failure injection techniques that target these difficult failures and enable separate administrative domains to cooperate in handling them. Furthermore, we investigate the extent to which we can equip these systems to be self-diagnosing.
- Improving wide-area distributed system availability