research-article

Improving wide-area distributed system availability

Author:
Bruno Wassermann

University College London, London, UK


ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, May 2010, Pages 347–348. https://doi.org/10.1145/1810295.1810386

Published: 01 May 2010

ABSTRACT

The Software-as-a-Service (SaaS) paradigm and the corresponding service-oriented technologies have simplified the development of larger, more complex software systems that routinely span administrative and organisational boundaries. These systems inhabit a complex operating environment that poses numerous threats to the dependability of service compositions, including many system-level failures whose causes are difficult and time-consuming to determine. Vulnerabilities to these failures are hard to detect before an application is deployed into production, and applications are currently not well equipped to handle them effectively. The result is lengthy downtime of production systems and hence low availability. The goal of this PhD is to increase the availability of such systems by eliminating as many failures as possible before deployment and by helping administrators diagnose the causes of failures more efficiently. We propose a novel monitoring technique and apply failure injection techniques that target these difficult failures and enable separate administrative domains to cooperate in handling them. Furthermore, we investigate the extent to which these systems can be equipped to be self-diagnosing.

Published in
ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2
May 2010
554 pages
ISBN: 9781605587196
DOI: 10.1145/1810295
General Chairs: Jeff Kramer (Imperial College, London, UK), Judith Bishop (Microsoft Research, Redmond)
Program Chairs: Prem Devanbu (University of California at Davis), Sebastian Uchitel (University of Buenos Aires, Argentina and Imperial College London, UK)
Copyright © 2010 ACM
Publisher
Association for Computing Machinery
New York, NY, United States