GRETEL: Lightweight Fault Localization for OpenStack

Authors:

Mohan Dhawan

CoNEXT '16: Proceedings of the 12th International on Conference on emerging Networking EXperiments and Technologies

Pages 413 - 426

https://doi.org/10.1145/2999572.2999600

Published: 06 December 2016

Abstract

Like any other distributed system, cloud management stacks such as OpenStack are susceptible to faults whose root cause is often hard to diagnose and may take hours or days to fix. We present GRETEL, a system that leverages non-intrusive system monitoring to expedite root cause analysis of both operational and performance faults manifesting in OpenStack operations. GRETEL uses unique operational fingerprints to quickly identify faulty operations at runtime. GRETEL is accurate in its diagnosis, achieving >98% precision in identifying the faulty operation with very few false positives and negatives, even under stress. GRETEL is lightweight and orders of magnitude faster than prior work, sustaining a throughput of ~77 Mbps.

Index Terms

GRETEL: Lightweight Fault Localization for OpenStack
1. Networks
  1. Network services
  2. Network types
    1. Data center networks
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Fault tree analysis

Published In

CoNEXT '16: Proceedings of the 12th International on Conference on emerging Networking EXperiments and Technologies

December 2016

524 pages

ISBN: 9781450342926

DOI: 10.1145/2999572

General Chairs: Athina Markopoulou (University of California, Irvine, USA), Michalis Faloutsos (University of California, Riverside, USA)

Program Chairs: Vyas Sekar (Carnegie Mellon University, USA), Dejan Kostic (KTH Royal Institute of Technology, Sweden)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGCOMM: ACM Special Interest Group on Data Communication

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CoNEXT '16

Sponsor:

SIGCOMM

CoNEXT '16: The 12th International Conference on emerging Networking EXperiments and Technologies

December 12 - 15, 2016

Irvine, California, USA

Acceptance Rates

CoNEXT '16 Paper Acceptance Rate: 30 of 160 submissions, 19%

Overall Acceptance Rate: 198 of 789 submissions, 25%
