DOI: 10.1145/2999572.2999600
research-article

GRETEL: Lightweight Fault Localization for OpenStack

Published: 06 December 2016

Abstract

Like any other distributed system, cloud management stacks such as OpenStack are susceptible to faults whose root cause is often hard to diagnose and may take hours or days to fix. We present GRETEL, a system that leverages non-intrusive system monitoring to expedite root cause analysis of both operational and performance faults manifesting in OpenStack operations. GRETEL uses unique operational fingerprints to quickly identify faulty operations at runtime. GRETEL is accurate in its diagnosis, achieving >98% precision in identifying the faulty operation with very few false positives and negatives, even under conditions of stress. GRETEL is lightweight and orders of magnitude faster than prior work, sustaining a throughput of ~77 Mbps.
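
For readers unfamiliar with the metric, the precision figure above can be read under the standard definition: the fraction of operations flagged as faulty that really were faulty. The companion metric, recall, is the fraction of truly faulty operations that were flagged. The sketch below computes both from confusion counts. The function names and the counts are hypothetical, chosen only for illustration; they are not results or code from the paper.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of flagged operations that were truly faulty."""
    flagged = true_pos + false_pos
    if flagged == 0:
        raise ValueError("no operations were flagged as faulty")
    return true_pos / flagged


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of truly faulty operations that were flagged."""
    faulty = true_pos + false_neg
    if faulty == 0:
        raise ValueError("no faulty operations in the ground truth")
    return true_pos / faulty


if __name__ == "__main__":
    # Hypothetical counts (assumption): NOT taken from the paper.
    tp, fp, fn = 990, 10, 15
    print(f"precision = {precision(tp, fp):.3f}")
    print(f"recall    = {recall(tp, fn):.3f}")
```

A high precision with few false negatives, as the abstract reports, corresponds to both ratios being close to 1.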

Published In

CoNEXT '16: Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies
December 2016
524 pages
ISBN: 9781450342926
DOI: 10.1145/2999572
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fault localization
  2. network monitoring
  3. openstack

Conference

CoNEXT '16
Acceptance Rates

CoNEXT '16 Paper Acceptance Rate 30 of 160 submissions, 19%;
Overall Acceptance Rate 198 of 789 submissions, 25%
