
CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs

Authors:
Xiao Yu (North Carolina State University, Raleigh, NC, USA)
Pallavi Joshi (NEC Laboratories America, Princeton, NJ, USA)
Jianwu Xu (NEC Laboratories America, Princeton, NJ, USA)
Guoliang Jin (North Carolina State University, Raleigh, NC, USA)
Hui Zhang (NEC Laboratories America, Princeton, NJ, USA)
Guofei Jiang (NEC Laboratories America, Princeton, NJ, USA)

ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, March 2016, Pages 489–502. https://doi.org/10.1145/2872362.2872407

Published: 25 March 2016


ABSTRACT

Cloud infrastructures provide a rich set of management tasks that operate computing, storage, and networking resources in the cloud. Monitoring the executions of these tasks is crucial for cloud providers to promptly find and understand problems that compromise cloud availability. However, such monitoring is challenging because there are multiple distributed service components involved in the executions. CloudSeer enables effective workflow monitoring. It takes a lightweight non-intrusive approach that purely works on interleaved logs widely existing in cloud infrastructures. CloudSeer first builds an automaton for the workflow of each management task based on normal executions, and then it checks log messages against a set of automata for workflow divergences in a streaming manner. Divergences found during the checking process indicate potential execution problems, which may or may not be accompanied by error log messages. For each potential problem, CloudSeer outputs necessary context information including the affected task automaton and related log messages hinting where the problem occurs to help further diagnosis. Our experiments on OpenStack, a popular open-source cloud infrastructure, show that CloudSeer's efficiency and problem-detection capability are suitable for online monitoring.
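The two phases described in the abstract (building a task automaton from normal executions, then checking an interleaved log stream against it) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes each log message has already been reduced to a hypothetical `(task_id, event)` pair, that every normal run contains the same events exactly once, and that a dictionary of "must precede" dependencies stands in for the automaton. The function names and event names are invented for illustration. The paper's approach works on raw interleaved logs, and this sketch does not attempt to associate messages with tasks or to handle timeouts.

```python
from itertools import combinations


def build_automaton(normal_runs):
    """Derive 'a must precede b' dependencies that hold in every normal run.

    Assumption: all runs contain the same events, each exactly once.
    Returns a dict mapping each event to the set of events that must come first.
    """
    events = set(normal_runs[0])
    deps = {event: set() for event in events}
    for a, b in combinations(sorted(events), 2):
        if all(run.index(a) < run.index(b) for run in normal_runs):
            deps[b].add(a)
        elif all(run.index(b) < run.index(a) for run in normal_runs):
            deps[a].add(b)
        # Otherwise the order varies across runs, so no dependency is recorded.
    return deps


def check_stream(deps, stream):
    """Check interleaved (task_id, event) messages one at a time.

    Returns a list of (task_id, event, reason) divergences.
    """
    seen = {}          # task_id -> set of events observed so far
    divergences = []
    for task_id, event in stream:
        observed = seen.setdefault(task_id, set())
        if event not in deps:
            divergences.append((task_id, event, "unexpected event"))
            continue
        missing = deps[event] - observed
        if missing:
            divergences.append((task_id, event, f"missing {sorted(missing)}"))
        observed.add(event)
    # A task that never produced every expected event is also reported.
    for task_id, observed in seen.items():
        remaining = set(deps) - observed
        if remaining:
            divergences.append((task_id, None, f"incomplete: {sorted(remaining)}"))
    return divergences


# Hypothetical event names for a VM-boot task; "network" and "spawn"
# appear in either order across normal runs.
NORMAL_RUNS = [
    ["accept", "schedule", "network", "spawn", "active"],
    ["accept", "schedule", "spawn", "network", "active"],
]

# Two task instances interleaved in one stream; vm2 never logs "spawn".
STREAM = [
    ("vm1", "accept"), ("vm2", "accept"),
    ("vm1", "schedule"), ("vm2", "schedule"),
    ("vm1", "spawn"), ("vm2", "network"),
    ("vm1", "network"), ("vm1", "active"),
    ("vm2", "active"),
]

deps = build_automaton(NORMAL_RUNS)
divergences = check_stream(deps, STREAM)
for item in divergences:
    print(item)
```

In this sketch a divergence is reported without any error message appearing in the stream, which mirrors the abstract's point that execution problems may or may not be accompanied by error log messages. Each reported tuple names the affected task and the event at which the mismatch was noticed, a rough analogue of the context information the abstract says CloudSeer outputs.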


Index Terms

1. Networks
  1. Network services
    1. Cloud computing
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Operational analysis
      2. Software defect analysis
        1. Software testing and debugging
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance
      2. Software reliability


Published in
ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
March 2016, 824 pages
ISBN: 9781450340915
DOI: 10.1145/2872362
General Chair: Tom Conte (Georgia Tech, USA)
Program Chair: Yuanyuan Zhou (University of California, San Diego, USA)

ACM SIGPLAN Notices, Volume 51, Issue 4 (ASPLOS '16)
April 2016, 774 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2954679
Editor: Andy Gill (University of Kansas, Lawrence, KS)

ACM SIGARCH Computer Architecture News, Volume 44, Issue 2 (ASPLOS '16)
May 2016, 774 pages
ISSN: 0163-5964
DOI: 10.1145/2980024
Editor: Doug DeGroot
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cloud infrastructures
distributed systems
log analysis
workflow monitoring
Qualifiers
- research-article
Acceptance Rates
ASPLOS '16 paper acceptance rate: 53 of 232 submissions, 23%. Overall acceptance rate: 535 of 2,713 submissions, 20%.