ABSTRACT
Cloud infrastructures provide a rich set of management tasks that operate computing, storage, and networking resources in the cloud. Monitoring the execution of these tasks is crucial for cloud providers to promptly find and understand problems that compromise cloud availability. Such monitoring is challenging, however, because each execution involves multiple distributed service components. CloudSeer enables effective workflow monitoring with a lightweight, non-intrusive approach that works purely on the interleaved logs already widely available in cloud infrastructures. CloudSeer first builds an automaton for the workflow of each management task from normal executions, and then checks incoming log messages against the set of automata in a streaming manner to detect workflow divergences. Divergences found during checking indicate potential execution problems, which may or may not be accompanied by error log messages. For each potential problem, CloudSeer outputs the context needed for further diagnosis, including the affected task automaton and the related log messages that hint at where the problem occurs. Our experiments on OpenStack, a popular open-source cloud infrastructure, show that CloudSeer's efficiency and problem-detection capability are suitable for online monitoring.
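The checking idea described in the abstract can be illustrated with a small sketch. It is not CloudSeer's algorithm: it assumes each workflow is a fixed linear sequence of message templates (rather than an automaton mined from normal executions) and that every log message carries a request identifier, which the abstract does not state. The names `WORKFLOWS`, `TaskAutomaton`, and `Monitor`, and all message strings, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical workflow models: one ordered list of message templates per task.
# A real system would learn these from logs of normal executions.
WORKFLOWS = {
    "boot_vm": [
        "api: accept boot request",
        "scheduler: select host",
        "compute: spawn instance",
        "compute: instance active",
    ],
    "delete_vm": [
        "api: accept delete request",
        "compute: terminate instance",
        "compute: instance deleted",
    ],
}


@dataclass
class TaskAutomaton:
    """Linear automaton tracking one in-flight execution of a task."""
    task: str
    steps: list
    position: int = 0
    seen: list = field(default_factory=list)

    def expected(self) -> Optional[str]:
        return self.steps[self.position] if self.position < len(self.steps) else None

    def accepted(self) -> bool:
        return self.position == len(self.steps)


class Monitor:
    """Checks an interleaved message stream against the workflow models."""

    def __init__(self, workflows: dict):
        self.workflows = workflows
        self.active = {}  # request id (assumed to be present in each message) -> automaton

    def feed(self, request_id: str, message: str) -> Optional[dict]:
        """Consume one message; return a divergence report, or None if it fits."""
        automaton = self.active.get(request_id)
        if automaton is None:
            # Start a new automaton if the message opens a known workflow.
            for task, steps in self.workflows.items():
                if steps[0] == message:
                    automaton = TaskAutomaton(task, steps)
                    self.active[request_id] = automaton
                    break
            else:
                return {"request": request_id, "task": None,
                        "expected": None, "got": message, "context": []}
        if message != automaton.expected():
            # Divergence: report the affected task and the messages seen so far.
            del self.active[request_id]
            return {"request": request_id, "task": automaton.task,
                    "expected": automaton.expected(), "got": message,
                    "context": list(automaton.seen)}
        automaton.position += 1
        automaton.seen.append(message)
        if automaton.accepted():
            del self.active[request_id]
        return None

    def unfinished(self) -> list:
        """Request ids whose workflows have started but not completed."""
        return sorted(self.active)


# Two executions interleaved in one stream; r1 skips the "spawn instance" step.
stream = [
    ("r1", "api: accept boot request"),
    ("r2", "api: accept delete request"),
    ("r1", "scheduler: select host"),
    ("r2", "compute: terminate instance"),
    ("r1", "compute: instance active"),
    ("r2", "compute: instance deleted"),
]

monitor = Monitor(WORKFLOWS)
reports = [r for r in (monitor.feed(rid, msg) for rid, msg in stream) if r]
for report in reports:
    print(report)
```

Note that the divergence for `r1` is flagged without any error-level message in the stream, which mirrors the abstract's point that problems may not be accompanied by error logs. A timeout on the entries returned by `unfinished()` would be one way to catch executions that silently stall, again as an assumption rather than the paper's stated mechanism.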
CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. ASPLOS '16.