skip to main content
10.1145/2872362.2872407acmconferencesArticle/Chapter ViewAbstractPublication PagesasplosConference Proceedingsconference-collections
research-article

CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs

Authors Info & Claims
Published:25 March 2016Publication History

ABSTRACT

Cloud infrastructures provide a rich set of management tasks that operate computing, storage, and networking resources in the cloud. Monitoring the executions of these tasks is crucial for cloud providers to promptly find and understand problems that compromise cloud availability. However, such monitoring is challenging because there are multiple distributed service components involved in the executions. CloudSeer enables effective workflow monitoring. It takes a lightweight non-intrusive approach that purely works on interleaved logs widely existing in cloud infrastructures. CloudSeer first builds an automaton for the workflow of each management task based on normal executions, and then it checks log messages against a set of automata for workflow divergences in a streaming manner. Divergences found during the checking process indicate potential execution problems, which may or may not be accompanied by error log messages. For each potential problem, CloudSeer outputs necessary context information including the affected task automaton and related log messages hinting where the problem occurs to help further diagnosis. Our experiments on OpenStack, a popular open-source cloud infrastructure, show that CloudSeer's efficiency and problem-detection capability are suitable for online monitoring.

References

  1. 2013 Path to an OpenStack-Powered Cloud Survey Results Highlight Aggressive OpenStack Adoption Plans by Enterprises. http://www.redhat.com/en/about/press-releases/2013-path-to-an-openstack-powered-cloud-survey-results-highlight-aggressive-openstack-adoption-plans-by-enterprises.Google ScholarGoogle Scholar
  2. Amazon CloudWatch. https://aws.amazon.com/cloudwatch/.Google ScholarGoogle Scholar
  3. Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/.Google ScholarGoogle Scholar
  4. Apache HTrace. http://htrace.incubator.apache.org/.Google ScholarGoogle Scholar
  5. Architecture. OpenStack Installation Guide, http://docs.openstack.org/havana/install-guide/install/apt/content/ch_overview.html.Google ScholarGoogle Scholar
  6. CirrOS: A Tiny Cloud Guest. https://launchpad.net/cirros.Google ScholarGoogle Scholar
  7. Elasticsearch. http://www.elasticsearch.org/overview/elasticsearch/.Google ScholarGoogle Scholar
  8. Logging and Monitoring. OpenStack Operations Guide, http://docs.openstack.org/openstack-ops/content/logging_monitoring.html.Google ScholarGoogle Scholar
  9. Logstash. http://www.elasticsearch.org/overview/logstash/.Google ScholarGoogle Scholar
  10. Microsoft Azure. http://azure.microsoft.com/en-us/.Google ScholarGoogle Scholar
  11. OpenStack. http://www.openstack.org/.Google ScholarGoogle Scholar
  12. Zipkin. http://zipkin.io/.Google ScholarGoogle Scholar
  13. P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for Request Extraction and Workload Modelling. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6, OSDI'04, pages 18--18, Berkeley, CA, USA, 2004. USENIX Association.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. I. Beschastnikh, Y. Brun, M. D. Ernst, and A. Krishnamurthy. Inferring Models of Concurrent Systems from Logs of Their Behavior with CSight. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 468--479, New York, NY, USA, 2014. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. I. Beschastnikh, Y. Brun, M. D. Ernst, A. Krishnamurthy, and T. E. Anderson. Mining Temporal Invariants from Partially Ordered Logs. ACM SIGOPS Operating Systems Review, 45(3):39--46, Jan. 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Chow, D. Meisner, J. Flinn, D. Peek, and T. F. Wenisch. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI'14, pages 217--231, Berkeley, CA, USA, 2014. USENIX Association.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. T. Do, M. Hao, T. Leesatapornwongsa, T. Patana-anake, and H. S. Gunawi. Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 14:1--14:14, New York, NY, USA, 2013. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Q. Fu, J.-G. Lou, Y. Wang, and J. Li. Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, pages 149--158, Washington, DC, USA, 2009. IEEE Computer Society.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Joshi, H. S. Gunawi, and K. Sen. PREFAIL: A Programmable Tool for Multiple-Failure Injection. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 171--188, New York, NY, USA, 2011. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. X. Ju, L. Soares, K. G. Shin, K. D. Ryu, and D. Da Silva. On Fault Resilience of OpenStack. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC '13, pages 2:1--2:16, New York, NY, USA, 2013. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. K. Kc and X. Gu. ELT: Efficient Log-based Troubleshooting System for Cloud Computing Infrastructures. In 2011 30th IEEE Symposium on Reliable Distributed Systems (SRDS), pages 11--20, Oct 2011.Google ScholarGoogle Scholar
  22. D. Lo, L. Mariani, and M. Pezzè. Automatic Steering of Behavioral Model Inference. In Proceedings of the the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC/FSE '09, pages 345--354, New York, NY, USA, 2009. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J.-G. Lou, Q. Fu, S. Yang, J. Li, and B. Wu. Mining Program Workflow from Interleaved Traces. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, pages 613--622, New York, NY, USA, 2010. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J.-G. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li. Mining Invariants from Console Logs for System Problem Detection. In Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference, USENIXATC'10, pages 24--24, Berkeley, CA, USA, 2010. USENIX Association.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. K. Nagaraj, C. Killian, and J. Neville. Structured Comparative Analysis of Systems Logs to Diagnose Performance Problems. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 26--26, Berkeley, CA, USA, 2012. USENIX Association.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. H. Nguyen, D. J. Dean, K. Kc, and X. Gu. Insight: In-situ Online Service Failure Path Inference in Production Computing Infrastructures. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'14, pages 269--280, Berkeley, CA, USA, 2014. USENIX Association.Google ScholarGoogle Scholar
  27. B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical report, Google, Inc., 2010.Google ScholarGoogle Scholar
  28. N. Walkinshaw and K. Bogdanov. Inferring Finite-State Models with Temporal Constraints. In Proceedings of the 2008 23rd IEEE/ACM International Conference on Automated Software Engineering, ASE '08, pages 248--257, Washington, DC, USA, 2008. IEEE Computer Society.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting Large-Scale System Problems by Mining Console Logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 117--132, New York, NY, USA, 2009. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. SherLog: Error Diagnosis by Connecting Clues from Run-time Logs. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pages 143--154, New York, NY, USA, 2010. ACM.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. Yuan, S. Park, P. Huang, Y. Liu, M. M. Lee, X. Tang, Y. Zhou, and S. Savage. Be Conservative: Enhancing Failure Diagnosis with Proactive Logging. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI'12, pages 293--306, Berkeley, CA, USA, 2012. USENIX Association.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. X. Zhao, Y. Zhang, D. Lion, M. F. Ullah, Y. Luo, D. Yuan, and M. Stumm. lprof: A Non-intrusive Request Flow Profiler for Distributed Systems. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI'14, pages 629--644, Berkeley, CA, USA, 2014. USENIX Association.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              ASPLOS '16: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems
              March 2016
              824 pages
              ISBN:9781450340915
              DOI:10.1145/2872362
              • General Chair:
              • Tom Conte,
              • Program Chair:
              • Yuanyuan Zhou

              Copyright © 2016 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 25 March 2016

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article

              Acceptance Rates

              ASPLOS '16 Paper Acceptance Rate53of232submissions,23%Overall Acceptance Rate535of2,713submissions,20%

              Upcoming Conference

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader