ABSTRACT
In this paper, we describe Roots, a system for automatically identifying the "root cause" of performance anomalies in web applications deployed in Platform-as-a-Service (PaaS) clouds. Roots does not require application-level instrumentation. Instead, it uses a combination of metadata injection and platform-level instrumentation to track the events that application requests trigger within the PaaS cloud.
We describe the extensible architecture of Roots, a prototype implementation of the system, and a statistical methodology for performance anomaly detection and diagnosis. We evaluate the efficacy of Roots using a set of PaaS-hosted web applications, and detail the performance overhead and scalability of the implementation.