
Connecting the dots: anomaly and discontinuity detection in large-scale systems

  • Original Research
  • Published in: Journal of Ambient Intelligence and Humanized Computing

Abstract

Cloud providers and data centers rely heavily on forecasts to accurately predict future workload. This information helps them virtualize appropriately and provision their infrastructure cost-effectively. The accuracy of a forecast depends greatly on the quality of the performance data fed to the underlying algorithms. One of the fundamental problems analysts face in preparing data for forecasting is the timely identification of data discontinuities. A discontinuity is an abrupt change in the time-series pattern of a performance counter that persists but does not recur. Analysts need to identify discontinuities in performance data so that they can (a) remove the discontinuities from the data before building a forecast model and (b) retrain an existing forecast model on the performance data from the point in time where a discontinuity occurred. Several approaches and tools exist to help analysts identify anomalies in performance data. However, no automated approach exists to assist data center operators in detecting discontinuities. In this paper, we present and evaluate an approach to help data center analysts and cloud providers automatically detect discontinuities. A case study on performance data obtained from a large cloud provider, together with performance tests conducted on an open source benchmark system, shows that our approach achieves an average precision of 84% and recall of 88%. The approach requires no domain knowledge to operate.
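To make the notion of a discontinuity concrete: it is a level shift in a performance counter that persists, as opposed to a transient spike (an anomaly). The sketch below is not the paper's method; it is a minimal, hypothetical illustration of detecting such a shift by comparing the means of two adjacent sliding windows against their pooled within-window variance. A production detector would additionally need to verify that the new level persists and does not recur.

```python
import math
import statistics

def detect_discontinuities(series, window=5, threshold=3.0):
    """Flag candidate discontinuity points in a performance-counter series.

    Index i is flagged when the mean of the `window` points after i
    differs from the mean of the `window` points before i by more than
    `threshold` times the pooled within-window standard deviation.
    This is an illustrative sketch, not the approach from the paper.
    """
    flags = []
    for i in range(window, len(series) - window + 1):
        before = series[i - window:i]
        after = series[i:i + window]
        # Pooled std from the variance *within* each window, so the
        # shift itself does not inflate the noise estimate.
        pooled = math.sqrt(
            (statistics.pvariance(before) + statistics.pvariance(after)) / 2
        )
        if pooled == 0:
            continue
        if abs(statistics.mean(after) - statistics.mean(before)) > threshold * pooled:
            flags.append(i)
    return flags

# A counter oscillating around 10 that shifts to ~50 at index 20:
data = [10, 11] * 10 + [50, 51] * 10
print(detect_discontinuities(data))  # the shift point, index 20
```

Note that this window test alone cannot distinguish a persistent shift from the leading edge of a long transient anomaly; that distinction, which the paper's problem statement hinges on, requires inspecting whether the series later returns to its old level.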





Acknowledgments

We are grateful to C.A. Technologies Inc. for supporting and funding this research, and for providing access to the production data used in our case study. The findings and opinions expressed in this paper are those of the authors and do not necessarily represent or reflect those of C.A. Technologies and/or its subsidiaries and affiliates. This work was funded in part by a Collaborative Research and Development grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information


Corresponding author

Correspondence to Haroon Malik.


About this article


Cite this article

Malik, H., Davis, I.J., Godfrey, M.W. et al. Connecting the dots: anomaly and discontinuity detection in large-scale systems. J Ambient Intell Human Comput 7, 509–522 (2016). https://doi.org/10.1007/s12652-016-0381-4

