Detecting performance anomalies in large-scale software systems using entropy

Malik, Haroon; Shakshuki, Elhadi M.

doi:10.1007/s00779-017-1036-y

Original Article
Published: 17 July 2017

Personal and Ubiquitous Computing, volume 21, pages 1127–1137 (2017)

Haroon Malik¹ &
Elhadi M. Shakshuki²

Abstract

Large-scale software systems (LSSs) are composed of hundreds of subsystems that interact with each other in unforeseen and complex ways. The operators of these LSSs closely monitor thousands of metrics (performance counters) to identify performance anomalies quickly, before they lead to a catastrophe. Existing monitoring tools and methodologies have not kept pace with the rapid growth and inherent complexity of these LSSs; hence, they are ineffective in helping practitioners pinpoint performance anomalies. We propose two methodologies that use an entropy measure to assist practitioners and operators of LSSs in quickly detecting both system-wide anomalies and the underlying localized subsystem anomalies. Our performance tests, conducted on an open-source benchmark system, reveal that the proposed methodologies are robust in pinpointing anomalies, do not require any domain knowledge to operate, and avoid overloading practitioners with information.
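
The abstract does not say how the entropy measure is computed, so the following is only an illustrative sketch and not the authors' method. It assumes Shannon entropy over an equal-width histogram of one performance counter's samples, and it flags a change by comparing the entropy of a current window against a baseline window. The function names, the bin count, and the 0–100 value range (for example, a CPU utilization percentage) are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def shannon_entropy(samples, bins=10, lo=0.0, hi=100.0):
    """Shannon entropy (in bits) of samples bucketed into equal-width bins.

    Assumption for this sketch: counter values lie roughly in [lo, hi];
    values outside that range are clamped into the first or last bin.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    width = (hi - lo) / bins
    # Map each sample to a bin index, clamped to [0, bins - 1].
    counts = Counter(
        min(bins - 1, max(0, int((x - lo) / width))) for x in samples
    )
    n = len(samples)
    # H = -sum(p * log2(p)) over the non-empty bins.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_shift(baseline, current, **kwargs):
    """Absolute difference in entropy between two windows of one counter."""
    return abs(shannon_entropy(current, **kwargs) - shannon_entropy(baseline, **kwargs))


if __name__ == "__main__":
    steady = [5.0] * 8                   # counter stuck in a single bin
    erratic = [5.0, 15.0, 25.0, 35.0]    # counter spread over four bins
    print(shannon_entropy(steady))       # no uncertainty in a single bin
    print(shannon_entropy(erratic))      # four equally likely bins
    print(entropy_shift(steady, erratic))
```

In a sketch like this, a large entropy shift on one counter would point at a localized change, while aggregating shifts across many counters would be one possible way to look for system-wide changes. Any threshold for what counts as "large" would have to be chosen empirically.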

Author information

Authors and Affiliations

Weisberg Division of Computer Science, Marshall University, Huntington, WV, USA
Haroon Malik
Jodrey School of Computer Science, Acadia University, Wolfville, NS, Canada
Elhadi M. Shakshuki

Corresponding author

Correspondence to Haroon Malik.

About this article

Cite this article

Malik, H., Shakshuki, E.M. Detecting performance anomalies in large-scale software systems using entropy. Pers Ubiquit Comput 21, 1127–1137 (2017). https://doi.org/10.1007/s00779-017-1036-y

Received: 15 November 2016
Accepted: 31 March 2017
Published: 17 July 2017
Issue Date: December 2017
DOI: https://doi.org/10.1007/s00779-017-1036-y
