Detecting performance anomalies in large-scale software systems using entropy

  • Original Article
  • Personal and Ubiquitous Computing

Abstract

Large-scale software systems (LSSs) are composed of hundreds of subsystems that interact with each other in unforeseen and complex ways. The operators of these LSSs closely monitor thousands of metrics (performance counters) to identify performance anomalies quickly, before they lead to a catastrophe. Existing monitoring tools and methodologies have not kept pace with the rapid growth and inherent complexity of these LSSs; hence, they are ineffective in helping practitioners pinpoint performance anomalies. We propose two methodologies that use an entropy measure to help practitioners and operators of LSSs quickly detect both system-wide anomalies and the underlying localized subsystem anomalies. Our performance tests, conducted on an open-source benchmark system, reveal that the proposed methodologies are robust in pinpointing anomalies, do not require any domain knowledge to operate, and avoid overloading practitioners with information.
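
To make the idea of an entropy measure over performance counters concrete, the following is a minimal sketch and not the authors' method, which this preview does not describe. It assumes a counter reported as a percentage (0 to 100), equal-width binning, and Shannon entropy in bits; the function name `shannon_entropy`, the bin count, and the sample values are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(samples, bins=4, lo=0.0, hi=100.0):
    """Shannon entropy (in bits) of samples after equal-width binning.

    Assumption for illustration: the counter lies in [lo, hi] and is
    discretized into `bins` equal-width bins. Out-of-range values are
    clamped to the first or last bin.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    width = (hi - lo) / bins
    counts = Counter(
        min(bins - 1, max(0, int((x - lo) / width))) for x in samples
    )
    n = len(samples)
    # H = sum over occupied bins of -p * log2(p)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical CPU-utilization readings (percent) for one subsystem.
baseline = [22, 24, 21, 23, 20, 24, 22, 23]   # steady behavior
suspect = [22, 95, 10, 60, 24, 88, 35, 71]    # erratic behavior

baseline_entropy = shannon_entropy(baseline)
suspect_entropy = shannon_entropy(suspect)
print(f"baseline: {baseline_entropy:.3f} bits")
print(f"suspect:  {suspect_entropy:.3f} bits")
```

In this sketch, a counter whose readings stay within one bin has zero entropy, while readings scattered across bins yield a higher value. Comparing such values between a baseline window and a later window is one plausible way to flag a change without domain-specific thresholds; how the paper actually aggregates entropy across subsystems is not stated here.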

Author information

Corresponding author

Correspondence to Haroon Malik.

About this article

Cite this article

Malik, H., Shakshuki, E.M. Detecting performance anomalies in large-scale software systems using entropy. Pers Ubiquit Comput 21, 1127–1137 (2017). https://doi.org/10.1007/s00779-017-1036-y

  • DOI: https://doi.org/10.1007/s00779-017-1036-y
