Abstract
Large-scale software systems (LSSs) are composed of hundreds of subsystems that interact with each other in unforeseen and complex ways. The operators of these LSSs closely monitor thousands of metrics (performance counters) to quickly identify performance anomalies before they lead to a catastrophe. Existing monitoring tools and methodologies have not kept pace with the rapid growth and inherent complexity of these LSSs and are therefore ineffective in helping practitioners pinpoint performance anomalies. We propose two methodologies that use an entropy measure to assist practitioners and operators of LSSs in quickly detecting both system-wide anomalies and the underlying localized subsystem anomalies. Our performance tests, conducted on an open-source benchmark system, reveal that the proposed methodologies are robust in pinpointing anomalies, do not require any domain knowledge to operate, and avoid overloading practitioners with information.
Cite this article
Malik, H., Shakshuki, E.M. Detecting performance anomalies in large-scale software systems using entropy. Pers Ubiquit Comput 21, 1127–1137 (2017). https://doi.org/10.1007/s00779-017-1036-y
DOI: https://doi.org/10.1007/s00779-017-1036-y