ABSTRACT
The increasing complexity of today's systems makes fast and accurate failure detection essential for their use in mission-critical applications. Various monitoring methods provide a large amount of data about a system's behavior. Analyzing this data with advanced statistical methods holds the promise of not only detecting errors faster, but also detecting errors that are difficult to catch with current monitoring tools. Two challenges in building such detection tools are the high dimensionality of observation data, which makes the models expensive to apply, and frequent system changes, which make the models expensive to update. In this paper, we present algorithms that reduce the dimensionality of the data in a way that makes it easy to adapt to system changes. We decompose the observation data into signal and noise subspaces. Two statistics, the Hotelling T² score and the squared prediction error (SPE), are calculated to represent the data characteristics in the signal and noise subspaces, respectively. Instead of tracking the original data, we use a sequentially discounting expectation maximization (SDEM) algorithm to learn the distribution of the two extracted statistics. A failure event can then be detected from an abnormal change in that distribution. Applying our technique to component interaction data in a simple e-commerce application shows better accuracy than building independent profiles for each component. Additionally, experiments on synthetic data show that the detection accuracy is high even for changing systems.
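The two statistics named in the abstract can be illustrated with a minimal sketch. It assumes the signal/noise split is obtained by principal component analysis (via an SVD of the centered data), which the abstract does not state explicitly. The function names `fit_subspaces` and `t2_and_spe`, the subspace dimension `k`, and the synthetic data are hypothetical choices for illustration only. The SDEM tracking step is not shown.

```python
import numpy as np

def fit_subspaces(X, k):
    """Estimate a k-dimensional signal subspace from training rows of X.

    Assumption: PCA is used for the decomposition; the noise subspace is
    the orthogonal complement of the returned basis.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / (X.shape[0] - 1)  # variance along each direction
    return mean, Vt[:k].T, eigvals[:k]

def t2_and_spe(x, mean, P, eigvals):
    """Return (Hotelling T^2, squared prediction error) for one observation."""
    xc = x - mean
    scores = P.T @ xc                    # coordinates in the signal subspace
    t2 = float(np.sum(scores**2 / eigvals))
    residual = xc - P @ scores           # component in the noise subspace
    spe = float(residual @ residual)
    return t2, spe

# Synthetic data (hypothetical): 10 observed metrics driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
W = rng.normal(size=(2, 10))
X = latent @ W + 0.01 * rng.normal(size=(500, 10))

mean, P, eigvals = fit_subspaces(X, k=2)

normal = X[0]
t2_n, spe_n = t2_and_spe(normal, mean, P, eigvals)

# Perturb the observation along a direction orthogonal to the signal subspace.
d = rng.normal(size=10)
d -= P @ (P.T @ d)
d /= np.linalg.norm(d)
anomalous = normal + 5.0 * d
t2_a, spe_a = t2_and_spe(anomalous, mean, P, eigvals)

print(f"normal:    T2={t2_n:.3f}  SPE={spe_n:.5f}")
print(f"anomalous: T2={t2_a:.3f}  SPE={spe_a:.5f}")
```

In this sketch the perturbation breaks the learned correlation structure without changing the signal-subspace coordinates, so it shows up in the SPE while the T² score stays essentially unchanged. A shift within the signal subspace would instead raise T², which is why the two statistics are tracked together.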
Index Terms
- Failure detection and localization in component based systems by online tracking