ABSTRACT
To ensure high availability, self-managing systems require self-monitoring and a system model against which to analyze monitoring data. Characterizing relationships between system metrics has been shown to model simple multi-tier transaction systems effectively, enabling failure detection and fault diagnosis. In this paper we show how to extend this invariant metric-relationships approach to clustered multi-tier systems. We show through analysis and experimentation that naive application of the approach increases cost dramatically while reducing diagnosis accuracy. We demonstrate that randomization at the load balancer during the invariant-identification phase improves diagnosis accuracy, but it neither completely eliminates the problem nor reduces the cost; indeed, it may increase the cost, because it requires a long learning phase to remove all accidental correlations. Finally, we argue that knowing the system structure is necessary to apply invariants effectively in a clustered environment.
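The cost and accidental-correlation concerns raised in the abstract can be illustrated with a minimal sketch. It assumes an invariant is modeled as a pairwise linear regression between two metrics that is accepted when its R² exceeds a threshold; the abstract does not specify the model, so the function names, the threshold, and the sample data below are all hypothetical and are not the authors' implementation.

```python
from itertools import combinations
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    my = mean(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def learn_invariants(metrics, threshold=0.9):
    """Return {(name_a, name_b): (slope, intercept)} for well-fitting pairs.

    `metrics` maps a metric name to a list of samples. The 0.9 threshold
    is an assumed value chosen only for this illustration.
    """
    invariants = {}
    for a, b in combinations(sorted(metrics), 2):
        slope, intercept = fit_line(metrics[a], metrics[b])
        if r_squared(metrics[a], metrics[b], slope, intercept) >= threshold:
            invariants[(a, b)] = (slope, intercept)
    return invariants


def candidate_pairs(metrics_per_node, nodes):
    """Number of metric pairs examined when every pair is a candidate."""
    n = metrics_per_node * nodes
    return n * (n - 1) // 2


# Hypothetical samples: an even load balancer makes the request counts of
# two independent nodes move together, so a cross-node pair is learned too.
samples = {
    "node1.requests": [10, 20, 30, 40],
    "node1.cpu": [5.1, 9.9, 15.2, 19.8],
    "node2.requests": [10, 20, 30, 40],
}
learned = learn_invariants(samples)
```

In this toy data, `learned` contains the cross-node pair `("node1.requests", "node2.requests")` even though the two nodes do not depend on each other, which is the kind of accidental correlation the abstract refers to. The pair count also grows quadratically: `candidate_pairs(10, 1)` is 45, while `candidate_pairs(10, 4)` is 780.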