Abstract
We introduce SelfTalk, a novel declarative language that allows users to query and understand the status of a large-scale system. SelfTalk is expressive enough to encode an administrator's high-level hypotheses or expectations about normal system behavior, such as "I expect that the throughputs across all system components are linearly correlated." SelfTalk works in conjunction with Dena, a runtime support system designed to help system administrators detect the root cause of system misbehavior quickly and accurately. Given a user hypothesis, Dena instantiates it and validates it against actual monitored data within specific system contexts. We evaluate Dena by posing several hypotheses about system behavior and querying Dena to diagnose anomalies in a virtual storage system. We find that Dena can automatically validate system performance against the user hypotheses and accurately diagnose system misbehavior.
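To make the example hypothesis concrete, the sketch below checks "throughputs across all system components are linearly correlated" against a window of monitored samples. It is not SelfTalk syntax and not Dena's actual implementation. The function names (`pearson`, `validate_linear_correlation`), the choice of the Pearson coefficient as the correlation test, and the 0.9 threshold are all illustrative assumptions.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        # A constant series has no defined correlation; treat it as uncorrelated.
        return 0.0
    return cov / sqrt(vx * vy)


def validate_linear_correlation(
    throughputs: Dict[str, List[float]], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Hypothetical check of 'all component throughputs are linearly correlated'.

    Assumes throughputs of components on the same request path should rise and
    fall together (positive correlation). Returns every component pair whose
    coefficient falls below `threshold`; an empty list means the hypothesis
    holds for this monitoring window.
    """
    violations = []
    for a, b in combinations(sorted(throughputs), 2):
        r = pearson(throughputs[a], throughputs[b])
        if r < threshold:
            violations.append((a, b, r))
    return violations


# Illustrative (made-up) samples: "db" tracks "app" exactly, "disk" stays flat.
samples = {
    "app": [100, 120, 140, 160, 180],
    "db": [50, 60, 70, 80, 90],
    "disk": [30, 30.5, 29, 31, 30],
}
violations = validate_linear_correlation(samples)
```

Here the "app"/"db" pair satisfies the hypothesis, while both pairs involving "disk" are reported as violations, pointing an administrator at the disk component. A real diagnosis tool would also need to decide which monitoring windows and system contexts the hypothesis applies to, which this sketch leaves out.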
SelfTalk for Dena: query language and runtime support for evaluating system behavior