research-article

Vigilant: out-of-band detection of failures in virtual machines

Published: 01 January 2008

Abstract

What do our computer systems do all day? How do we make sure they continue doing it when failures occur? Traditional approaches to answering these questions often involve in-band monitoring agents. However, in-band agents suffer from several drawbacks: they must be written or customized for every workload (operating system and possibly also application), they are potential security liabilities, and they are themselves affected by adverse conditions in the monitored systems.
Virtualization technology makes it possible to encapsulate an entire operating system or application instance within a virtual object that can then be easily monitored and manipulated without any knowledge of the contents or behavior of that object. This can be done out-of-band, using general purpose agents that do not reside inside the object, and hence are not affected by the behavior of the object.
This paper describes Vigilant, a novel way of monitoring virtual machines for problems. Vigilant requires no specialized agents inside the virtual object it is monitoring. Instead, it uses the hypervisor to directly monitor the resource requests and utilization of that object. Machine learning methods are then used to analyze the readings. Our experimental results show that problems can be detected out-of-band with high accuracy. Using Vigilant, we demonstrate that out-of-band monitoring based on virtualization and machine learning can accurately identify faults in the guest OS while avoiding the many pitfalls associated with in-band monitoring.
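
The pipeline outlined above (hypervisor-level readings fed to a learned classifier) can be illustrated with a small sketch. The abstract does not say which features or learning method Vigilant uses, so everything here is an assumption made for illustration: the three features (CPU percent, disk requests per second, network packets per second), the synthetic data, the function names, and the choice of a nearest-centroid classifier. It is not the authors' implementation.

```python
import math
import random

# Hypothetical per-interval readings a hypervisor might expose for one guest:
# (cpu_percent, disk_requests_per_s, net_packets_per_s).
# Feature names and values are illustrative assumptions, not from the paper.

def synthetic_readings(label, n, rng):
    """Generate n fake readings for a 'healthy' or a 'failed' guest."""
    if label == "healthy":
        means = (40.0, 200.0, 500.0)
    else:  # assumed failure signature: CPU pinned, almost no I/O
        means = (98.0, 5.0, 10.0)
    return [tuple(rng.gauss(m, 0.1 * m + 1.0) for m in means) for _ in range(n)]

def centroid(rows):
    """Per-feature mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(col) for col in zip(*rows))

def train(labeled):
    """labeled maps a label to its readings; returns (centroids, per-feature std)."""
    all_rows = [r for rows in labeled.values() for r in rows]
    mean = centroid(all_rows)
    # Standard deviation per feature, used to put features on a common scale.
    std = tuple(
        math.sqrt(sum((r[i] - mean[i]) ** 2 for r in all_rows) / len(all_rows)) or 1.0
        for i in range(len(mean))
    )
    centroids = {label: centroid(rows) for label, rows in labeled.items()}
    return centroids, std

def classify(model, reading):
    """Return the label whose centroid is nearest in scaled Euclidean distance."""
    centroids, std = model

    def dist(c):
        return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(reading, c, std)))

    return min(centroids, key=lambda label: dist(centroids[label]))

if __name__ == "__main__":
    rng = random.Random(0)
    model = train({
        "healthy": synthetic_readings("healthy", 50, rng),
        "failed": synthetic_readings("failed", 50, rng),
    })
    # The classifier sees only externally observed readings; no agent runs in the guest.
    print(classify(model, (41.0, 190.0, 480.0)))
    print(classify(model, (97.0, 4.0, 12.0)))
```

The point of the sketch is the division of labor: data collection happens entirely outside the guest, and the classifier needs no knowledge of the guest's operating system or application.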


Published In

ACM SIGOPS Operating Systems Review, Volume 42, Issue 1
January 2008
133 pages
ISSN: 0163-5980
DOI: 10.1145/1341312

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article

Cited By

  • (2023) Reliable and Accurate Fault Detection with GPGPUs and LLVM. 2023 IEEE 16th International Conference on Cloud Computing (CLOUD), pp. 540-546. DOI: 10.1109/CLOUD60044.2023.00072. Online publication date: Jul-2023
  • (2022) Extensive Study of Cloud Computing Technologies, Threats and Solutions Prospective. Computer Systems Science and Engineering 41:1, pp. 225-240. DOI: 10.32604/csse.2022.019547. Online publication date: 2022
  • (2022) Security Issues and Defenses in Virtualization. Proceedings of International Conference on Information Technology and Applications, pp. 605-617. DOI: 10.1007/978-981-16-7618-5_52. Online publication date: 21-Apr-2022
  • (2022) Multi-layered Monitoring for Virtual Machines. System Dependability and Analytics, pp. 99-140. DOI: 10.1007/978-3-031-02063-6_6. Online publication date: 26-Jul-2022
  • (2018) Hang doctor. Proceedings of the Thirteenth EuroSys Conference, pp. 1-15. DOI: 10.1145/3190508.3190525. Online publication date: 23-Apr-2018
  • (2017) A Host-Agnostic, Supervised Machine Learning Approach to Automated Overload Detection in Virtual Machine Workloads. 2017 IEEE International Conference on Smart Cloud (SmartCloud), pp. 13-23. DOI: 10.1109/SmartCloud.2017.10. Online publication date: Nov-2017
  • (2017) A Host-Independent Supervised Machine Learning Approach to Automated Overload Detection in Virtual Machine Workloads. 2017 IEEE 2nd International Workshops on Foundations and Applications of Self* Systems (FAS*W), pp. 181-186. DOI: 10.1109/FAS-W.2017.145. Online publication date: Sep-2017
  • (2016) A Practical Approach to Hard Disk Failure Prediction in Cloud Platforms: Big Data Model for Failure Management in Datacenters. 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), pp. 105-116. DOI: 10.1109/BigDataService.2016.10. Online publication date: Mar-2016
  • (2014) Building Reliable and Secure Virtual Machines Using Architectural Invariants. IEEE Security & Privacy 12:5, pp. 82-85. DOI: 10.1109/MSP.2014.87. Online publication date: Sep-2014
  • (2014) Reliability and Security Monitoring of Virtual Machines Using Hardware Architectural Invariants. Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 13-24. DOI: 10.1109/DSN.2014.19. Online publication date: 23-Jun-2014
