Improving the reliability of commodity operating systems

Published: 02 February 2005

Abstract

Despite decades of research in extensible operating system technology, extensions such as device drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85% of recently reported failures.

This article describes Nooks, a reliability subsystem that seeks to greatly enhance operating system (OS) reliability by isolating the OS from driver failures. The Nooks approach is practical: rather than guaranteeing complete fault tolerance through a new (and incompatible) OS or driver architecture, our goal is to prevent the vast majority of driver-caused crashes with little or no change to the existing driver and system code. Nooks isolates drivers within lightweight protection domains inside the kernel address space, where hardware and software prevent them from corrupting the kernel. Nooks also tracks a driver's use of kernel resources to facilitate automatic cleanup during recovery.

To prove the viability of our approach, we implemented Nooks in the Linux operating system and used it to fault-isolate several device drivers. Our results show that Nooks offers a substantial increase in the reliability of operating systems, catching and quickly recovering from many faults that would otherwise crash the system. Under a wide range and number of fault conditions, we show that Nooks recovers automatically from 99% of the faults that otherwise cause Linux to crash.

While Nooks was designed for drivers, our techniques generalize to other kernel extensions. We demonstrate this by isolating a kernel-mode file system and an in-kernel Internet service. Overall, because Nooks supports existing C-language extensions, runs on a commodity operating system and hardware, and enables automated recovery, it represents a substantial step beyond the specialized architectures and type-safe languages required by previous efforts directed at safe extensibility.
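
As a rough illustration of the resource-tracking idea mentioned in the abstract, the Python sketch below models a tracker that records what each driver has acquired and releases all of it when that driver fails. The class name, the method names, and the release-callback scheme are hypothetical and chosen only for illustration. This is not the Nooks implementation, which runs inside the Linux kernel and works with C-language drivers.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class ResourceTracker:
    """Toy model of per-driver resource tracking (hypothetical API)."""

    def __init__(self) -> None:
        # driver name -> list of (resource label, release callback)
        self._held: Dict[str, List[Tuple[str, Callable[[], None]]]] = defaultdict(list)

    def acquire(self, driver: str, label: str, release: Callable[[], None]) -> None:
        """Record that `driver` now holds the resource named `label`."""
        self._held[driver].append((label, release))

    def recover(self, driver: str) -> List[str]:
        """Release everything a failed driver still holds, newest first.

        Returns the labels of the released resources in release order.
        """
        released: List[str] = []
        for label, release in reversed(self._held.pop(driver, [])):
            release()
            released.append(label)
        return released


if __name__ == "__main__":
    freed: List[str] = []
    tracker = ResourceTracker()
    tracker.acquire("net_driver", "interrupt line", lambda: freed.append("interrupt line"))
    tracker.acquire("net_driver", "packet buffer", lambda: freed.append("packet buffer"))
    tracker.acquire("sound_driver", "interrupt line", lambda: freed.append("sound interrupt"))

    # Simulate a fault in net_driver: only its resources are reclaimed.
    print(tracker.recover("net_driver"))
```

Keeping the bookkeeping outside the driver is the point of the sketch: cleanup does not depend on the failed driver running any more code, and other drivers' resources are left untouched.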


    • Published in

      ACM Transactions on Computer Systems, Volume 23, Issue 1
      February 2005
      110 pages
      ISSN: 0734-2071
      EISSN: 1557-7333
      DOI: 10.1145/1047915

      Copyright © 2005 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States
