Recovery oriented programming: runtime monitoring of safety and liveness

  • Regular Paper
International Journal on Software Tools for Technology Transfer

Abstract

We introduce the recovery-oriented programming paradigm. A program designed according to this paradigm includes, as an integral part, the important safety and liveness properties that it should respect and the recovery actions to execute when one of these properties is violated. We design a pre-compiler that compiles the properties and recovery actions into code that monitors the properties and enforces the recovery actions upon a violation. Assuming that a given program is restartable and that a self-stabilizing software platform exists, the compiled program is able to recover from safety and liveness violations. We provide a generic correctness proof scheme for recovery-oriented programs, proving that the code, as transformed by the pre-compiler, converges to a legal execution in a finite number of steps after experiencing an arbitrary failure.
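
To make the idea of pairing properties with recovery actions concrete, the following is a minimal Python sketch written for illustration only. It is not the pre-compiler or the specification language from the paper. The names `Monitor`, `restart`, and `run_demo` are hypothetical, and the liveness property is approximated by a step bound (an assumption made here, since a finite run cannot refute "eventually" on its own). The program starts in an arbitrary corrupted state, the monitor detects the safety violation, and a restart returns it to a legal state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Dict[str, int]


@dataclass
class Monitor:
    """Hypothetical runtime monitor; an illustration, not the paper's tool."""
    safety: Callable[[State], bool]     # must hold in every observed state
    progress: Callable[[State], bool]   # must hold at least once per `bound` steps
    recover: Callable[[State], None]    # recovery action run on a violation
    bound: int                          # assumed step bound standing in for liveness
    steps_since_progress: int = 0
    recoveries: int = 0

    def check(self, state: State) -> None:
        # Track how long it has been since the progress predicate last held.
        if self.progress(state):
            self.steps_since_progress = 0
        else:
            self.steps_since_progress += 1
        # A violated safety predicate or an exceeded bound triggers recovery.
        if not self.safety(state) or self.steps_since_progress > self.bound:
            self.recover(state)
            self.steps_since_progress = 0
            self.recoveries += 1


def restart(state: State) -> None:
    # Recovery action: restart from a known legal initial state.
    state.clear()
    state["counter"] = 0


def run_demo() -> Tuple[State, int]:
    state: State = {"counter": 999}  # arbitrary (corrupted) starting state
    monitor = Monitor(
        safety=lambda s: 0 <= s.get("counter", -1) <= 10,
        progress=lambda s: s.get("counter", -1) == 0,
        recover=restart,
        bound=20,
    )
    for _ in range(50):
        monitor.check(state)                            # monitoring step
        state["counter"] = (state["counter"] + 1) % 11  # program step
    return state, monitor.recoveries


if __name__ == "__main__":
    final_state, recoveries = run_demo()
    print(final_state, recoveries)
```

In this sketch the single recovery happens on the first check, after which the counter stays within its legal range and wraps to zero often enough to satisfy the bounded progress condition.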

Author information

Corresponding author

Correspondence to Olga Brukman.

Additional information

An extended abstract of this work was presented at the 8th International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS’06), Austin, Texas, USA, 2006, and the 20th ACM Symposium on Operating Systems Principles (SOSP’05), Brighton, England, 2005.

This work was partially supported by the Lynne and William Frankel Center for Computer Sciences and the Rita Altura Trust Chair in Computer Sciences.

About this article

Cite this article

Brukman, O., Dolev, S. Recovery oriented programming: runtime monitoring of safety and liveness. Int J Softw Tools Technol Transfer 13, 377–395 (2011). https://doi.org/10.1007/s10009-011-0200-3

  • DOI: https://doi.org/10.1007/s10009-011-0200-3
