ABSTRACT
Software bugs that occur in production are often difficult to reproduce in the lab because of nondeterminism and subtle differences in the application environment. To address this problem, we present Transplay, a system that captures production software bugs into small per-bug recordings, which are used to reproduce the bugs on a completely different operating system without access to any of the original software used in the production environment. Transplay introduces partial checkpointing, a new mechanism that efficiently captures the partial state necessary to reexecute just the last few moments of the application before it encountered a failure. The recorded state, which typically consists of a few megabytes of data, is used to replay the application without requiring the specific application binaries, libraries, support data, or the original execution environment. Transplay integrates with existing debuggers to provide standard debugging facilities, allowing the user to examine the contents of variables and other program state at each source line of the application's replayed execution. We have implemented a Transplay prototype that can record unmodified Linux applications and replay them on different versions of Linux as well as on Windows. Experiments with several applications, including Apache and MySQL, show that Transplay can reproduce real bugs and be used in production with modest recording overhead.
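The abstract does not spell out how partial checkpointing works internally, so the following is only a toy model of the general idea: during a recording interval, save each memory page the first time it is touched, then replay from those saved pages alone. All names here (RecordedMemory, ReplayMemory, last_moments, run_demo) and the 4 KB page size are assumptions made for illustration and are not taken from Transplay.

```python
from dataclasses import dataclass, field
from typing import Dict

PAGE_SIZE = 4096  # assumed page granularity for this toy model


@dataclass
class PartialCheckpoint:
    """Pages captured on first access within one recording interval."""
    pages: Dict[int, bytes] = field(default_factory=dict)

    def size_bytes(self) -> int:
        return sum(len(p) for p in self.pages.values())


class RecordedMemory:
    """Toy address space that saves a page the first time it is touched."""

    def __init__(self, num_pages: int) -> None:
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.checkpoint = PartialCheckpoint()

    def start_interval(self) -> None:
        # A new interval starts with an empty set of captured pages.
        self.checkpoint = PartialCheckpoint()

    def _capture(self, page_no: int) -> None:
        # Save the page contents as they were before this interval used them.
        if page_no not in self.checkpoint.pages:
            self.checkpoint.pages[page_no] = bytes(self.pages[page_no])

    def read(self, addr: int) -> int:
        page_no, offset = divmod(addr, PAGE_SIZE)
        self._capture(page_no)
        return self.pages[page_no][offset]

    def write(self, addr: int, value: int) -> None:
        page_no, offset = divmod(addr, PAGE_SIZE)
        self._capture(page_no)
        self.pages[page_no][offset] = value


class ReplayMemory:
    """Address space rebuilt from a partial checkpoint and nothing else."""

    def __init__(self, checkpoint: PartialCheckpoint) -> None:
        self.pages = {n: bytearray(p) for n, p in checkpoint.pages.items()}

    def read(self, addr: int) -> int:
        page_no, offset = divmod(addr, PAGE_SIZE)
        return self.pages[page_no][offset]  # KeyError if page was not recorded

    def write(self, addr: int, value: int) -> None:
        page_no, offset = divmod(addr, PAGE_SIZE)
        self.pages[page_no][offset] = value


def last_moments(mem) -> int:
    """Hypothetical stand-in for the final steps before a failure."""
    counter = mem.read(5)                      # touches page 0
    mem.write(2 * PAGE_SIZE + 3, counter + 1)  # touches page 2
    return mem.read(2 * PAGE_SIZE + 3)


def run_demo():
    mem = RecordedMemory(num_pages=1024)
    mem.write(5, 41)       # earlier execution, outside the recorded interval
    mem.start_interval()   # record only what follows
    recorded = last_moments(mem)
    replayed = last_moments(ReplayMemory(mem.checkpoint))
    full_size = len(mem.pages) * PAGE_SIZE
    return recorded, replayed, mem.checkpoint.size_bytes(), full_size
```

In this sketch the replay sees only the two pages touched during the interval, yet it computes the same result as the recorded run. A real system would also have to capture nondeterministic inputs such as system call results, which this model leaves out.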