DOI: 10.1145/1993744.1993757
research-article

Record and transplay: partial checkpointing for replay debugging across heterogeneous systems

Published: 07 June 2011

ABSTRACT

Software bugs that occur in production are often difficult to reproduce in the lab due to subtle differences in the application environment and nondeterminism. To address this problem, we present Transplay, a system that captures production software bugs into small per-bug recordings which are used to reproduce the bugs on a completely different operating system without access to any of the original software used in the production environment. Transplay introduces partial checkpointing, a new mechanism that efficiently captures the partial state necessary to reexecute just the last few moments of the application before it encountered a failure. The recorded state, which typically consists of a few megabytes of data, is used to replay the application without requiring the specific application binaries, libraries, support data, or the original execution environment. Transplay integrates with existing debuggers to provide standard debugging facilities to allow the user to examine the contents of variables and other program state at each source line of the application's replayed execution. We have implemented a Transplay prototype that can record unmodified Linux applications and replay them on different versions of Linux as well as Windows. Experiments with several applications including Apache and MySQL show that Transplay can reproduce real bugs and be used in production with modest recording overhead.
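The idea of recording only the state an application actually consumes can be illustrated with a toy model. The sketch below is not Transplay's mechanism, which the abstract does not detail; it assumes a hypothetical key-value `store` in place of process state, and the names `RecordingStore`, `ReplayStore`, and `app` are invented for illustration. It shows that a failure can be reproduced from a log of the values read, without access to the original data.

```python
class RecordingStore:
    """Wraps the full state and logs only the values the application reads."""

    def __init__(self, store):
        self.store = store   # full "production" state (hypothetical stand-in)
        self.log = {}        # partial state: first value seen for each key read

    def read(self, key):
        value = self.store[key]
        self.log.setdefault(key, value)
        return value


class ReplayStore:
    """Serves reads from a recorded log; the original state is not needed."""

    def __init__(self, log):
        self.log = log

    def read(self, key):
        # A KeyError here would mean the replay diverged from the recording.
        return self.log[key]


def app(store):
    """Hypothetical application that fails when 'divisor' is zero."""
    total = store.read("a") + store.read("b")
    return total / store.read("divisor")


def run_and_capture(store):
    """Run the application and return the name of the exception, if any."""
    try:
        app(store)
        return None
    except Exception as exc:
        return type(exc).__name__


# "Production": a large state of which the application touches three entries.
production = {f"key{i}": i for i in range(1000)}
production.update({"a": 1, "b": 2, "divisor": 0})

recorder = RecordingStore(production)
production_failure = run_and_capture(recorder)

# "Lab": replay from the small log alone.
replay_failure = run_and_capture(ReplayStore(dict(recorder.log)))
```

In this toy setting the log holds three entries out of more than a thousand, and the replayed run fails the same way as the recorded one. A real system would also have to deal with writes, nondeterministic inputs, and memory state, none of which this sketch models.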

Supplemental Material

metrics_3_3.mp4 (mp4, 186.8 MB)

            • Published in

              SIGMETRICS '11: Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
              June 2011
              376 pages
              ISBN: 9781450308144
              DOI: 10.1145/1993744

              Copyright © 2011 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Acceptance Rates

              Overall Acceptance Rate: 459 of 2,691 submissions, 17%
