Deterministic Replay: A Survey

Published: 24 September 2015

Abstract

Deterministic replay is an emerging technique for providing deterministic executions of computer programs in the presence of nondeterministic factors. Its application scope is very broad, making it an important research topic in domains such as computer architecture, operating systems, parallel computing, distributed computing, programming languages, verification, and hardware testing.
In this survey, we comprehensively review existing studies on deterministic replay by introducing a taxonomy. Existing deterministic replay schemes can be classified into two categories: single-processor (SP) schemes and multiprocessor (MP) schemes. By reviewing each category in detail, we summarize and compare how existing schemes address technical issues such as log size, record slowdown, replay slowdown, implementation cost, and probe effect, which may shed some light on future studies on deterministic replay.
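
To make the record-then-replay idea concrete, the following is a minimal single-threaded sketch and is not taken from the survey. The names `Recorder`, `Replayer`, `nondet`, and `program` are hypothetical and chosen only for illustration. The sketch assumes that every nondeterministic input passes through one interception point, and it does not attempt to handle the memory races that multiprocessor schemes must order.

```python
import random
import time


class Recorder:
    """Runs each nondeterministic operation for real and logs its result."""

    def __init__(self):
        self.log = []

    def nondet(self, source):
        value = source()        # perform the real nondeterministic operation
        self.log.append(value)  # the log grows by one entry per recorded event
        return value


class Replayer:
    """Returns logged results in order instead of re-running the operations."""

    def __init__(self, log):
        self._events = iter(log)

    def nondet(self, source):
        # `source` is deliberately ignored: the logged value is returned instead.
        return next(self._events)


def program(env):
    """Toy program (hypothetical) whose result depends on a random draw and the clock."""
    draw = env.nondet(lambda: random.randint(0, 1_000_000))
    stamp = env.nondet(time.time)
    return (draw * 31 + int(stamp * 1000)) % 9973


recorder = Recorder()
recorded_result = program(recorder)                # record run
replayed_result = program(Replayer(recorder.log))  # replay run, driven by the log
```

In this toy setting, the length of `recorder.log` stands in for log size, and the extra work done inside `nondet` stands in for record and replay slowdown. Real schemes face these costs at a much finer granularity.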

Published In

ACM Computing Surveys, Volume 48, Issue 2
November 2015
615 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/2830539
Editor: Sartaj Sahni
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 September 2015
Accepted: 01 June 2015
Revised: 01 June 2015
Received: 01 February 2013
Published in CSUR Volume 48, Issue 2

Author Tags

  1. Deterministic replay
  2. System-on-Chip
  3. chip multiprocessor
  4. data race
  5. debugging
  6. distributed system
  7. operating system
  8. order
  9. parallel system

Qualifiers

  • Survey
  • Research
  • Refereed

Funding Sources

  • NSF of China
  • International Collaboration Key Program of the CAS
  • 973 Program of China
  • Strategic Priority Research Program of the CAS
  • 10000 talent program
