
Non-Intrusive Detection of Synchronization Errors Using Execution Replay

Automated Software Engineering

Abstract

This paper presents a practical solution for detecting synchronization errors in parallel programs. Three kinds of error are covered: missing synchronization, which causes data races; conflicting synchronization, which causes deadlock; and redundant synchronization, which incurs a performance penalty.

The solution combines RecPlay, an efficient execution replay mechanism, with automatic on-the-fly detection of data races, deadlocks and redundant synchronization during a replayed execution. Detecting these errors normally introduces significant overhead, which can alter the execution being observed. However, because these costly analyses run during a replayed, and therefore unaltered, execution, there is almost no probe effect. Furthermore, memory consumption during data race detection is limited through the use of multilevel bitmaps and snooped matrix clocks. Because the record phase of RecPlay is highly efficient, tracing can be left on all the time; there is no need to switch it off, which eliminates the possibility of Heisenbugs.
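To make the idea of race detection over a replayed execution concrete, the following is a minimal sketch in Python. It is not RecPlay's implementation: it uses plain vector clocks and a per-address access history in place of the multilevel bitmaps and snooped matrix clocks the abstract mentions, and the trace format, the function name `find_races` and the helper `happens_before` are assumptions made for illustration only. The trace stands in for the event order that a replay would reproduce.

```python
from collections import defaultdict


def happens_before(a, b):
    """True if vector clock a is componentwise <= b and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def find_races(trace, n_threads):
    """Scan a (hypothetical) replayed trace and report unordered conflicts.

    trace: list of (thread, op, target) tuples, where op is one of
    'read', 'write', 'acquire', 'release'; target is an address for
    memory operations and a lock name for synchronization operations.
    """
    # One vector clock per thread; each thread starts at 1 in its own slot.
    clocks = [[0] * n_threads for _ in range(n_threads)]
    for t in range(n_threads):
        clocks[t][t] = 1

    lock_clock = {}               # lock -> clock stored at its last release
    history = defaultdict(list)   # address -> [(thread, op, clock snapshot)]
    races = []

    for thread, op, target in trace:
        vc = clocks[thread]
        if op == "acquire":
            # Merge in everything that happened before the last release.
            if target in lock_clock:
                clocks[thread] = [max(a, b) for a, b in zip(vc, lock_clock[target])]
        elif op == "release":
            lock_clock[target] = list(vc)
            vc[thread] += 1       # later accesses start a new segment
        else:
            snap = tuple(vc)
            for other_thread, other_op, other_snap in history[target]:
                if other_thread == thread:
                    continue      # same thread: ordered by program order
                if op == "read" and other_op == "read":
                    continue      # two reads never conflict
                if not happens_before(other_snap, snap):
                    races.append((target, (other_thread, other_op), (thread, op)))
            history[target].append((thread, op, snap))
    return races


# Two threads write 'x' with no synchronization: reported as a race.
racy = [(0, "write", "x"), (1, "write", "x")]

# The same writes protected by lock 'L': ordered, so nothing is reported.
synced = [
    (0, "acquire", "L"), (0, "write", "x"), (0, "release", "L"),
    (1, "acquire", "L"), (1, "write", "x"), (1, "release", "L"),
]

racy_result = find_races(racy, 2)
synced_result = find_races(synced, 2)
```

Keeping the full access history per address is the simplest choice for a sketch, but its memory use grows with the length of the trace; compact structures such as the bitmaps named in the abstract are one way to bound that cost.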




Cite this article

Ronsse, M., De Bosschere, K. Non-Intrusive Detection of Synchronization Errors Using Execution Replay. Automated Software Engineering 9, 95–121 (2002). https://doi.org/10.1023/A:1013236320820
