MPI-RCDD: A Framework for MPI Runtime Communication Deadlock Detection

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

The message passing interface (MPI) has become a de facto standard programming model for high-performance computing, but its rich and flexible interface semantics make programs prone to communication deadlock, which seriously affects the usability of the system. Existing detection tools for MPI communication deadlock, however, do not scale well enough to keep pace with the continuous growth in system size. In this context, we propose MPI-RCDD, a framework for MPI runtime communication deadlock detection that comprises three main mechanisms. First, MPI-RCDD has a message logging protocol associated with deadlock detection, which ensures that the communication messages required for deadlock analysis are not lost. Second, it uses the asynchronous processing thread provided by MPI to transfer dependencies between processes, so that multiple processes can participate in deadlock detection simultaneously, alleviating the performance bottleneck of centralized analysis. Third, it performs deadlock analysis with AODA, an algorithm based on the AND⊕OR model. AODA combines the advantages of timeout-based and dependency-based deadlock analysis: processes in the timeout state search for a deadlock cycle or knot while dependencies are being transferred. Furthermore, AODA does not produce false positives and accurately identifies the source of the deadlock. Experimental results on typical MPI communication deadlock benchmarks such as the Umpire Test Suite demonstrate the capability of MPI-RCDD. Additionally, experiments on the NPB benchmarks show acceptable performance overhead, indicating that MPI-RCDD scales well.
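
To make the AND⊕OR model concrete, the following is a minimal, centralized sketch of deadlock detection on a wait-for graph. It is not the AODA algorithm, which the abstract describes as distributed and timeout-driven; it only illustrates the underlying model. The names `Wait` and `find_deadlocked` are hypothetical, and the sketch assumes that each blocked process waits either for all of its targets (AND, as with a wait on several specific requests) or for any one of them (OR, as with a wildcard receive), and that every process that is not blocked will eventually satisfy the waits that depend on it.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Wait:
    """A blocked process's dependency (hypothetical helper type)."""
    mode: str                 # "AND": needs all targets; "OR": needs any one
    targets: FrozenSet[int]   # ranks this process is waiting on


def find_deadlocked(waits: Dict[int, Wait], nprocs: int) -> Set[int]:
    """Return the ranks that can never be released.

    Ranks absent from `waits` are treated as running. The loop repeatedly
    releases blocked ranks whose wait condition is met by already-released
    ranks; whatever remains at the fixed point is deadlocked.
    """
    for w in waits.values():
        if w.mode not in ("AND", "OR"):
            raise ValueError(f"unknown wait mode: {w.mode!r}")

    released = {p for p in range(nprocs) if p not in waits}
    changed = True
    while changed:
        changed = False
        for p, w in waits.items():
            if p in released:
                continue
            if w.mode == "AND":
                ok = w.targets <= released        # every target released
            else:
                ok = bool(w.targets & released)   # at least one released
            if ok:
                released.add(p)
                changed = True
    return set(range(nprocs)) - released


if __name__ == "__main__":
    # AND cycle: ranks 0 and 1 wait on each other; rank 2 is running.
    cycle = {0: Wait("AND", frozenset({1})), 1: Wait("AND", frozenset({0}))}
    print(sorted(find_deadlocked(cycle, 3)))

    # OR knot: every rank rank 0 could be released by waits on rank 0.
    knot = {
        0: Wait("OR", frozenset({1, 2})),
        1: Wait("AND", frozenset({0})),
        2: Wait("AND", frozenset({0})),
    }
    print(sorted(find_deadlocked(knot, 3)))
```

In the first case the cycle alone is enough to deadlock ranks 0 and 1. In the second, a cycle through rank 0 is not sufficient by itself because rank 0 has an OR wait; it is deadlocked only because all of its alternatives lead back to blocked ranks, which is the knot condition.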

Author information

Corresponding author

Correspondence to Jian Gao.

Electronic supplementary material

ESM 1

(PDF 531 kb)

About this article

Cite this article

Wei, HM., Gao, J., Qing, P. et al. MPI-RCDD: A Framework for MPI Runtime Communication Deadlock Detection. J. Comput. Sci. Technol. 35, 395–411 (2020). https://doi.org/10.1007/s11390-020-9701-4

  • DOI: https://doi.org/10.1007/s11390-020-9701-4
