DOI: 10.1145/3126908.3126938

research-article · Public Access

Parastack: efficient hang detection for MPI programs at large scale

Published: 12 November 2017

ABSTRACT

While program hangs on large parallel systems can be detected via the widely used timeout mechanism, it is difficult for users to set the timeout: too small a timeout leads to high false alarm rates, while too large a timeout wastes a vast amount of valuable computing resources. To address these problems, this paper presents ParaStack, an extremely lightweight tool that detects hangs in a timely manner with high accuracy, negligible overhead, and great scalability, without requiring the user to select a timeout value. For a detected hang, it provides direction for further analysis by telling users whether the hang is the result of an error in the computation phase or the communication phase. For a computation-error-induced hang, our tool pinpoints the faulty process by excluding hundreds or thousands of other processes. We have adapted ParaStack to work with the Torque and Slurm parallel batch schedulers and validated its functionality and performance on Tianhe-2 and Stampede, respectively the world's current 2nd and 12th fastest supercomputers. Experimental results demonstrate that ParaStack detects hangs in a timely manner with negligible overhead and over 99% accuracy. No false alarm was observed in correct runs taking 66 hours at a scale of 256 processes and 39.7 hours at a scale of 1024 processes. ParaStack accurately reports the faulty process for computation-error-induced hangs.
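
The abstract contrasts ParaStack with the conventional timeout mechanism but does not spell out that baseline, so the following is a minimal, purely illustrative sketch of a timeout-based watchdog in Python; it is not ParaStack's algorithm, and the progress-log path, timeout value, and polling interval are hypothetical choices. It makes the stated trade-off concrete: the detector hinges entirely on the hand-picked TIMEOUT_S value, and it can say nothing about whether a hang stems from the computation or the communication phase, nor which process is at fault.

    # Illustrative baseline only (not ParaStack): declare a hang when the job's
    # progress log has not been updated within a user-chosen timeout.
    import os
    import sys
    import time

    LOG_PATH = "job_progress.log"  # hypothetical progress file written by the application
    TIMEOUT_S = 600                # too small -> false alarms; too large -> wasted node-hours
    POLL_S = 30                    # polling interval, also an arbitrary choice

    def last_progress_time(path):
        """Return the modification time of the progress log, or None if it does not exist yet."""
        try:
            return os.path.getmtime(path)
        except OSError:
            return None

    def watch():
        while True:
            t = last_progress_time(LOG_PATH)
            if t is not None and time.time() - t > TIMEOUT_S:
                # No progress observed for TIMEOUT_S seconds: report a suspected hang.
                # Unlike ParaStack, this gives no hint about the faulty phase or process.
                print("suspected hang: no progress for %d s" % TIMEOUT_S, file=sys.stderr)
                return 1
            time.sleep(POLL_S)

    if __name__ == "__main__":
        sys.exit(watch())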


  • Published in

    SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2017, 801 pages
    ISBN: 9781450351140
    DOI: 10.1145/3126908
    General Chair: Bernd Mohr; Program Chair: Padma Raghavan

    Copyright © 2017 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Acceptance Rates

    SC '17 paper acceptance rate: 61 of 327 submissions, 19%. Overall acceptance rate: 1,516 of 6,373 submissions, 24%.
