ABSTRACT
While program hangs on large parallel systems can be detected via the widely used timeout mechanism, it is difficult for users to choose a suitable timeout: too small a timeout yields a high false-alarm rate, while too large a timeout wastes a vast amount of valuable computing resources. To address these problems with hang detection, this paper presents ParaStack, an extremely lightweight tool that detects hangs in a timely manner and with high accuracy, incurs negligible overhead, scales well, and does not require the user to select a timeout value. For a detected hang, it guides further analysis by telling users whether the hang results from an error in the computation phase or the communication phase. For a hang induced by a computation error, our tool pinpoints the faulty process, excluding hundreds or thousands of other processes from consideration. We have adapted ParaStack to work with the Torque and Slurm parallel batch schedulers and validated its functionality and performance on Tianhe-2 and Stampede, at the time the world's 2nd and 12th fastest supercomputers, respectively. Experimental results demonstrate that ParaStack detects hangs in a timely manner at negligible overhead with over 99% accuracy. No false alarm was observed in correct runs lasting 66 hours at a scale of 256 processes and 39.7 hours at a scale of 1024 processes. ParaStack also accurately reports the faulty process for hangs induced by computation errors.
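The timeout mechanism that the abstract contrasts with ParaStack can be illustrated by a minimal sketch: the application reports progress to a watchdog, and the watchdog flags a hang once no progress has been observed within a fixed timeout. This is not the paper's implementation; the class and method names are illustrative, and the sketch exists only to make the timeout trade-off concrete (a small `timeout_s` flags slow-but-correct runs, a large one delays detection).

```python
import time


class TimeoutHangDetector:
    """Illustrative timeout-based hang detector (not ParaStack itself)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        # Timestamp of the most recently observed progress.
        self.last_progress = time.monotonic()

    def report_progress(self):
        # Called by the monitored application whenever it makes
        # observable progress (e.g., completes an iteration).
        self.last_progress = time.monotonic()

    def is_hung(self, now=None):
        # Declares a hang once the elapsed time since the last reported
        # progress exceeds the user-chosen timeout.
        now = time.monotonic() if now is None else now
        return (now - self.last_progress) > self.timeout_s
```

A run whose iterations legitimately take longer than `timeout_s` is reported as hung (a false alarm), while a genuine hang is only reported after `timeout_s` of wasted machine time; ParaStack's contribution is removing this user-chosen parameter entirely.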