ABSTRACT
While program hangs on large parallel systems can be detected via the widely used timeout mechanism, it is difficult for users to choose a suitable timeout: too small a timeout yields a high false-alarm rate, while too large a timeout wastes a vast amount of valuable computing resources. To address these problems with hang detection, this paper presents ParaStack, an extremely lightweight tool that detects hangs in a timely manner and with high accuracy, incurs negligible overhead, scales well, and does not require the user to select a timeout value. For a detected hang, it guides further analysis by telling users whether the hang results from an error in the computation phase or the communication phase. For a hang induced by a computation error, our tool pinpoints the faulty process, excluding hundreds or thousands of other processes from consideration. We have adapted ParaStack to work with the Torque and Slurm parallel batch schedulers and validated its functionality and performance on Tianhe-2 and Stampede, at the time the world's 2nd and 12th fastest supercomputers, respectively. Experimental results demonstrate that ParaStack detects hangs in a timely manner at negligible overhead with over 99% accuracy. No false alarm was observed in correct runs lasting 66 hours at a scale of 256 processes and 39.7 hours at a scale of 1024 processes. ParaStack also accurately reports the faulty process for hangs induced by computation errors.
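The timeout mechanism that the abstract contrasts with ParaStack can be illustrated by a minimal sketch: the application reports progress to a watchdog, and the watchdog flags a hang once no progress has been observed within a fixed timeout. This is not the paper's implementation; the class and method names are illustrative, and the sketch exists only to make the timeout trade-off concrete (a small `timeout_s` flags slow-but-correct runs, a large one delays detection).

```python
import time


class TimeoutHangDetector:
    """Illustrative timeout-based hang detector (not ParaStack itself)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        # Timestamp of the most recently observed progress.
        self.last_progress = time.monotonic()

    def report_progress(self):
        # Called by the monitored application whenever it makes
        # observable progress (e.g., completes an iteration).
        self.last_progress = time.monotonic()

    def is_hung(self, now=None):
        # Declares a hang once the elapsed time since the last reported
        # progress exceeds the user-chosen timeout.
        now = time.monotonic() if now is None else now
        return (now - self.last_progress) > self.timeout_s
```

A run whose iterations legitimately take longer than `timeout_s` is reported as hung (a false alarm), while a genuine hang is only reported after `timeout_s` of wasted machine time; ParaStack's contribution is removing this user-chosen parameter entirely.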