ABSTRACT
Detecting and isolating bugs that arise in parallel programs is a tedious and challenging task. An especially subtle class of bugs is those that are scale-dependent: small-scale test cases may not exhibit the bug, yet it arises in large-scale production runs and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales, and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.
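The idea of a scale-determined property can be illustrated with a small sketch. This is not Vrisha's KCCA model: it substitutes an ordinary least-squares fit against log2 of the process count as a much simpler stand-in. The metric (messages sent per process), the training values, the function names, and the 20% relative tolerance are all hypothetical, chosen only for illustration.

```python
import numpy as np

def fit_scale_model(scales, values):
    """Fit values ~ a * log2(scale) + b from bug-free small-scale runs.

    A least-squares stand-in for a learned scale/behavior relationship;
    it is NOT kernel canonical correlation analysis.
    """
    x = np.log2(np.asarray(scales, dtype=float))
    y = np.asarray(values, dtype=float)
    a, b = np.polyfit(x, y, 1)  # coefficients, highest degree first
    return float(a), float(b)

def is_anomalous(model, scale, observed, rel_tol=0.2):
    """Flag a run whose observed value strays from the extrapolated one.

    rel_tol is a hypothetical threshold, not a value from the paper.
    """
    a, b = model
    predicted = a * np.log2(float(scale)) + b
    deviation = abs(observed - predicted) / max(abs(predicted), 1e-12)
    return bool(deviation > rel_tol)

# Hypothetical bug-free training runs at small scales: a metric that
# grows as log2(number of processes).
train_scales = [4, 8, 16, 32]
train_values = [2.0, 3.0, 4.0, 5.0]
model = fit_scale_model(train_scales, train_values)

# Extrapolate to a large scale where no bug-free run is available.
ok_run = is_anomalous(model, 1024, observed=10.0)
bad_run = is_anomalous(model, 1024, observed=40.0)
```

The point of the sketch is that the model is trained only on small-scale runs, yet it still yields an expectation for a large-scale run, so a deviation there can be flagged without any large-scale bug-free reference.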
Index Terms
- Vrisha: using scaling properties of parallel programs for bug detection and localization