Vrisha: using scaling properties of parallel programs for bug detection and localization

Authors:

Milind Kulkarni,

Saurabh Bagchi

HPDC '11: Proceedings of the 20th international symposium on High performance distributed computing

Pages 85 - 96

https://doi.org/10.1145/1996130.1996143

Published: 08 June 2011

Abstract

Detecting and isolating bugs that arise in parallel programs is a tedious and challenging task. An especially subtle class of bugs is those that are scale-dependent: small-scale test cases may not exhibit the bug, yet it arises in large-scale production runs and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales, and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.
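
The core idea in the abstract, learning how a property scales from bug-free small runs and then checking a large run against the extrapolated model, can be illustrated with a deliberately simplified sketch. It does not use KCCA: it substitutes an ordinary least-squares fit in log-log space for a single feature. The feature (bytes sent per process), the training values, the `slack` and `floor` parameters, and all function names are hypothetical and are not taken from the paper.

```python
import math


def fit_power_law(scales, values):
    """Least-squares fit of log(value) = a + b * log(scale)."""
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def log_residual(model, scale, value):
    """Absolute gap between observed and predicted log(value)."""
    a, b = model
    return abs(math.log(value) - (a + b * math.log(scale)))


def is_anomalous(model, train_scales, train_values, scale, value,
                 slack=3.0, floor=0.05):
    """Flag a run whose residual exceeds a threshold derived from training.

    The threshold is slack times the worst training residual, but never
    below floor. Both knobs are assumptions made for this sketch.
    """
    worst = max(log_residual(model, s, v)
                for s, v in zip(train_scales, train_values))
    threshold = max(slack * worst, floor)
    return log_residual(model, scale, value) > threshold


# Hypothetical bug-free training runs at small scales:
# process count -> bytes sent per process.
train_scales = [4, 8, 16, 32]
train_values = [4100, 8150, 16500, 32600]

model = fit_power_law(train_scales, train_values)

# Two hypothetical production runs at 1024 processes.
healthy_flag = is_anomalous(model, train_scales, train_values, 1024, 1_050_000)
buggy_flag = is_anomalous(model, train_scales, train_values, 1024, 65_000)

print("fitted exponent:", round(model[1], 3))
print("healthy run flagged:", healthy_flag)
print("buggy run flagged:", buggy_flag)
```

Note that no bug-free run at 1024 processes is needed to judge the large run; the model comes entirely from the small-scale runs. A single-feature regression like this cannot capture the nonlinear, multi-feature correlations that a kernel method such as KCCA is meant to handle, so it should be read only as an illustration of the extrapolation step.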

Index Terms

Vrisha: using scaling properties of parallel programs for bug detection and localization
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging

Published In

HPDC '11: Proceedings of the 20th international symposium on High performance distributed computing

June 2011

296 pages

ISBN:9781450305525

DOI:10.1145/1996130

General Chair: Arthur "Barney" Maccabe, Oak Ridge National Lab, USA
Program Chair: Douglas Thain, University of Notre Dame, USA

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

HPDC '11

HPDC '11: The 20th International Symposium on High-Performance Parallel and Distributed Computing

June 8 - 11, 2011

San Jose, California, USA

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Bibliometrics

Total Citations: 17
Total Downloads: 246
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 1

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Jin Y, Wang H, Tang X, Guo Z, Zhao Y, Hoefler T, Liu T, Liu X, Zhai J (2025). Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems 36:2 (308-325). Online publication date: Feb-2025.
https://doi.org/10.1109/TPDS.2024.3485789
Zhai J, Jin Y, Chen W, Zheng W (2023). Graph Analysis for Scalability Analysis. Performance Analysis of Parallel Applications for HPC (101-128). Online publication date: 19-Jun-2023.
https://doi.org/10.1007/978-981-99-4366-1_5
Peng D, Feng Y, Liu Y, Liu X, Xue W, Chen D, Song J, Chen Z (2022). Jdebug: A Fast, Non-Intrusive and Scalable Fault Locating Tool for Ten-Million-Scale Parallel Applications. IEEE Transactions on Parallel and Distributed Systems 33:12 (3491-3504). Online publication date: 1-Dec-2022.
https://doi.org/10.1109/TPDS.2022.3157690
Zarins J, Weiland M, Bartholomew P, Lapworth L, Parsons M (2022). Detecting Scale-Induced Overflow Bugs in Production HPC Codes. High Performance Computing. ISC High Performance 2022 International Workshops (33-43). Online publication date: 29-May-2022.
https://dl.acm.org/doi/10.1007/978-3-031-23220-6_3
Jin Y, Wang H, Yu T, Tang X, Hoefler T, Liu X, Zhai J (2020). SCALANA: Automating Scaling Loss Detection with Graph Analysis. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). Online publication date: Nov-2020.
https://doi.org/10.1109/SC41405.2020.00032
Stuardo C, Leesatapornwongsa T, Suminto R, Ke H, Lukman J, Chuang W, Lu S, Gunawi H, Merchant A, Weatherspoon H (2019). Scalecheck. Proceedings of the 17th USENIX Conference on File and Storage Technologies (359-373). Online publication date: 25-Feb-2019.
https://dl.acm.org/doi/10.5555/3323298.3323332
Koo J, Saumya C, Kulkarni M, Bagchi S (2019). PySE: Automatic Worst-Case Test Generation by Reinforcement Learning. 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST) (136-147). Online publication date: Apr-2019.
https://doi.org/10.1109/ICST.2019.00023
Li H, Chen Z, Gupta R, Xie M (2018). Non-intrusively Avoiding Scaling Problems in and out of MPI Collectives. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (415-424). Online publication date: May-2018.
https://doi.org/10.1109/IPDPSW.2018.00076
Li H, Chen Z, Gupta R, Mohr B, Raghavan P (2017). Parastack. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3126908.3126938
Leesatapornwongsa T, Stuardo C, Suminto R, Ke H, Lukman J, Gunawi H (2017). Scalability Bugs. Proceedings of the 16th Workshop on Hot Topics in Operating Systems (24-29). Online publication date: 7-May-2017.
https://dl.acm.org/doi/10.1145/3102980.3102985
