research-article
DOI: 10.1145/1996130.1996143

Vrisha: using scaling properties of parallel programs for bug detection and localization

Published: 08 June 2011

Abstract

Detecting and isolating bugs that arise in parallel programs is a tedious and challenging task. An especially subtle class comprises scale-dependent bugs: small-scale test cases may not exhibit the bug, yet it arises in large-scale production runs and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales, and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs caused by errors in popular MPI libraries and show that our techniques can be implemented with low overhead and low false-positive rates.
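The core idea in the abstract, training on bug-free small-scale runs and then checking whether a large-scale run follows the extrapolated trend, can be illustrated with a deliberately simplified sketch. This is not Vrisha's KCCA model: it substitutes a single-feature log-log least-squares fit for KCCA, and every name, data value, and the tolerance threshold below are hypothetical choices made only for illustration.

```python
import math


def fit_power_law(scales, values):
    """Least-squares fit of log(value) = a + b * log(scale).

    Stand-in for a scale-determined behavior model; NOT the KCCA
    model described in the paper.
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, scale):
    """Extrapolate the fitted trend to an unseen (larger) scale."""
    intercept, slope = model
    return math.exp(intercept + slope * math.log(scale))


def is_anomalous(model, scale, observed, tolerance=0.25):
    """Flag a run whose observed feature deviates from the extrapolated
    value by more than `tolerance` (relative error; hypothetical threshold)."""
    expected = predict(model, scale)
    return abs(observed - expected) / expected > tolerance


# Hypothetical training data: one behavioral feature (say, bytes sent per
# run) observed in bug-free runs at small process counts.
train_scales = [4, 8, 16, 32]
train_bytes = [1024.0 * p for p in train_scales]
model = fit_power_law(train_scales, train_bytes)

# Check two hypothetical large-scale observations against the model.
healthy_run = is_anomalous(model, 1024, 1024.0 * 1024)
buggy_run = is_anomalous(model, 1024, 2 * 1024.0 * 1024)
```

In this sketch no bug-free run at scale 1024 is needed: the model is trained only at scales 4 to 32, and the large-scale observation is judged against the extrapolated trend. A real detector would correlate many features jointly, which is the role KCCA plays in the paper.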

    Published In

    HPDC '11: Proceedings of the 20th international symposium on High performance distributed computing
    June 2011
    296 pages
    ISBN:9781450305525
    DOI:10.1145/1996130

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. KCCA
    2. bug detection
    3. large-scale bugs

    Qualifiers

    • Research-article

    Conference

    HPDC '11

    Acceptance Rates

    Overall Acceptance Rate 166 of 966 submissions, 17%

    Cited By

    • (2025) Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems 36:2 (308-325). DOI: 10.1109/TPDS.2024.3485789. Online publication date: Feb-2025
    • (2023) Graph Analysis for Scalability Analysis. Performance Analysis of Parallel Applications for HPC (101-128). DOI: 10.1007/978-981-99-4366-1_5. Online publication date: 19-Jun-2023
    • (2022) Jdebug: A Fast, Non-Intrusive and Scalable Fault Locating Tool for Ten-Million-Scale Parallel Applications. IEEE Transactions on Parallel and Distributed Systems 33:12 (3491-3504). DOI: 10.1109/TPDS.2022.3157690. Online publication date: 1-Dec-2022
    • (2022) Detecting Scale-Induced Overflow Bugs in Production HPC Codes. High Performance Computing. ISC High Performance 2022 International Workshops (33-43). DOI: 10.1007/978-3-031-23220-6_3. Online publication date: 29-May-2022
    • (2020) SCALANA: Automating Scaling Loss Detection with Graph Analysis. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). DOI: 10.1109/SC41405.2020.00032. Online publication date: Nov-2020
    • (2019) Scalecheck. Proceedings of the 17th USENIX Conference on File and Storage Technologies (359-373). DOI: 10.5555/3323298.3323332. Online publication date: 25-Feb-2019
    • (2019) PySE: Automatic Worst-Case Test Generation by Reinforcement Learning. 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST) (136-147). DOI: 10.1109/ICST.2019.00023. Online publication date: Apr-2019
    • (2018) Non-intrusively Avoiding Scaling Problems in and out of MPI Collectives. 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (415-424). DOI: 10.1109/IPDPSW.2018.00076. Online publication date: May-2018
    • (2017) Parastack. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). DOI: 10.1145/3126908.3126938. Online publication date: 12-Nov-2017
    • (2017) Scalability Bugs. Proceedings of the 16th Workshop on Hot Topics in Operating Systems (24-29). DOI: 10.1145/3102980.3102985. Online publication date: 7-May-2017