Article

Problem diagnosis in large-scale computing environments

Authors:

Alexander V. Mirgorodskiy,

Naoya Maruyama,

Barton P. Miller

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

Pages 88 - es

https://doi.org/10.1145/1188455.1188548

Published: 11 November 2006

Abstract

We describe a new approach for locating the causes of anomalies in distributed systems. Our target environment is a distributed application that contains multiple identical processes performing similar activities. We use a new, lightweight form of dynamic instrumentation to collect function-level traces from each process. If the application fails, the traces are automatically compared to each other. We find anomalies by identifying processes that stopped earlier than the rest (a sign of a fail-stop problem) or processes that behaved differently from the rest (a sign of a non-fail-stop problem). Our algorithm does not require reference data to distinguish anomalies from normal behaviors. However, it can make use of such data when available to reduce the number of false positives. Ultimately, we identify a function that is likely to explain the anomalous behavior. We demonstrated the efficacy of our approach by finding two problems in a large distributed cluster environment called SCore.
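The comparison step described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes each trace is a list of (timestamp, function name) pairs keyed by a process id, uses normalized call counts as the per-process profile, and uses Manhattan distance as the dissimilarity measure. All function names below are hypothetical.

```python
from collections import Counter


def earliest_stopper(traces):
    """Fail-stop heuristic: return the process whose trace ends first."""
    ends = {pid: events[-1][0] for pid, events in traces.items() if events}
    return min(ends, key=ends.get)


def profile(events):
    """Normalized call-count profile: function name -> fraction of events."""
    counts = Counter(name for _, name in events)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def distance(p, q):
    """Manhattan distance between two sparse profiles."""
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))


def most_unusual(traces):
    """Non-fail-stop heuristic: the process farthest, in total, from all others."""
    profiles = {pid: profile(ev) for pid, ev in traces.items()}

    def score(pid):
        return sum(distance(profiles[pid], profiles[o])
                   for o in profiles if o != pid)

    return max(profiles, key=score)


def suspect_function(traces, pid):
    """Function whose share differs most between pid and the mean of its peers."""
    profiles = {p: profile(ev) for p, ev in traces.items()}
    others = [profiles[p] for p in profiles if p != pid]
    funcs = set().union(*(set(p) for p in profiles.values()))

    def delta(f):
        mean = sum(o.get(f, 0.0) for o in others) / len(others)
        return abs(profiles[pid].get(f, 0.0) - mean)

    return max(funcs, key=delta)
```

No reference data is needed here because each process is judged only against its peers, which mirrors the assumption that the processes are identical and perform similar activities. A realistic implementation would also need synchronized clocks across hosts and a threshold for deciding when a difference is large enough to report.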



Published In


SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

November 2006

746 pages

ISBN:0769527000

DOI:10.1145/1188455

Conference Chair:
Barbara Horner-Miller
Arctic Region Supercomputing Center

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States


Conference


SC '06: International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 17, 2006

Tampa, Florida

Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
