DOI: 10.1145/1188455.1188548
Article

Problem diagnosis in large-scale computing environments

Published: 11 November 2006

Abstract

We describe a new approach for locating the causes of anomalies in distributed systems. Our target environment is a distributed application that contains multiple identical processes performing similar activities. We use a new, lightweight form of dynamic instrumentation to collect function-level traces from each process. If the application fails, the traces are automatically compared with one another. We find anomalies by identifying processes that stopped earlier than the rest (a sign of a fail-stop problem) or processes that behaved differently from the rest (a sign of a non-fail-stop problem). Our algorithm does not require reference data to distinguish anomalies from normal behavior; however, it can use such data, when available, to reduce the number of false positives. Ultimately, we identify a function that is likely to explain the anomalous behavior. We demonstrated the efficacy of our approach by finding two problems in a large distributed cluster environment called SCore.
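
The abstract does not give the algorithm's details, so the following Python sketch only illustrates the general idea of comparing per-process traces without reference data. Everything in it is an assumption made for illustration: the function names (`stopped_early`, `suspect_scores`, `explaining_function`), the use of per-function time shares as a trace summary, the Manhattan distance, the nearest-peer outlier score, and the `max_lag` threshold are hypothetical and may differ from what the paper actually does.

```python
from statistics import median


def stopped_early(last_seen, max_lag):
    """Fail-stop candidates: processes whose last trace record is more than
    max_lag time units older than the median last record (assumed rule)."""
    typical = median(last_seen.values())
    return sorted(p for p, t in last_seen.items() if typical - t > max_lag)


def normalize(profile):
    """Turn per-function times into fractions of the process's total time."""
    total = sum(profile.values())
    return {f: t / total for f, t in profile.items()} if total else dict(profile)


def distance(a, b):
    """Manhattan distance between two normalized profiles."""
    return sum(abs(a.get(f, 0.0) - b.get(f, 0.0)) for f in set(a) | set(b))


def suspect_scores(profiles, k=1):
    """Non-fail-stop candidates: score each process by the distance to its
    k-th nearest peer; a process unlike all others gets a high score."""
    norm = {p: normalize(prof) for p, prof in profiles.items()}
    scores = {}
    for p, prof in norm.items():
        dists = sorted(distance(prof, o) for q, o in norm.items() if q != p)
        scores[p] = dists[min(k, len(dists)) - 1] if dists else 0.0
    return scores


def explaining_function(profiles, suspect):
    """Pick the function whose time share in the suspect process differs
    most from the median share among its peers."""
    norm = {p: normalize(prof) for p, prof in profiles.items()}
    peers = [prof for p, prof in norm.items() if p != suspect]
    funcs = sorted(set().union(*(prof.keys() for prof in norm.values())))

    def gap(f):
        return abs(norm[suspect].get(f, 0.0) - median(pr.get(f, 0.0) for pr in peers))

    return max(funcs, key=gap)


# Hypothetical data: time of the last trace record per process ...
last_seen = {"p0": 100.0, "p1": 101.0, "p2": 40.0, "p3": 99.5}
# ... and time spent per function (made-up function names).
profiles = {
    "p0": {"compute": 8.0, "send": 1.0, "recv": 1.0},
    "p1": {"compute": 7.8, "send": 1.1, "recv": 1.1},
    "p2": {"compute": 8.2, "send": 0.9, "recv": 0.9},
    "p3": {"compute": 3.0, "send": 6.5, "recv": 0.5},
}

print(stopped_early(last_seen, max_lag=10.0))
scores = suspect_scores(profiles)
suspect = max(scores, key=scores.get)
print(suspect, explaining_function(profiles, suspect))
```

With this made-up data, `p2` is flagged as having stopped early, `p3` receives the highest suspect score, and `send` is the function that differs most from its peers. Scoring each process only against its peers is what lets a sketch like this work without reference data; a reference profile from a known-good run could be added as an extra peer to lower the scores of unusual but normal processes.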



Published In

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing
November 2006
746 pages
ISBN: 0769527000
DOI: 10.1145/1188455

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%


Cited By

  • (2024) PP-CSA: Practical Privacy-Preserving Software Call Stack Analysis. Proceedings of the ACM on Programming Languages 8:OOPSLA1 (1264-1293). DOI: 10.1145/3649856. Online publication date: 29-Apr-2024
  • (2020) Detecting and reproducing error-code propagation bugs in MPI implementations. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (187-201). DOI: 10.1145/3332466.3374515. Online publication date: 19-Feb-2020
  • (2019) An Architecture for System Recovery Based on Solution Records on Different Servers. Advances on P2P, Parallel, Grid, Cloud and Internet Computing (904-913). DOI: 10.1007/978-3-030-33509-0_85. Online publication date: 20-Oct-2019
  • (2018) Automated interpretation and reduction of in-vehicle network traces at a large scale. Proceedings of the 55th Annual Design Automation Conference (1-6). DOI: 10.1145/3195970.3196000. Online publication date: 24-Jun-2018
  • (2017) Parastack. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). DOI: 10.1145/3126908.3126938. Online publication date: 12-Nov-2017
  • (2016) Log clustering based problem identification for online service systems. Proceedings of the 38th International Conference on Software Engineering Companion (102-111). DOI: 10.1145/2889160.2889232. Online publication date: 14-May-2016
  • (2016) Automated and dynamic abstraction of MPI application performance. Cluster Computing 19:3 (1105-1137). DOI: 10.1007/s10586-016-0615-4. Online publication date: 1-Sep-2016
  • (2015) ProvErr. Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (525-534). DOI: 10.1109/CCGrid.2015.86. Online publication date: 4-May-2015
  • (2014) Accurate application progress analysis for large-scale parallel debugging. ACM SIGPLAN Notices 49:6 (193-203). DOI: 10.1145/2666356.2594336. Online publication date: 9-Jun-2014
  • (2014) Accurate application progress analysis for large-scale parallel debugging. Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (193-203). DOI: 10.1145/2594291.2594336. Online publication date: 9-Jun-2014
