DOI: 10.1145/1736020.1736038
research-article

SherLog: error diagnosis by connecting clues from run-time logs

Published: 13 March 2010

ABSTRACT

Computer systems often fail due to many factors, such as software bugs or administrator errors. Diagnosing such production-run failures is an important but challenging task, because they are difficult to reproduce in house for several reasons: (1) users' inputs and file contents are unavailable due to privacy concerns; (2) the exact execution environment is difficult to rebuild; and (3) concurrent executions on multiprocessors are non-deterministic.

Therefore, programmers often have to diagnose a production-run failure based on logs collected from customers and the corresponding source code. Such diagnosis requires expert knowledge, and narrowing down root causes is time-consuming and tedious. To address this problem, we propose a tool called SherLog that analyzes source code, leveraging the information provided by run-time logs, to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge of the log's semantics. It infers both control and data value information about the failed execution.
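The abstract does not describe how log messages are tied back to the source code. As a rough illustration only, the sketch below assumes that logging call sites use printf-style format strings and that each format can be turned into a regular expression. Matching a log line then suggests which call sites may have printed it and which data values were logged. The call-site table, file locations, and function names are hypothetical and are not taken from SherLog itself.

```python
import re

# Hypothetical logging call sites: (source location, printf-style format).
LOG_SITES = [
    ("server.c:120", "cannot open file %s: error %d"),
    ("server.c:245", "connection from %s closed"),
]

# Only %s and %d are handled in this sketch.
_SPEC = re.compile(r"%([sd])")


def format_to_regex(fmt):
    """Turn a printf-style format string into an anchored regex."""
    parts = []
    pos = 0
    for m in _SPEC.finditer(fmt):
        # Literal text between specifiers must match exactly.
        parts.append(re.escape(fmt[pos:m.start()]))
        # %d captures an integer, %s captures any non-empty text.
        parts.append(r"(-?\d+)" if m.group(1) == "d" else r"(.+?)")
        pos = m.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile("^" + "".join(parts) + "$")


def match_log_line(line, sites=LOG_SITES):
    """Return (location, captured values) for every site that may have printed line."""
    hits = []
    for loc, fmt in sites:
        m = format_to_regex(fmt).match(line)
        if m:
            hits.append((loc, m.groups()))
    return hits
```

For example, `match_log_line("cannot open file /tmp/a.txt: error 2")` identifies the first call site and recovers the file name and error code as strings. A line that matches several formats yields several candidates, which is one way ambiguity between "must" and "may have happened" could arise in such a scheme.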

We evaluate SherLog on 8 representative real-world software failures (6 software bugs and 2 configuration errors) from 7 applications, including 3 servers. The information inferred by SherLog is very useful for programmers diagnosing these failures. Our results also show that SherLog can analyze large server applications such as Apache, with thousands of logging messages, within 40 minutes.

Published in

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
March 2010, 422 pages
ISBN: 9781605588391
DOI: 10.1145/1736020
General Chair: James C. Hoe
Program Chair: Vikram S. Adve

ACM SIGARCH Computer Architecture News, Volume 38, Issue 1 (ASPLOS '10)
March 2010, 399 pages
ISSN: 0163-5964
DOI: 10.1145/1735970

ACM SIGPLAN Notices, Volume 45, Issue 3 (ASPLOS '10)
March 2010, 399 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1735971

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

ASPLOS XV paper acceptance rate: 32 of 181 submissions (18%). Overall acceptance rate: 535 of 2,713 submissions (20%).
