ABSTRACT
Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) unavailability of users' inputs and file content due to privacy concerns; (2) difficulty in building the exact same execution environment; and (3) non-determinism of concurrent executions on multi-processors.
Therefore, programmers often have to diagnose a production run failure based on logs collected back from customers and the corresponding source code. Such diagnosis requires expert knowledge and is also too time-consuming, tedious to narrow down root causes. To address this problem, we propose a tool, called SherLog, that analyzes source code by leveraging information provided by run-time logs to infer what must or may have happened during the failed production run. It requires neither re-execution of the program nor knowledge on the log's semantics. It infers both control and data value information regarding to the failed execution.
We evaluate SherLog with 8 representative real world software failures (6 software bugs and 2 configuration errors) from 7 applications including 3 servers. Information inferred by SherLog are very useful for programmers to diagnose these evaluated failures. Our results also show that SherLog can analyze large server applications such as Apache with thousands of logging messages within only 40 minutes.
- H. Agrawal, R. A. DeMillo, and E. H. Spafford. Debugging with dynamic slicing and backtracking. Software -- Practice and Experience, 23(6):589--616, June 1993. Google ScholarDigital Library
- H. Agrawal, J. R. Horgan, S. London, and W. E.Wong. Fault localization using execution slices and dataflow tests. In ISSRE'95.Google Scholar
- M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In SOSP'03. Google ScholarDigital Library
- A. Aiken, S. Bugrara, I. Dillig, T. Dillig, P. Hawkins, and B. Hackett. The Saturn Program Analysis System.Google Scholar
- K. Ashcraft and D. Engler. Using programmer-written compiler extensions to catch security holes. In SP '02: Proceedings of the 2002 IEEE Symposium on Security and Privacy. Google ScholarDigital Library
- A. Ayers, R. Schooler, C. Metcalf, A. Agarwal, J. Rhee, and E. Witchel. Traceback: First fault diagnosis by reconstruction of distributed control flow. In PLDI'05. Google ScholarDigital Library
- T. Ball, M. Naik, and S. K. Rajamani. From symptom to cause: localizing errors in counterexample traces. ACM SIGPLAN Notices, 38(1):97--105, Jan. 2003. Google ScholarDigital Library
- P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using magpie for request extraction and workload modelling. In OSDI'04. Google ScholarDigital Library
- E. Bodden, P. Lam, and L. Hendren. Finding programming errors earlier by evaluating runtime monitors ahead-of-time. In FSE'08. Google ScholarDigital Library
- C. Cadar, D. Dunbar, and D. R. Engler. Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI'08. Google ScholarDigital Library
- F. Chen and G. Rosú. Parametric trace slicing and monitoring. In TACAS'09. Google ScholarDigital Library
- T. M. Chilimbi, B. Liblit, K. Mehra, A. V. Nori, and K. Vaswani. HOLMES: Effective statistical debugging via efficient path profiling. In ICSE'09. Google ScholarDigital Library
- V. Chipounov, V. Georgescu, C. Zamfir, and G. Candea. Selective Symbolic Execution. In HotDep'09.Google Scholar
- I. Cohen, S. Zhang, M. Goldszmidt, J. Symons, T. Kelly, and A. Fox. Capturing, indexing, clustering, and retrieving system history. In SOSP'05. Google ScholarDigital Library
- Dell. Streamlined Troubleshooting with the Dell system E--Support tool. Dell Power Solutions, 2008.Google Scholar
- R. A. DeMillo, H. Pan, and E. H. Spafford. Critical slicing for software fault localization. In ISSTA, pages 121--134, 1996. Google ScholarDigital Library
- J. Devietti, B. Lucia, M. Oskin, and L. Ceze. Dmp: Deterministic shared-memory multiprocessing. In ASPLOS'09. Google ScholarDigital Library
- I. Dillig, T. Dillig, and A. Aiken. Sound, complete and scalable pathsensitive analysis. SIGPLAN Not., 2008. Google ScholarDigital Library
- G. W. Dunlap, D. G. Lucchetti, M. A. Fetterman, and P. M. Chen. Execution replay of multiprocessor virtual machines. In VEE'08. Google ScholarDigital Library
- D. Engler, B. Chelf, and A. Chou. Checking system rules using system--specific, programmer--written compiler extensions. In OSDI'00. Google ScholarDigital Library
- K. Fisher, D. Walker, K. Q. Zhu, and P. White. From dirt to shovels: Fully automatic tool generation from ad hoc data. In POPL'08. Google ScholarDigital Library
- K. Glerum, K. Kinshumann, S. Greenberg, G. Aul, V. Orgovan, G. Nichols, D. Grant, G. Loihle, and G. Hunt. Debugging in the (very)large: ten years of implementation and experience. In SOSP'09, pages 103--116, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- J. Gray. Why do computers stop and what can be done about it?, 1985.Google Scholar
- Z. Guo, X.Wang, J. Tang, X. Liu, Z. Xu, M.Wu,M. F. Kaashoek, and Z. Zhang. R2: An application-level kernel for record and replay. In OSDI'08. Google ScholarDigital Library
- R. Gupta, M. L. Soffa, and J. Howard. Hybrid slicing: integrating dynamic information with static analysis. ACMTransactions on Software Engineering and Methodology, 6(4):370--397, Oct. 1997. Google ScholarDigital Library
- S. Horwitz, T. Reps, and D. Binkley. Interprocedural slicing using dependence graphs. In PLDI '88. Google ScholarDigital Library
- W. Jiang. Understanding storage system problems and diagnosing them through log analysis. Ph.D. Dissertation. Google ScholarDigital Library
- W. Jiang, C. Hu, S. Pasupathy, A. Kanevsky, Z. Li, and Y. Zhou. Understanding customer problem troubleshooting from storage system logs. In FAST'09. Google ScholarDigital Library
- S. Kandula, R. Mahajan, P. Verkaik, S. Agrawal, J. Padhye, and P. Bahl. Degailed diagnosis in enterprise networks. In SIGCOMM'09. Google ScholarDigital Library
- S. T. King, G. W. Dunlap, and P. M. Chen. Debugging operating systems with time-traveling virtual machines. In USENIX ATC'05. Google ScholarDigital Library
- B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In PLDI'03. Google ScholarDigital Library
- Apache Logging Services -- Log4j. http://logging.apache.org/log4j.Google Scholar
- R. Manevich, M. Sridharan, S. Adams, M. Das, and Z. Yang. PSE: Explaining program failures via postmortem static analysis. SIGSOFT Softw. Eng. Notes, 29(6):63--72, 2004. Google ScholarDigital Library
- Mozilla Quality Feedback Agent. http://support.mozilla.com/en-US/kb/quality+feedback+agent.Google Scholar
- S. Narayanasamy, C. Pereira, and B. Calder. Recording shared memory dependencies using strata. In ASPLOS'06. Google ScholarDigital Library
- S. Narayanasamy, G. Pokam, and B. Calder. Bugnet: Continuously recording program execution for deterministic replay debugging. In ISCA'05. Google ScholarDigital Library
- NetApp. Proactive health management with auto-support. NetApp White Paper, 2007.Google Scholar
- M. Olszewski, J. Ansel, and S. Amarasinghe. Kendo: Efficient determistic multithreading in software. In ASPLOS'09. Google ScholarDigital Library
- Squid Archives. http://www.squid-cache.org/Versions/v2/2.3/bugs/#squid-2.3.stable4-ftp_icon_not_found.Google Scholar
- M. Sridharan, S. J. Fink, and R. Bodik. Thin slicing. In PLDI'07. Google ScholarDigital Library
- F. Tip. A survey of program slicing techniques. Journal of Programming Languages, 3:121--189, 1995.Google Scholar
- J. Tucek, S. Lu, C. Huang, S. Xanthos, and Y. Zhou. Triage: Diagnosing production run failures at the user's site. In SOSP'07. Google ScholarDigital Library
- VMWare. Using the intergrated virtual debugger for visual studio. http://www.vmware.com/pdf/ws65_manual.pdf.Google Scholar
- A. Whitaker, R. S. Cox, and S. D. Gribble. Configuration debugging as search: finding the needle in the haystack. In OSDI'04. Google ScholarDigital Library
- Windows Error Reporting(Dr.Watson). http://www.microsoft.com/whdc/maintain/StartWER.mspx.Google Scholar
- M. Xu, R. Bodik, and M. D. Hill. A "flight data recorder" for enabling full-system multiprocessor deterministic replay. In ISCA'03. Google ScholarDigital Library
- W. Xu, L. Huang,M. Jordan, D. Patterson, and A. Fox. Mining console logs for large-scale system problem detection. In SOSP'09.Google Scholar
- J. Yang, P. Twohey, D. Engler, and M. Musuvathi. Using model checking to find serious file system errors. In OSDI'04. Google ScholarDigital Library
- Y.Xie and A.Aiken. Saturn: A scalable framework for error detection using boolean satisfiability. Transactions on Programming Language and Systems, 29(3):1---16, 2007. Google ScholarDigital Library
- A. Zeller. Isolating cause-effect chains from computer programs. In FSE'02. Google ScholarDigital Library
Index Terms
SherLog: error diagnosis by connecting clues from run-time logs
Recommendations
Improving Software Diagnosability via Log Enhancement
Special Issue APLOS 2011Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental complexity of troubleshooting any complex software system, but further exacerbated by the paucity of information that is typically available in the ...
SherLog: error diagnosis by connecting clues from run-time logs
ASPLOS '10Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) ...
SherLog: error diagnosis by connecting clues from run-time logs
ASPLOS '10Computer systems often fail due to many factors such as software bugs or administrator errors. Diagnosing such production run failures is an important but challenging task since it is difficult to reproduce them in house due to various reasons: (1) ...
Comments