ABSTRACT
When something unexpected happens in a large production system, administrators must first perform a search to isolate which components and component interactions are likely to be involved. The system may consist of thousands of interacting subsystems, the logging instrumentation may be noisy or incomplete, and the problem description may be vague, so this search is often the most difficult part of understanding the system's behavior. To facilitate the search process, we present a query language and a method for computing these queries that makes minimal assumptions about the available data. We evaluate our method on nearly 1.22 billion lines of system logs from four supercomputers, two autonomous vehicles, and a server cluster.
Index Terms
- A query language for understanding component interactions in production systems