DOI: 10.1145/1810085.1810114
research-article

A query language for understanding component interactions in production systems

Published: 02 June 2010

ABSTRACT

When something unexpected happens in a large production system, administrators must first perform a search to isolate which components and component interactions are likely to be involved. The system may consist of thousands of interacting subsystems, the logging instrumentation may be noisy or incomplete, and the problem description may be vague, so this search is often the most difficult part of understanding the system's behavior. To facilitate the search process, we present a query language and a method for computing these queries that makes minimal assumptions about the available data. We evaluate our method on nearly 1.22 billion lines of system logs from four supercomputers, two autonomous vehicles, and a server cluster.

Published in

ICS '10: Proceedings of the 24th ACM International Conference on Supercomputing
June 2010
365 pages
ISBN: 9781450300186
DOI: 10.1145/1810085

Copyright © 2010 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 584 of 2,055 submissions, 28%
