DOI: 10.1145/2663165.2663319
research-article

Stage-aware anomaly detection through tracking log points

Published: 08 December 2014

Abstract

We introduce Stage-aware Anomaly Detection (SAAD), a low-overhead, real-time solution for detecting runtime anomalies in storage systems. Modern storage server architectures are multi-threaded and structured as a set of modules, which we call stages. SAAD leverages this structure to collect stage-level log summaries at runtime and to perform statistical analysis across stage instances. Stages that generate rare execution flows, or that register unusually high durations for regular flows, indicate anomalies. SAAD makes two key contributions: (i) it limits the search space for root causes by pinpointing specific anomalous code stages, and (ii) it reduces the compute and storage requirements of log analysis, while preserving accuracy, through a novel technique based on log summarization. We evaluate SAAD on three distributed storage systems: HBase, Hadoop Distributed File System (HDFS), and Cassandra. We show that SAAD uncovers various anomalies in real time with practically zero overhead.
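
The two anomaly signals named in the abstract (rare execution flows, and unusually long durations for regular flows) can be illustrated with a small sketch. This is not the authors' implementation: the record layout (stage name, ordered tuple of log point IDs, duration), the function name `detect_anomalies`, and both thresholds are assumptions chosen for illustration, and a simple median comparison stands in for whatever statistical test the paper actually uses.

```python
from collections import Counter, defaultdict
from statistics import median


def detect_anomalies(records, rare_fraction=0.01, slow_factor=3.0):
    """Flag rare flows and slow instances in a batch of stage-instance summaries.

    Each record is (stage, flow, duration_ms), where `flow` is the tuple of
    log point IDs that one stage instance hit, in order. The layout and both
    thresholds are illustrative assumptions, not values from the paper.
    """
    flow_counts = defaultdict(Counter)   # stage -> Counter of flows
    durations = defaultdict(list)        # (stage, flow) -> list of durations
    for stage, flow, duration_ms in records:
        flow_counts[stage][flow] += 1
        durations[(stage, flow)].append(duration_ms)

    # Signal 1: a flow is rare if it accounts for a tiny share of the
    # instances observed for its stage.
    rare_flows = []
    for stage, counts in flow_counts.items():
        total = sum(counts.values())
        for flow, n in counts.items():
            if n / total < rare_fraction:
                rare_flows.append((stage, flow))

    # Signal 2: for regular (non-rare) flows, an instance is slow if its
    # duration far exceeds the median duration of the same flow.
    rare = set(rare_flows)
    typical = {key: median(values) for key, values in durations.items()}
    slow_instances = []
    for stage, flow, duration_ms in records:
        if (stage, flow) in rare:
            continue
        if duration_ms > slow_factor * typical[(stage, flow)]:
            slow_instances.append((stage, flow, duration_ms))

    return rare_flows, slow_instances
```

Because each record keeps only a flow signature and a duration rather than the full log text, the input to the analysis stays small, which is the intuition behind summarizing logs per stage instance before analyzing them.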

Published In

Middleware '14: Proceedings of the 15th International Middleware Conference
December 2014
334 pages
ISBN:9781450327855
DOI:10.1145/2663165

Sponsors

  • Orange
  • Conseil Régional d'Aquitaine
  • LaBRI
  • Raytheon BBN Technologies
  • ACM: Association for Computing Machinery
  • Red Hat JBoss Middleware
  • Bordeaux: City of Bordeaux
  • USENIX Assoc
  • GDR ASR: GDR Architecture, Systèmes et Réseaux
  • IBM
  • HP
  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anomaly detection
  2. failure diagnostics
  3. log

Qualifiers

  • Research-article

Acceptance Rates

Middleware '14 Paper Acceptance Rate 27 of 144 submissions, 19%;
Overall Acceptance Rate 203 of 948 submissions, 21%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 7
  • Downloads (last 6 weeks): 1
Reflects downloads up to 03 Mar 2025

Cited By

  • (2024) A Review of Software Testing Process Log Parsing and Mining. 2024 IEEE International Conference on Software Services Engineering (SSE), pages 334-343. DOI: 10.1109/SSE62657.2024.00055. Online publication date: 7-Jul-2024
  • (2021) A Hierarchical Tree-Based Syslog Clustering Scheme for Network Diagnosis. 2021 17th International Conference on Network and Service Management (CNSM), pages 28-34. DOI: 10.23919/CNSM52442.2021.9615506. Online publication date: 25-Oct-2021
  • (2021) On Log Analysis and Stack Trace Use to Improve Program Slicing. Information Science and Applications, pages 265-275. DOI: 10.1007/978-981-33-6385-4_25. Online publication date: 3-Apr-2021
  • (2020) On Matching Log Analysis to Source Code. Proceedings of the International Conference on Research in Adaptive and Convergent Systems, pages 181-187. DOI: 10.1145/3400286.3418262. Online publication date: 13-Oct-2020
  • (2020) Leveraging Anomaly Detection for Proactive Application Monitoring. Artificial Intelligence XXXVII, pages 380-385. DOI: 10.1007/978-3-030-63799-6_29. Online publication date: 8-Dec-2020
  • (2019) Progress in Outlier Detection Techniques: A Survey. IEEE Access 7, pages 107964-108000. DOI: 10.1109/ACCESS.2019.2932769. Online publication date: 2019
  • (2017) Safe Inspection of Live Virtual Machines. ACM SIGPLAN Notices 52:7, pages 97-111. DOI: 10.1145/3140607.3050766. Online publication date: 8-Apr-2017
  • (2017) Safe Inspection of Live Virtual Machines. Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 97-111. DOI: 10.1145/3050748.3050766. Online publication date: 8-Apr-2017
  • (2016) Security intelligence for cloud management infrastructures. IBM Journal of Research and Development 60:4, pages 11:1-11:13. DOI: 10.1147/JRD.2016.2572462. Online publication date: 1-Jul-2016
