DOI: 10.1145/3464298.3493396
research-article

CAT: content-aware tracing and analysis for distributed systems

Published: 02 December 2021

Abstract

Tracing and analyzing the interactions and exchanges between nodes is fundamental to uncovering the performance, correctness, and dependability issues that are almost unavoidable in any complex distributed system. Existing monitoring tools acknowledge this importance but, so far, restrict tracing to the external attributes of I/O messages, thus missing the wealth of information in their content.
We present CaT, a non-intrusive content-aware tracing and analysis framework that, through a novel similarity-based approach, is able to comprehensively trace and correlate the flow of network and storage requests from applications. By supporting multiple tracing tools, CaT can balance the coverage of captured events with the impact on applications' performance.
An experimental evaluation with two widely used applications (TensorFlow and Apache Hadoop) shows how CaT can improve the analysis of distributed systems. The results also exemplify the trade-offs available for balancing tracing coverage against performance impact. Interestingly, in certain cases, full coverage of events can be attained with negligible performance and storage overhead.
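
To make the idea of correlating I/O events by content more concrete, the following is a minimal sketch and not CaT's actual implementation. It assumes that similarity is measured as the Jaccard resemblance between sets of fixed-size byte shingles, and that two captured events are correlated when their payloads exceed a threshold. The names `Event`, `shingles`, `resemblance`, and `correlate`, the shingle size, and the threshold are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


def shingles(payload: bytes, k: int = 4) -> set:
    """Return the set of overlapping k-byte windows of a payload."""
    if len(payload) < k:
        return {payload} if payload else set()
    return {payload[i:i + k] for i in range(len(payload) - k + 1)}


def resemblance(a: bytes, b: bytes, k: int = 4) -> float:
    """Jaccard similarity of the two shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty payloads are treated as identical
    return len(sa & sb) / len(sa | sb)


@dataclass(frozen=True)
class Event:
    """Hypothetical captured I/O event: where it happened and what it carried."""
    node: str
    syscall: str
    payload: bytes


def correlate(events, threshold: float = 0.8):
    """Return (i, j, score) for every pair of events with similar content."""
    pairs = []
    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            score = resemblance(events[i].payload, events[j].payload)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    trace = [
        Event("node1", "send", b"block-0001:hello world payload"),
        Event("node2", "recv", b"block-0001:hello world payload"),
        Event("node2", "write", b"totally different content!"),
    ]
    for i, j, score in correlate(trace):
        print(trace[i].node, trace[i].syscall, "->",
              trace[j].node, trace[j].syscall, round(score, 2))
```

The pairwise comparison is quadratic in the number of events, which is acceptable for a sketch but would not scale to large traces; a real system would likely need some form of hashing or indexing to avoid comparing every pair.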

Published In

Middleware '21: Proceedings of the 22nd International Middleware Conference
December 2021
398 pages
ISBN:9781450385343
DOI:10.1145/3464298
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Sponsors

In-Cooperation

  • USENIX Association
  • IFIP

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. black-box
  2. content-aware analysis
  3. distributed systems
  4. tracing

Funding Sources

  • FCT - Fundacao para a Ciencia e a Tecnologia (Portuguese Foundation for Science and Technology)
  • National Funds through FCT
  • ERDF - European Regional Development Fund

Conference

Middleware '21: 22nd International Middleware Conference
December 6 - 10, 2021
Québec City, Canada

Acceptance Rates

Overall Acceptance Rate 203 of 948 submissions, 21%
