Abstract
Microservice-based architectures are gaining popularity for their benefits in software development. Distributed tracing can be used to help operators maintain observability in this highly distributed context, and find problems such as latency, and analyse their context and root cause. However, exploring and working with distributed tracing data is sometimes difficult due to its complexity and application specificity, volume of information and lack of tools. The most common and general tools available for this kind of data, focus on trace-level human-readable data visualisation. Unfortunately, these tools do not provide good ways to abstract, navigate, filter and analyse tracing data. Additionally, they do not automate or aid with trace analysis, relying on administrators to do it themselves. In this paper we propose using tracing data to extract service metrics, dependency graphs and work-flows with the objective of detecting anomalous services and operation patterns. We implemented and published open source prototype tools to process tracing data, conforming to the OpenTracing standard, and developed anomaly detection methods. We validated our tools and methods against real data provided by a major cloud provider. Results show that there is an underused wealth of actionable information that can be extracted from both metric and morphological aspects derived from tracing. In particular, our tools were able to detect anomalous behaviour and situate it both in terms of involved services, work-flows and time-frame. Furthermore, we identified some limitations of the OpenTracing format—as well as the industry accepted tracing abstractions—, and provide suggestions to test trace quality and enhance the standard.
Similar content being viewed by others
Data Availability
The data that support the findings of this study are available from Huawei Research, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Huawei Research. All our source code was made available in GitHub.Footnote 1
Notes
OpenTracing Processor (OTP), https://github.com/andrepbento/OpenTracingProcessor
References
The OpenTracing Specification repository. https://github.com/opentracing/specification. Retrieved on Nov, 2018
Aguilera, M.K., Mogul, J.C., Wiener, J.L., Reynolds, P., Muthitacharoen, A.: Performance debugging for distributed systems of black boxes. ACM SIGOPS Operating Systems Review 37(5), 74 (2003). https://doi.org/10.1145/1165389.945454
Apache Software Foundation: Zipkin. http://zipkin.io (2016). Retrieved on Feb, 2019
Ates, E., Sturmann, L., Toslali, M., Krieger, O., Megginson, R., Coskun, A.K., Sambasivan, R.R.: An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications. In: Proceedings of the ACM Symposium on Cloud Computing - SoCC ’19, pp 165–170. ACM Press, New York (2019). https://doi.org/10.1145/3357223.3362704
Cinque, M., Della Corte, R., Pecchia, A.: Microservices monitoring with event logs and black box execution tracing. IEEE Trans. Serv. Comput., 1–1. https://doi.org/10.1109/TSC.2019.2940009 (2019)
Cloud Native Computing Foundation: OpenTelemetry: Effective observability requires high-quality telemetry. https://opentelemetry.io (2019). Retrieved on July, 2019
Cotroneo, D., De Simone, L., Liguori, P., Natella, R., Bidokhti, N.: Enhancing failure propagation analysis in cloud computing systems. In: 2019 IEEE 30th International Symposium on Software Reliability Engineering (ISSRE), pp 139–150. IEEE (2019). https://doi.org/10.1109/ISSRE.2019.00023
Cournapeau, D.: Scikit-learn - Machine learning in Python. https://github.com/scikit-learn/scikit-learn. Retrieved on Feb, 2019 (2007)
Dragoni, N., Giallorenzo, S., Lafuente, A.L., Mazzara, M., Montesi, F., Mustafin, R., Safina, L.: Microservices: yesterday, today, and tomorrow. In: Present and Ulterior Software Engineering, pp 195–216 (2017). https://doi.org/10.1007/978-3-319-67425-4_12
Erlingsson, Ú., Peinado, M., Peter, S., Erlingsson, U., Peinado, M., Peter, S., Budiu, M.: Fay. Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles - SOSP ’11 13(4), 311–326 (2011). https://doi.org/10.1145/2043556.2043585
Ewaschuk, R., Beyer, B.: Site Reliability engineering: How Google Runs Production Systems, chap. Monitoring Distributed Systems, pp. 55–66. O’Reilly Media Inc. (2016)
Fonseca, R., Porter, G., Katz, R.H., Shenker, S., Stoica, I.: X-trace: a pervasive network tracing framework. In: Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation (NSDI’07), April, p. 20. USENIX Association. https://doi.org/10.5555/1973430.1973450 (2007)
Fowler, M., Lewis, J.: Microservices, a definition of this architectural term. https://martinfowler.com/articles/microservices.html. Retrieved on Sep, 2018 (2014)
Francesco, P.D., Malavolta, I., Lago, P.: Research on architecting microservices: trends, focus, and potential for industrial adoption. In: 2017 IEEE International Conference on Software Architecture (ICSA), pp 21–30. IEEE (2017). https://doi.org/10.1109/ICSA.2017.24
Google LLC: OpenCensus. https://opencensus.io (2016). Retrieved on July, 2019
Grafana Labs: Grafana - The tool for beautiful metric dashboards. https://github.com/grafana/grafana (2015). Retrieved on Feb, 2019
Herbst, N.R., Kounev, S., Reussner, R.: Elasticity in cloud computing: what it is, and what it is not. Presented as part of the 10th International Conference on Autonomic Computing, 23–27 (2013)
Jacob, S.: The Rise of AIOps: How Data, Machine Learning, and AI Will Transform Performance Monitoring. Retrieved on Mar, 2019 (2019). https://www.appdynamics.com/blog/aiops/aiops-platforms-transform-performance-monitoring
Janapati, S.P.R.: Distributed Logging Architecture for Microservices. Retrieved on Feb, 2019 (2017). https://dzone.com/articles/distributed-logging-architecture-for-microservices
Jonas Bonér Dave Farley, R.K., Thompson, M.: The Reactive Manifesto. https://www.reactivemanifesto.org. Retrieved on Jun, 2019 (2014)
Kaldor, J., Mace, J., Bejda, M., Gao, E., Kuropatwa, W., O’Neill, J., Ong, K.W., Schaller, B., Shan, P., Viscomi, B., Venkataraman, V., Veeraraghavan, K., Song, Y.J.: Canopy: an end-to-end performance tracing and analysis system. In: SOSP 2017 - Proceedings of the 26th ACM Symposium on Operating Systems Principles, pp 34–50. ACM Press, New York (2017). https://doi.org/10.1145/3132747.3132749
Kohyarnejadfard, I., Shakeri, M., Aloise, D.: System Performance Anomaly Detection Using Tracing Data Analysis. In: ACM International Conference Proceeding Series, vol. Part F1482, pp 169–173. ACM Press, New York (2019). https://doi.org/10.1145/3323933.3324085
Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21 (7), 558–565 (1978). https://doi.org/10.1145/359545.359563. http://amturing.acm.org/p558-lamport.pdf, http://portal.acm.org/citation.cfm?doid=359545.359563
Laprie, J.C.: From dependability to resilience. In: 38th IEEE/IFIP Int. Conf. on Dependable Systems and Networks, pp G8–G9 (2008)
Las-Casas, P., Papakerashvili, G., Anand, V., Mace, J.: Sifter: scalable sampling for distributed traces, without feature engineering. In: Proceedings of the ACM Symposium on Cloud Computing - SoCC ’19, pp 312–324. ACM Press, New York (2019). https://doi.org/10.1145/3357223.3362736
Lerner, A.: AIOps Platforms. https://blogs.gartner.com/andrew-lerner/2017/08/09/aiops-platforms. Retrieved on Jun, 2019 (2017)
Levin, A., Garion, S., Kolodner, E.K., Lorenz, D.H., Barabash, K., Kugler, M., McShane, N.: AIOps for a cloud object storage service. In: 2019 IEEE International Congress on Big Data (Bigdatacongress), pp 165–169. IEEE (2019). https://doi.org/10.1109/BigDataCongress.2019.00036
Li, H., Oh, J., Oh, H., Lee, H.: Automated source code instrumentation for verifying potential vulnerabilities. IFIP Advances in Information and Communication Technology 471, 211–226 (2016). https://doi.org/10.1007/978-3-319-33630-5_15
Li, S.: Time Series of Price Anomaly Detection. https://towardsdatascience.com/time-series-of-price-anomaly-detection-13586cd5ff46. Retrieved on Jan, 2019 (2019)
Nedelkoski, S., Cardoso, J., Kao, O.: Anomaly detection from system tracing data using multimodal deep learning. In: 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), vol. 2019-July, pp. 179–186. IEEE. https://doi.org/10.1109/CLOUD.2019.00038 (2019)
NetworkX developers: NetworkX. https://networkx.github.io (2014). Retrieved on Nov, 2018
New Relic, Inc.: Newrelic – deliver more perfect software. https://newrelic.com (2008). Retrieved on Jan, 2021
OpenTracing Specification Council: The OpenTracing Data Model Specification. https://opentracing.io/specification (2019). Retrieved on Feb, 2019
OpenTracing Specification Council: The OpenTracing Semantic Conventions. https://github.com/opentracing/specification/blob/master/semantic_conventions.md (2019). Retrieved on Feb, 2019
OpenTracing Specification Council: The OpenTracing Semantic Specification. https://github.com/opentracing/specification/blob/master/specification.md (2019). Retrieved on Feb, 2019
Oracle: Java Stream API. https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html (2017). Retrieved on Feb, 2019
Pina, F., Correia, J., Filipe, R., Araujo, F., Cardoso, J.: Nonintrusive monitoring of microservice-based systems. In: 2018 IEEE 17th International Symposium on Network Computing and Applications (NCA), pp 1–8. IEEE (2018)
Project Jupyter: Jupyter Notebooks. https://jupyter.org (2015). Retrieved on Nov, 2018
Richardson, C.: Microservices Definition. https://microservices.io. Retrieved on Sep, 2018 (2019)
Sambasivan, R.R., Fonseca, R., Shafer, I., Ganger, G.R.: So, you want to trace your distributed system? Key design insights from years of practical experience. Tech. rep., Technical Report CMU-PDL-14 (2014)
Sambasivan, R.R., Shafer, I., Mace, J., Sigelman, B.H., Fonseca, R., Ganger, G.R.: Principled workflow-centric tracing of distributed systems. In: Proceedings of the Seventh ACM Symposium on Cloud Computing - SoCC ’16, pp 401–414. ACM Press, New York (2016). https://doi.org/10.1145/2987550.2987568
Sigelman, B.H., André, L., Burrows, M., Stephenson, P., Plakal, M., Beaver, D., Jaspan, S., Shanbhag, C.: Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Tech. rep., Google LLC (2010)
StumbleUpon, Inc: OpenTSDB. https://github.com/OpenTSDB/opentsdb (2010). Retrieved on Feb, 2019
Uber Technologies: Jaeger. https://www.jaegertracing.io (2017). Retrieved on Jun, 2019
Wes McKinney: Pandas - Flexible and powerfull time-series data analysis. https://github.com/pandas-dev/pandas (2008). Retrieved on Nov, 2018
Zhou, X., Peng, X., Xie, T., Sun, J., Ji, C., Liu, D., Xiang, Q., He, C.: Latent error prediction and fault localization for microservice applications by learning from system trace logs. In: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2019, pp 683–694. ACM Press, New York (2019). https://doi.org/10.1145/3338906.3338961
Acknowledgements
This work was produced with the support of INCD funded by FCT and FEDER, under the project 01/SAICT/2016 n 022153 and was partially carried out under the project P2020 - 31/SI/2017: AESOP — Autonomic Service Operation, supported by Portugal 2020 and UE-FEDER. This work was also partially supported by national funds through the FCT - Foundation for Science and Technology, I.P., within the scope of the project CISUC - UID/CEC/00326/2020 and by the European Social Fund, through the Regional Operational Program Centro 2020.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Bento, A., Correia, J., Filipe, R. et al. Automated Analysis of Distributed Tracing: Challenges and Research Directions. J Grid Computing 19, 9 (2021). https://doi.org/10.1007/s10723-021-09551-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s10723-021-09551-5