In Practice
Improving observability in Event Sourcing systems

https://doi.org/10.1016/j.jss.2021.111015

Highlights

  • Event Sourcing has the double purpose of keeping application state and providing decoupling between modules.

  • Our proposal is improving the observability of the system.

  • We make it possible to track all individual requests and their relation to events in the log.

  • This combination reduces the need to replay the event log for debugging.

Abstract

Event Sourcing (ES) systems use an event log with the double purpose of keeping application state and providing decoupled communication. While ES systems keep track of all business events, other untracked events, either from internal components or from external systems, may still cause failures. Determining the root cause of such failures usually involves complex procedures based on replaying the event log. In contrast, developers of distributed systems often instrument the source code to improve observability and to trace workflows and data.

Adding tracing to ES thus seems like an unexplored and powerful approach to improve the observability of the system. In this paper, we suggest possible implementations of the idea and discuss their merits. These include the adoption of well-known tracing-related tools and standards in ES systems, with the respective advantages for root-cause analysis, anomaly detection, profiling and others.

Introduction

Event Sourcing is a design approach where distributed applications keep their state as a sequence of state-changing operations. Instead of storing mutable objects, applications keep an immutable sequence of changes to such objects. Changing state and writing an event to the log is one single, and therefore atomic, operation. The log becomes the authoritative source of truth, offering eventual consistency, as events propagate to different parts of the distributed application. This design entails a strongly decoupled architecture, typical of publish–subscribe systems (Clayman et al., 2010), while providing reliable auditing and logging functionality. For example, developers may bring the application back to a previous execution state by replaying the events in the log. This feature is particularly powerful, not only for the sake of failure recovery and state management (most services can be stateless), but also for debugging and for experimenting with alternative what-if scenarios. Furthermore, it makes the system’s data schema more flexible, as it becomes possible to recover field values or even calculate additional ones from the log.
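The replay mechanism described above can be illustrated with a minimal sketch. It is not taken from the paper: the `Event` schema, the `StockAdded` event kind and the in-memory `EventLog` class are assumptions made only for illustration (`GuitarReserved` is borrowed from the Guitar Store example used later in the text).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """An immutable state-changing operation (hypothetical schema)."""
    kind: str       # "StockAdded" (assumed) or "GuitarReserved"
    item: str
    quantity: int


class EventLog:
    """Append-only, in-memory log; state changes only by appending events."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events)  # 1-based position of the new event

    def replay(self, upto: Optional[int] = None) -> Dict[str, int]:
        """Rebuild stock levels by folding over the first `upto` events."""
        stock: Dict[str, int] = {}
        for event in self._events[:upto]:
            sign = 1 if event.kind == "StockAdded" else -1
            stock[event.item] = stock.get(event.item, 0) + sign * event.quantity
        return stock


log = EventLog()
log.append(Event("StockAdded", "telecaster", 3))
log.append(Event("GuitarReserved", "telecaster", 1))
print(log.replay())        # {'telecaster': 2}  current state
print(log.replay(upto=1))  # {'telecaster': 3}  state before the reservation
```

Note that the current state is never stored: it is always derived from the log, which is what makes partial replays possible.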

The literature (Fowler, 2005a) mentions ease of debugging as one of the advantages of having the event log and the ability to do partial and branched replays. In this context, ES is often presented as a way to deal with some of the complications created by distribution. However, this approach preserves no metadata about the system, which is necessary to observe failures and assert the correctness of the ES mechanisms themselves. To investigate and correct these situations, developers normally have to resort to unstructured logging. It is worth noting that the cost of branching or replaying the log is very high, both in source code complexity and resources, as a portion of the events will essentially have to be reprocessed.

At the same time, the rising trend of fine-grained distributed systems using microservices and the Function-as-a-Service (FaaS) paradigm made applications more fragmented than ever. Essentially, while breaking the monolith simplifies programming and provides horizontal scalability, complexity goes into the distributed system, which becomes difficult to observe and comprehend (Pautasso et al., 2017). Getting a consistent picture of state involves causality relationships. End-to-end tracing (Xiang et al., 2016) is one such method, preserving a portion of causality relationships at the cost of requiring code instrumentation, for the sake of system analysis and debugging.

As we discuss in Section 2, tracing and ES are usually taken as separate paradigms. To some degree, ES is presented as a complete solution for debugging any issue, as it can theoretically recreate state at any point in time for analysis. State-changing events, however, lack information about the system’s internal variables and code execution path, not to mention interaction with external components. Log events lack structured metadata, thus failing to provide end-to-end tracing information. Operators thus lack good means to recover the workflows taking place inside the distributed system. On the other hand, tracing is an observability tool that uses source code instrumentation to pass specific data, known as baggage, between different parts of the application at run-time. Baggage relates and stitches the different services of the application together, for the sake of a posteriori debugging of their interactions. Tracing has a meta-functionality; it is not meant to be the source of truth of business data. Tools like zipkin.io (2021) or Jaeger (2021) demonstrate the acceptance of tracing in the industry.
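As a rough illustration of how baggage travels between services, the sketch below serializes a trace context into message headers on the sender and rebuilds it on the receiver. The `TraceContext` class, the `inject`/`extract` helpers and the `x-...` header names are all hypothetical; real tracers such as Zipkin or Jaeger define their own formats and APIs, which are not reproduced here.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TraceContext:
    """Identifiers and baggage carried along one request (assumed shape)."""
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None
    baggage: Dict[str, str] = field(default_factory=dict)


def start_trace(**baggage: str) -> TraceContext:
    """Open the root span of a new trace."""
    return TraceContext(uuid.uuid4().hex, uuid.uuid4().hex[:16], None, dict(baggage))


def inject(ctx: TraceContext) -> Dict[str, str]:
    """Sender side: copy the context into headers (hypothetical names)."""
    headers = {"x-trace-id": ctx.trace_id, "x-span-id": ctx.span_id}
    headers.update({f"x-baggage-{k}": v for k, v in ctx.baggage.items()})
    return headers


def extract(headers: Dict[str, str]) -> TraceContext:
    """Receiver side: rebuild the context and open a child span."""
    prefix = "x-baggage-"
    baggage = {k[len(prefix):]: v for k, v in headers.items() if k.startswith(prefix)}
    return TraceContext(
        trace_id=headers["x-trace-id"],
        span_id=uuid.uuid4().hex[:16],
        parent_span_id=headers["x-span-id"],
        baggage=baggage,
    )


# An orchestrator starts a trace; a downstream service continues it.
root = start_trace(order="42")
child = extract(inject(root))
print(child.trace_id == root.trace_id, child.parent_span_id == root.span_id)
```

The shared `trace_id` and the parent pointer are what allow the interactions of separate services to be stitched together after the fact.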

In Section 3, we review other work that explores or enriches ES in ways that have similarities to our own, e.g., to recover the state of the distributed system or for simulation purposes. Interestingly, our work also displays similarities to the topic of process mining, where the goal is to retrieve operational processes from system logs. We, therefore, briefly review the topic in this section.

In Section 4 we formally reason about ES and tracing, their differences and complementarities. We propose to use them together in Section 5. We do this by keeping internal events that, while not changing the business state of the application, are either in the causation or consequence chain of such changes. Processing and storage restrictions aside, this would enable a complete recovery of the execution state of the entire system.

Our goal is to increase observability by recovering causality inside the entire system, across separate components. Our contribution is an approach that leverages distributed tracing to preserve relationships between events and to connect these to software metadata, such as versions, ultimate event publishers and subscribers, and other runtime data, such as exceptions and logging. This will eventually make the identification of failure root causes much cheaper, by resorting to proven standards, tools and approaches.
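One possible reading of this idea is that each log event carries trace identifiers and software metadata, so that the causal chain leading to an event can be recovered without replaying the log. The sketch below assumes a hypothetical `TracedEvent` record and parent-span pointers. It is not the authors' implementation, and the event names other than `GuitarReserved` are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class TracedEvent:
    """A log event enriched with tracing and software metadata (assumed)."""
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    publisher: str   # service that emitted the event
    version: str     # software version of the publisher


def causal_chain(events: Iterable[TracedEvent], span_id: str) -> List[TracedEvent]:
    """Walk parent pointers from `span_id` back to the root, oldest first."""
    by_span: Dict[str, TracedEvent] = {e.span_id: e for e in events}
    chain: List[TracedEvent] = []
    seen: Set[str] = set()
    current: Optional[str] = span_id
    while current is not None and current in by_span and current not in seen:
        seen.add(current)  # guards against malformed, cyclic parent links
        event = by_span[current]
        chain.append(event)
        current = event.parent_span_id
    chain.reverse()
    return chain


events = [
    TracedEvent("PurchaseRequested", "t1", "s1", None, "orchestrator", "1.4.0"),
    TracedEvent("GuitarReserved", "t1", "s2", "s1", "warehouse", "2.0.1"),
    TracedEvent("PaymentFailed", "t1", "s3", "s2", "payment", "3.2.0"),
]
for e in causal_chain(events, "s3"):
    print(e.name, e.publisher, e.version)
```

Starting from a failing event, an operator obtains the upstream events together with the publisher and version that produced each of them.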

To better illustrate our idea, in Section 6, we explore a Guitar Store as an ES use case and analyze scenarios where an event may be missing from the log, duplicated, or incorrect. We then address these scenarios with and without tracing and evaluate the differences. We extend this analysis to a case where system developers and operators can take advantage of tracing to eliminate unused code.
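To make the missing and duplicated event scenarios concrete, the following sketch groups log entries by trace identifier and compares each group with the events a workflow is expected to produce. The expected sequence, the `PaymentAccepted` event name and the `(trace_id, event_name)` log format are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, Iterable, List, Sequence, Tuple

# Hypothetical set of events that a successful purchase should produce.
EXPECTED: Sequence[str] = ("GuitarReserved", "PaymentAccepted")


def audit(
    log: Iterable[Tuple[str, str]],
    expected: Sequence[str] = EXPECTED,
) -> Dict[str, Dict[str, List[str]]]:
    """Report, per trace, expected events that are absent or repeated."""
    per_trace: DefaultDict[str, Counter] = defaultdict(Counter)
    for trace_id, name in log:
        per_trace[trace_id][name] += 1

    report: Dict[str, Dict[str, List[str]]] = {}
    for trace_id, counts in per_trace.items():
        missing = [n for n in expected if counts[n] == 0]
        duplicated = [n for n, c in counts.items() if c > 1]
        if missing or duplicated:
            report[trace_id] = {"missing": missing, "duplicated": duplicated}
    return report


entries = [
    ("t1", "GuitarReserved"), ("t1", "PaymentAccepted"),   # healthy request
    ("t2", "GuitarReserved"),                              # payment never logged
    ("t3", "GuitarReserved"), ("t3", "GuitarReserved"),    # duplicated event
    ("t3", "PaymentAccepted"),
]
print(audit(entries))
```

A request that produced no log events at all does not appear in this report. Detecting it would require the list of trace identifiers kept by the tracing backend, which is outside this sketch.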

We discuss possible implementations for adding tracing to ES and elaborate on the advantages of the idea in Section 7. In Section 8, we conclude the paper and enumerate some consequences of our proposal for developers and operators.

Section snippets

Background

According to Vernon (2013), Domain-Driven Design (DDD) is an approach to complex software development, in which software developers: (i) focus on a domain; (ii) work in close collaboration with domain experts; (iii) and use a ubiquitous language within the bounded context of a domain.

DDD divides the problem of creating software into bounded contexts aligned with real-world domains (e.g., an ontology). Reducing the scope of the problem enables domain experts to have a deep understanding of the

Related work

To contextualize our proposal, we look at existing approaches and how they compare. In particular, we review work related to system morphology extraction and root cause analysis. While the idea of fully recovering causality in a system is not new, the novelty of our approach lies in stitching together tracing and ES. By connecting state changes to runtime metadata, we generate valuable information for root-cause analysis, anomaly detection, profiling and other related goals.

Problem statement

As we mentioned in Section 2, strategies like retroactive events help deal with incorrect states of the system. However, it is not trivial to determine which events need compensation and the breadth of events they affect. External factors, such as software updates and interaction with external systems, result in state and interaction metadata that is not captured in the event log. Additionally, some internal interactions do not immediately materialize as state change: when a sell command

Formalizations

We propose to further explore the idea of Erb et al. (2016), by resorting to distributed tracing, instead of keeping vector clocks on the processes. Our argument is that, by definition, tracing is storing a relevant portion of all events and happens-before relationships, i.e., G = (E, T). Conversely, the need to have G is fulfilled by tracing, which often is already in place in large distributed systems (Netflix, 2019; zipkin.io, 2021). Combining tracing with event sourcing is a very natural

Evaluation

To validate our proposal we consider an Event-Sourced (ESd) online Guitar Store – shown in Fig. 2 – to demonstrate the advantages of tracing for observing the state of, and reasoning about, ESd systems.

The sample system is composed of three aggregates: payment, warehouse and supply management; as well as an orchestrator in charge of coordinating the aggregates. Aggregates produce the following log events:

  • GuitarReserved: upon receiving the instruction to purchase a guitar, the Warehouse

Discussion

Distributed tracing is highly complementary to ES as it adds a new dimension, by connecting it to execution metadata. To explore the proposed approach and its implications, this section discusses the current limitations of ES and how tracing can mitigate them. In particular, we discuss a few example use cases and how they are affected.

Conclusion

In this paper we argue that an event log does not have enough information for a complete audit of a distributed system. As far as we know, we are the first to suggest adding end-to-end tracing to event sourcing, as a way to mitigate the traceability problems felt by practitioners.

While ES offers many advantages in terms of replayability, isolation and data schema flexibility, for the purposes of debugging, it still provides an incomplete view of the system. Normally, debugging requires

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by research grants of the programs: Science Without Borders (Ciências sem Fronteiras - CsF), Brazil, Brazilian Space Agency (Agência Espacial Brasileira - AEB), Brazil and by Portuguese funds through the Foundation for Science and Technology, I.P., within the scope of the project CISUC – UID/CEC/00326/2020 and by the European Social Fund, through the Regional Operational Program Centro 2020.

References (58)

  • Gao, Zhi-peng et al.

    QoE/QoS driven simulated annealing-based genetic algorithm for web services selection

    J. China Univ. Posts Telecommun.

    (2009)
  • Adriansyah, Arya

    Aligning observed and modeled behavior

    (2014)
  • Almeida, Paulo Sergio et al.

    Interval tree clocks

  • Axon, 2021. URL:...
  • Clayman, Stuart et al.

    Monitoring service clouds in the future internet

  • de Medeiros, Ana Karla A. et al.

    Genetic process mining: an experimental evaluation

    Data Min. Knowl. Discov.

    (2007)
  • de Murillas, Eduardo González López et al.

    Process mining on databases: Unearthing historical data from redo logs

  • Debski, Andrzej et al.

    In search for a scalable & reactive architecture of a cloud application: CQRS and event sourcing case study

    IEEE Softw.

    (2017)
  • Eiben, Agoston E. et al.

    Introduction To Evolutionary Computing

    (2003)
  • Erb, Benjamin et al.

    On the potential of event sourcing for retroactive actor-based programming

  • Erb, Benjamin et al.

    Combining discrete event simulations and event sourcing

  • Erb, Benjamin et al.

    Consistent retrospective snapshots in distributed event-sourced systems

  • Erb, Benjamin et al.

    Consistent retrospective snapshots in distributed event-sourced systems

  • Erb, Benjamin et al.

    Log pruning in distributed event-sourced systems

  • Eugster, Patrick Th et al.

    The many faces of publish/subscribe

    ACM Comput. Surv.

    (2003)
  • Evans, Eric

    Domain-Driven Design: Tackling Complexity in the Heart of Software

    (2004)
  • Eventuate Framework, 2021. URL:...
  • Fidge, Colin J.

    Timestamps in Message-Passing Systems that Preserve the Partial Ordering

    (1987)
  • Fonseca, Rodrigo et al.

    X-trace: A pervasive network tracing framework

  • Fowler, Martin

    Event sourcing

    (2005)
  • Fowler, Martin

    Parallel model

    (2005)
  • Fowler, Martin

    Retroactive event

    (2005)
  • Fowler, J. et al.

    Causal distributed breakpoints

  • Garcia-Molina, Hector et al.

    Sagas

  • Goedertier, Stijn et al.

    Robust process discovery with artificial negative events

    J. Mach. Learn. Res.

    (2009)
  • Gonzalez Lopez de Murillas, E.

    Process mining on databases: extracting event data from real-life data sources

    (2019)
  • Gousios, Georgios et al.

    Aquarium: An extensible billing platform for cloud infrastructures

    (2012)
  • Hohpe, G.

    Your coffee shop doesn’t use two-phase commit [asynchronous messaging architecture]

    IEEE Softw.

    (2005)
  • Jaeger, 2021. URL:...

    Stanley Lima is a Ph.D. student at the University of Coimbra, Portugal, as well as a Senior Cloud Solution Architect at Ânima Educação group. He is affiliated to the Centre for Informatics and Systems of the University of Coimbra, Portugal, and to the Service Prototyping Lab (SPLab) at the Zurich University of Applied Sciences (ZHAW), Institute of Applied Information Technology in Winterthur Switzerland. His main research interests are reliable software and distributed systems, namely observability and tracing, cloud environments as well as consistency versus availability and related open challenges in cloud computing.

    Jaime Correia is a Ph.D. student at the Department of Informatics Engineering of the University of Coimbra and has previously received a B.Sc. in Informatics Engineering and an M.Sc. in Software Engineering from the same institution. His main research interests are distributed systems with focus on observability of distributed systems, particularly tracing, cloud environments as well as autonomic and elastic systems.

    Filipe Araujo is a Tenured Assistant Professor at the University of Coimbra, Portugal. He received his Ph.D. in 2006 from the University of Lisbon, Portugal. His main research topic is observability of fine-grained distributed systems. His research interests include cloud computing, microservices, monitoring, security, and other distributed systems topics. He is the author of the blog “Enterprise Application Integration” (http://eai-course.blogspot.com), which has over 100,000 page views from all over the world.

    Jorge Cardoso is currently Chief Architect for AIOps (artificial intelligence for IT operations) at Huawei Munich Research Center and Associate Professor at the University of Coimbra, Portugal. His current research involves the development of the next generation of AI-driven IT Operations tools and platforms. He has a Ph.D. from the University of Georgia, US and an M.Sc./B.Sc. in Informatics Engineering from the University of Coimbra.

    Editor: Earl Barr.