Invited Paper: Monotonicity and Opportunistically-Batched Actions in Derecho

Birman, Ken; Jha, Sagar; Milano, Mae; Rosa, Lorenzo; Song, Weijia; Tremel, Edward

doi:10.1007/978-3-031-44274-2_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14310)

Included in the following conference series:

International Symposium on Stabilizing, Safety, and Security of Distributed Systems

Abstract

Our work centers on a programming style in which a system separates data movement from control-data exchange, streaming the former over hardware-implemented reliable channels, while using a new form of distributed shared memory to manage the latter. Protocol decisions and control actions are expressed as monotonic predicates over the control data guarding protocol actions. Provable invariants about the protocol are expressed as effectively-common knowledge, which can be derived from the monotonic predicates in effect during a particular membership epoch. The methodology enables a natural style of code that is easy to reason about, and it runs efficiently on modern hardware. We used this approach to create Derecho, an optimal Paxos-based data replication library that sets performance records, and we believe it is broadly applicable to the construction of reliable distributed systems on high-bandwidth networks.

Notes

1. All-to-all exchange of control state would scale poorly in many settings, but no issue arises because Derecho is sharded: most activity occurs in tiny subgroups with just 2 or 3 members. We have experimented with far larger subgroups without problems. Future systems deploying Derecho in immense subgroups might need to exchange control data in a different manner, but the underlying principle of asynchronous updates and monotonic deduction of system state would still apply.
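
The principle of asynchronous updates and monotonic deduction can be illustrated with a minimal sketch. It is not Derecho's actual API: the names `ControlTable`, `publish`, `stable_prefix` and `deliverable` are hypothetical, and we assume each member publishes a single counter of messages received. Because counters only grow, a predicate such as "message k has been received by everyone" never reverts to false once it becomes true, so a reader may act on a batch of messages at once.

```python
class ControlTable:
    """Hypothetical shared control state: one monotonic counter per member."""

    def __init__(self, members):
        self.rows = {m: 0 for m in members}

    def publish(self, member, received_count):
        # Updates may arrive late or out of order; taking the max keeps
        # each counter monotonically non-decreasing.
        self.rows[member] = max(self.rows[member], received_count)

    def stable_prefix(self):
        # Number of messages known to be received by every member.
        return min(self.rows.values())


def deliverable(table, already_delivered):
    """Indices that can be delivered now, as one opportunistic batch."""
    return range(already_delivered, table.stable_prefix())


table = ControlTable(["a", "b", "c"])
table.publish("a", 5)
table.publish("b", 3)
table.publish("c", 4)
batch = list(deliverable(table, already_delivered=1))  # messages 1 and 2
```

A stale update such as `table.publish("b", 2)` leaves the stable prefix unchanged, which is what makes the deduction safe under asynchronous exchange.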

References

Agarwal, D.A., Moser, L.E., Melliar-Smith, P.M., Budhia, R.K.: The Totem multiple-ring ordering and topology maintenance protocol. ACM Trans. Comput. Syst. 16(2), 93–132 (1998). https://doi.org/10.1145/279227.279228
Alvaro, P., Conway, N., Hellerstein, J.M., Marczak, W.R.: Consistency analysis in bloom: a CALM and collected approach. In: Conference on Innovative Data Systems Research (2011)
Ameloot, T.J., Neven, F., Van Den Bussche, J.: Relational transducers for declarative networking. J. ACM 60(2) (2013). https://doi.org/10.1145/2450142.2450151
Behrens, J., Jha, S., Birman, K., Tremel, E.: RDMC: a reliable RDMA multicast for large objects. In: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Luxembourg City, Luxembourg, pp. 71–82. IEEE (2018). https://doi.org/10.1109/DSN.2018.00020
Birman, K., Joseph, T.: Exploiting virtual synchrony in distributed systems. In: SOSP 1987, Austin, Texas, USA, pp. 123–138. ACM (1987). https://doi.org/10.1145/41457.37515
Birman, K.: Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services. Texts in Computer Science. Springer, London (2012). https://doi.org/10.1007/978-1-4471-2416-0
Birman, K.P.: Replication and fault-tolerance in the ISIS system. SIGOPS Oper. Syst. Rev. 19(5), 79–86 (1985). https://doi.org/10.1145/323627.323636
Coquand, T., Huet, G.: Constructions: a higher order proof system for mechanizing mathematics. In: Buchberger, B. (ed.) EUROCAL 1985. LNCS, vol. 203, pp. 151–184. Springer, Heidelberg (1985). https://doi.org/10.1007/3-540-15983-5_13
Halpern, J.Y., Moses, Y.: Knowledge and common knowledge in a distributed environment. J. ACM 37(3), 549–587 (1990). https://doi.org/10.1145/79147.79161
Hawblitzel, C., et al.: IronFleet: proving safety and liveness of practical distributed systems. Commun. ACM 60(7), 83–92 (2017). https://doi.org/10.1145/3068608
Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12(3), 463–492 (1990). https://doi.org/10.1145/78969.78972
Jha, S., et al.: Derecho: fast state machine replication for cloud services. ACM Trans. Comput. Syst. (TOCS) 36(2), 1–49 (2019)
Jha, S., Rosa, L., Birman, K.P.: Spindle: techniques for optimizing atomic multicast on RDMA. In: 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), pp. 1085–1097 (2022). https://doi.org/10.1109/ICDCS54860.2022.00108
Junqueira, F., Reed, B.: ZooKeeper: Distributed Process Coordination, 1st edn. O’Reilly Media Inc., Sebastopol (2013)
Kashyap, V.: IP over InfiniBand (IPoIB) architecture. Technical report (2006)
Keidar, I., Shraer, A.: Timeliness, failure-detectors, and consensus performance. In: Proceedings of the Twenty-fifth Annual ACM Symposium on Principles of Distributed Computing, PODC 2006, Denver, Colorado, USA, pp. 169–178. ACM (2006). https://doi.org/10.1145/1146381.1146408
Kreps, J., Narkhede, N., Rao, J., et al.: Kafka: a distributed messaging system for log processing. In: Proceedings of the NetDB, vol. 11, pp. 1–7 (2011)
Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169 (1998). https://doi.org/10.1145/279227.279229
Lamport, L., Matthews, J., Tuttle, M., Yu, Y.: Specifying and verifying systems with TLA+. In: Proceedings of the 10th Workshop on ACM SIGOPS European Workshop, EW 10, Saint-Emilion, France, pp. 45–48. Association for Computing Machinery (2002). https://doi.org/10.1145/1133373.1133382
Liu, Y.A., Stoller, S.D., Lin, B., Gorbovitski, M.: From clarity to efficiency for distributed algorithms. In: Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA 2012, Tucson, Arizona, USA, pp. 395–410. ACM (2012). https://doi.org/10.1145/2384616.2384645
Network-Based Computing Laboratory at the Ohio State University: RDMA-based Apache Kafka (RDMA-kafka). https://hibd.cse.ohio-state.edu/kafka
Padon, O., McMillan, K.L., Panda, A., Sagiv, M., Shoham, S.: Ivy: safety verification by interactive generalization. SIGPLAN Not. 51(6), 614–630 (2016). https://doi.org/10.1145/2980983.2908118
Shivam, K., Paladugu, V., Liu, Y.: Specification and runtime checking of Derecho, a protocol for fast replication for cloud services. In: Proceedings of the 2023 Workshop on Advanced Tools, Programming Languages, and PLatforms for Implementing and Evaluating Algorithms for Distributed Systems, ApPLIED 2023, Orlando, Florida. ACM (2023). https://doi.org/10.1145/3584684.3597275

Acknowledgements

The authors are very grateful to Luis Rodriguez, who read an earlier draft of this paper and suggested many ways that it could be improved. The SSS 2023 reviewers were incredibly helpful. Our work was funded, in part, by grants from AFRL under its SWEC program, Microsoft Research and Siemens, and the experiments summarized here used hardware generously provided by NVIDIA and its Mellanox subsidiary.

Author information

Authors and Affiliations

Cornell University, Ithaca, USA
Ken Birman, Sagar Jha, Lorenzo Rosa & Weijia Song
UC Berkeley, Berkeley, USA
Mae Milano
University of Bologna, Bologna, Italy
Lorenzo Rosa
Augusta University, Augusta, USA
Edward Tremel

Corresponding author

Correspondence to Ken Birman.

Editor information

Editors and Affiliations

Ben-Gurion University of the Negev, Be'er-Sheva, Israel
Shlomi Dolev
New Jersey Institute of Technology, Newark, NJ, USA
Baruch Schieber

Appendices

The two appendices in this section provide additional detail going beyond the material in the body of the paper. Neither is needed to understand our main contributions. Appendix A offers two examples of common knowledge, drawing on examples from [9]. Appendix B discusses the connection between effectively-common knowledge and a tactic used when formally verifying protocols using provers that can fully automate subproofs provided that they are fully expressed in a decidable fragment of first-order logic (often, the subset that the Z3 SMT solver can handle). We considered but decided against including an appendix on RDMA (this kind of hardware has been actively discussed for at least a decade, and there is an excellent Wikipedia article covering the one-sided write feature we used), and on virtual synchrony (well known to the community since 1987).

Appendix A: Common Knowledge

A.1 Impossibility of Outdoor Dining in Seattle

Two friends work in Seattle, a city known for cloud cover and damp weather, but when the sun pops out they would prefer to meet outside. The complication is that both sometimes attend meetings in rooms lacking phone reception. A first idea is that if one of them notices that the weather is fine, they will text the other, who will confirm, and then they can meet outside for lunch.

“But wait”, says one to the other. “If I text you, but receive no reply, I will have to assume that my text was not received. In that case I would wait for you here, in the cafeteria.” “In fact,” replies the other, “I would have the symmetric problem: even if I do receive your text, I wouldn’t know that you received my confirmation, and would have no choice but to wait for you here in the cafeteria. And if you confirm my confirmation, that doesn’t help either!”

This is very strange. After all, once the initial text is confirmed, and the confirmation is confirmed, both are aware that it is a sunny day. Yet no matter how many messages they exchange, they do not converge to the identical state. An inductive analysis always leads to the cafeteria: their “default” option.

Both fall silent: the impossibility of meeting outside for lunch now being apparent. “Well,” says one, “if the weather is nice I’ll just send you a text and will be out here. No need to confirm. If you can’t make it, I’ll understand!”

This first example illustrates that (1) posed in this manner, logicians can only base “symmetric” decisions on existing common knowledge, and (2) no matter how many messages are exchanged, knowledge asymmetry cannot be eliminated. Of course, in real life we don’t need common knowledge (and sometimes, things happen, and we can’t join the lunch crowd).

Discussion: The insight to take from this first story is that distributed systems in which information must be observed (by some process) and then learned (by other processes) embody an asymmetry. When formalized, their members will never all be in the identical knowledge state, and attempts to achieve symmetry lead to unbounded yet ineffective exchanges of messages.
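
The inductive argument can be made concrete with a small sketch. It uses the nested-knowledge notation of [9], writing K_A(p) for "A knows p"; the function names are hypothetical, and we assume messages alternate A to B, B to A, and so on. Each delivered message adds exactly one level of nesting, so after any finite number of deliveries there is always a next level that does not yet hold, whereas common knowledge would require every level.

```python
def knowledge_chain(delivered: int) -> str:
    """Deepest nested-knowledge formula that holds after `delivered`
    messages have arrived (A observes the weather, then A->B, B->A, ...)."""
    agents = ("A", "B")
    formula = "sunny"
    # Level 0 is A's direct observation; each delivery adds one level.
    for level in range(delivered + 1):
        formula = f"K_{agents[level % 2]}({formula})"
    return formula


def missing_level(delivered: int) -> str:
    """Shallowest nested-knowledge formula that does NOT yet hold: the
    last sender cannot know whether its message was received."""
    return knowledge_chain(delivered + 1)


# After the text (1 delivery) B knows that A knows it is sunny,
# but A does not yet know that B knows this.
holds = knowledge_chain(1)
lacks = missing_level(1)
```

Here `holds` is `K_B(K_A(sunny))` and `lacks` is `K_A(K_B(K_A(sunny)))`. Sending one more confirmation only moves the gap one level deeper.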

In what way is this relevant to distributed computing? The main and perhaps only importance relates to specification and proof. It is very easy to write a specification that unintentionally requires common knowledge. However, such a statement must either be implied from the initial conditions (and hence vacuous), or if not, cannot be achieved by any protocol. A proof assistant can check the logic of a given proof, or even find certain kinds of proofs or counterexamples on its own, but will not signal this type of specification error. Thus a seemingly innocent mistake can lead to an impossible-to-prove specification. The person tasked with carrying out the proof would either give up or, more likely, abandon parts of the task. This last scenario should worry us: it suggests that there could be “proved correct” systems for critical tasks that actually ignore parts of the protocols used.

Effectively-common knowledge is in fact not identical to the common knowledge that Halpern and Moses considered in [9]. With effectively-common knowledge, we consider a modular system in which one module implements epochs, and the other modules run within epochs and simply trust the view and any annotations as if they were common knowledge. We carry out separate proofs for the two modules, then compose one system from the two modules. Our proof coverage is stronger, and the developer never confronts what would otherwise be an infeasible task.

A.2 The Inscription on the Cake was a Lie!

On Carol’s birthday, her friends come to play outside before lunch. It being Seattle, all are quite muddy when they enter the kitchen. “In this house we have a rule!”, proclaims her father, Ted. “No dessert for anyone who has a dirty face!”. His wording is ill-chosen, because no child likes to wash their face, and every child optimistically believes their own face to be clean until proven otherwise. None moves a hair, although all the children see one another’s dirty faces. Increasingly annoyed, Ted repeats himself a few times. But even after n repetitions (n being the number of children), no child has washed. Ted puts the cake to the side and sends them all to wash up.

Later he relents after Carol explains the inductive proof that justified their action. She first addresses \(n=1\). “Daddy, just the other day this happened. You told me I would need to wash if my face was dirty, but I was hoping it was clean.” “Carol,” replies Ted, “all you needed to do was to look in the mirror.” “But Daddy, the mirror is too high!”. Ted is forced to acknowledge that Carol would have had no way to deduce that her face must have been dirty.

“Now Daddy, consider \(n=2\). Timmy and I come in, both dirty. You remind us of the rule. But neither of us likes to wash our faces, and anyway, Timmy is mean and would love for me to not get cake and have to watch him enjoying it. And I feel the same! So we both look at each other, and I see that Timmy’s face is dirty, and he sees that mine is dirty, and neither of us moves.” Ted replies, “Yes Carol, but now your logic fails. I repeated myself.” “You did, Daddy. But I was hoping my face was clean. Timmy hoped that his was clean. So our decision not to go and wash up was consistent with one of us believing that neither of our faces was dirty, even if it was also consistent with one in which both of us had dirty faces. You didn’t give us enough information!”

At the next party, when the children come in from playing, Ted first says “Well, I see some very dirty faces here!” and then repeats the household rule n times. On the \(n^{th}\) repetition, all the children simultaneously rush to the sink and wash up. Beaming, Ted unveils a cake which is inscribed: “\(K^*\) is necessary and sufficient!” The children groan: A typical Seattle “dad joke.”

Later, Carol corners her dad. “Daddy, that was embarrassing! What if one of my friends hadn’t heard you clearly at the start!” Ted realizes that this is a valid criticism: was his initial statement genuinely common knowledge?

Discussion: Here, we illustrate another peculiarity of common knowledge. Even in classic problems such as muddy children, it is debatable that common knowledge is really being introduced dynamically. To the extent that this does occur, some form of assertion of trust is required: the participants trust that the mechanism that shared the new common knowledge is completely reliable.
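
The two parties can be replayed with a small possible-worlds simulation of the classic muddy children puzzle. This is a sketch under stated assumptions: the function name and encoding (1 for a dirty face, 0 for a clean one) are our own, every child is a perfect reasoner, and Ted's opening remark is modeled as publicly ruling out the world in which all faces are clean. It is only meaningful when that remark is true of the actual situation.

```python
from itertools import product


def muddy_children(actual, announce):
    """Return (round, movers): the first repetition of the rule on which
    some children can deduce that they are dirty, or (None, []) if nobody
    ever can within n + 1 repetitions."""
    n = len(actual)
    worlds = set(product((0, 1), repeat=n))
    if announce:
        worlds.discard((0,) * n)  # "I see some very dirty faces here!"

    def knows_dirty(i, world, possible):
        # Child i sees every face but its own. It knows it is dirty iff
        # all still-possible worlds matching what it sees have bit i == 1.
        same_view = [v for v in possible
                     if all(v[j] == world[j] for j in range(n) if j != i)]
        return all(v[i] == 1 for v in same_view)

    for rnd in range(1, n + 2):
        movers = [i for i in range(n) if knows_dirty(i, actual, worlds)]
        if movers:
            return rnd, movers
        # Nobody moved, which everyone observes: discard every world in
        # which some child would have been able to move this round.
        worlds = {w for w in worlds
                  if not any(knows_dirty(i, w, worlds) for i in range(n))}
    return None, []


first_party = muddy_children((1, 1, 1), announce=False)
second_party = muddy_children((1, 1, 1), announce=True)
```

With three dirty children, `first_party` is `(None, [])` and `second_party` is `(3, [0, 1, 2])`: without the announcement no amount of repetition helps, while with it all children move together on the n-th repetition.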

An epoch-based virtual synchrony system has an advantage here: to switch from epoch j-1 to epoch j, members definitely must receive and “install” the new view together with any additional data annotating it. Thus for process a to interact with process b as members of epoch j, it genuinely is the case that both have replicas of the new view. By proving that the group membership cannot partition into two logically distinct views, we arrive at a guarantee that the annotation can be treated like common knowledge. Ted, for example, waited until all the children were present and then assumed they would understand him.

A.3 Other Forms of Effectively-Common Knowledge

The example we offered in Sect. 1.2 focused on message ordering. What would be other uses for effectively-common knowledge?

A good place to start is with an old, classic, database partitioning scenario. When ATM machines were first introduced, they depended on dialup modems that were not always able to establish a connection (a flurry of ATM use could overload the central modem pool, leading to persistent busy signals). To fix the issue, banks introduced the idea of a “primary ATM”. Perhaps, Carol almost always uses the ATM machine at the intersection of Main Street and Old Market Avenue. The bank could give that ATM “ownership” of some of Carol’s current balance. For a withdrawal up to this limit, the ATM could authorize that transaction without first phoning the main office. Of course, the bank’s other ATMs would not be able to access Carol’s full balance: the bank has locked down this portion of her balance. But schemes were then proposed for dynamically adapting the policy.
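
The "primary ATM" idea can be sketched as a small escrow structure. The class and method names below are hypothetical, and the sketch ignores concurrency and the dynamic policy adaptation mentioned above: it only shows that a delegated portion of the balance can be spent without contacting the main office, while other ATMs see only what remains centrally.

```python
class Bank:
    """Hypothetical model of one customer's balance, part of which is
    delegated to ("owned by") specific ATMs."""

    def __init__(self, balance):
        self.central = balance   # portion held at the main office
        self.escrow = {}         # atm_id -> portion owned by that ATM

    def delegate(self, atm_id, amount):
        if amount > self.central:
            raise ValueError("cannot delegate more than the central balance")
        self.central -= amount
        self.escrow[atm_id] = self.escrow.get(atm_id, 0) + amount

    def withdraw(self, atm_id, amount, online):
        local = self.escrow.get(atm_id, 0)
        if amount <= local:
            # Authorized locally: no call to the main office is needed.
            self.escrow[atm_id] = local - amount
            return True
        if online and amount <= local + self.central:
            # Must phone the main office to draw on the central portion.
            self.central -= amount - local
            self.escrow[atm_id] = 0
            return True
        return False


bank = Bank(500)
bank.delegate("main_and_old_market", 200)
offline_ok = bank.withdraw("main_and_old_market", 150, online=False)
other_atm_ok = bank.withdraw("airport", 400, online=True)
```

In this run `offline_ok` is true because the primary ATM owns 200, while `other_atm_ok` is false because only 300 remains outside the locked-down portion.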

More broadly, effectively-common knowledge arises in situations where some form of policy will span a dynamically varying set of participants. If the participant set were non-varying, we would not really need effectively-common knowledge: totally ordered multicast would suffice. But if the set of participants changes and simultaneously we need a policy that depends on a nondeterministic decision or attribute of the members, it is hard to avoid an effectively-common knowledge model.

Our insight is that virtual synchrony epochs can be viewed as virtualizing many otherwise intractable behaviors and unachievable guarantees. Within an epoch, failures “do not occur”, hence protocols do not need to be fault-tolerant. Instead they can simply trust the view. And then when we realized that it would be faster to preagree on multicast delivery order in Derecho, we simply annotated the view with the ordering policy to use. The fully generalized case simply allows the application itself to provide additional annotations, which it can then treat as effectively-common knowledge once the epoch begins.
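
A minimal sketch of an annotated view follows. It is an illustration rather than Derecho's implementation (Derecho is a C++ library, and the names `View`, `propose_next_view` and `delivery_schedule` are hypothetical). The point is that once a view and its annotations are installed, every member can compute the same delivery order locally, with no further agreement during the epoch.

```python
from dataclasses import dataclass
from typing import Iterator, Mapping, Tuple


@dataclass(frozen=True)
class View:
    epoch: int
    members: Tuple[str, ...]
    annotations: Mapping[str, object]  # trusted as effectively-common knowledge


def propose_next_view(current: View, failed, joined) -> View:
    """Run by the membership leader: compute the next view and annotate it
    with the ordering policy to use throughout the new epoch."""
    members = tuple(m for m in current.members if m not in failed) + tuple(joined)
    annotations = {"delivery_order": "round_robin", "sender_ring": members}
    return View(current.epoch + 1, members, annotations)


def delivery_schedule(view: View, rounds: int) -> Iterator[Tuple[str, int]]:
    """Run by every member: derive the (sender, round) delivery order
    purely from the installed view."""
    assert view.annotations["delivery_order"] == "round_robin"
    for rnd in range(rounds):
        for sender in view.annotations["sender_ring"]:
            yield (sender, rnd)


v0 = View(0, ("a", "b", "c"), {"delivery_order": "round_robin",
                                "sender_ring": ("a", "b", "c")})
v1 = propose_next_view(v0, failed={"b"}, joined=["d"])
order = list(delivery_schedule(v1, rounds=2))
```

An application-supplied annotation would simply be another key in `annotations`, computed by the leader and read by members once the epoch begins.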

Appendix B: Higher-Order Protocol Components

Effectively-common knowledge in the context of virtually synchronous epochs enables a deductive strategy also seen in protocol verification. This statement may feel like a non sequitur: any protocol exchanges messages to gain information, and is designed to achieve a state in which it is safe to take whatever action the protocol embodies. Yet we do not normally think of formal reasoning of the kind used in protocol verification as offering ideas that can be directly useful in protocol design.

Developers of complex protocols have always struggled to prove them correct. Today this burden is much reduced: provers such as Dafny, TLA+ and Ivy are widely used to check the correctness of protocols [10, 19, 22]. DistAlgo, a specification and proof framework, goes even further, allowing rigorously specified protocols to be proved correct and even generating an executable verified code instance [20]. Less widely appreciated is that these tools struggle to overcome a significant expressivity limit. Today’s most popular provers operate by taking a specification and reducing it to a decidable logic formula expressed entirely in first order logic. The basic tactic is to form a conjunction of protocol invariants, invert it, and then use Z3 (an SMT solver) to search for a counterexample. If Z3 terminates, either it exhibits a counterexample and the protocol is not correct, or it finds none and the protocol is proved. If Z3 fails to terminate, the developer modifies assertions and then tries again. If a protocol is buggy, this yields a concrete example of how the bug can be triggered.
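
The "negate the invariant and search for a counterexample" tactic can be illustrated with a toy sketch. As a loudly stated assumption, an exhaustive search over a small bounded state space stands in for Z3 here; a real SMT solver reasons symbolically over unbounded domains. The toy protocol, its state (received, delivered), and all function names are hypothetical.

```python
from itertools import product


def find_counterexample(init, step, invariant, states):
    """Look for a violation of inductiveness: an initial state that breaks
    the invariant, or a state satisfying it with a successor that does not."""
    for s in states:
        if init(s) and not invariant(s):
            return ("init", s, None)
    for s in states:
        if invariant(s):
            for t in step(s):
                if not invariant(t):
                    return ("step", s, t)
    return None  # no counterexample within the bounded search


def init(s):
    return s == (0, 0)


def invariant(s):
    received, delivered = s
    return delivered <= received


def guarded_step(s):
    received, delivered = s
    yield (received + 1, delivered)        # a message arrives
    if delivered < received:               # guard: only deliver what arrived
        yield (received, delivered + 1)


def buggy_step(s):
    received, delivered = s
    yield (received + 1, delivered)
    yield (received, delivered + 1)        # bug: delivers without the guard


states = list(product(range(4), repeat=2))
ok = find_counterexample(init, guarded_step, invariant, states)
bug = find_counterexample(init, buggy_step, invariant, states)
```

Here `ok` is `None`, while `bug` is `("step", (0, 0), (0, 1))`: a concrete trace showing how the missing guard lets a message be delivered before it is received.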

The expressivity issue stems from the inability of first-order logic to capture and hence verify higher order properties, such as conditions that need to be expressed over traces, or progress conditions. However, encountering such an issue is not a dead end. In such systems it is also possible for a developer to combine hand-created higher order proofs with first order automated checking.

To see how this is done, we should start by noting that first-order provers normally support modularization of protocol proofs, allowing the user to isolate and reason about a component of the protocol without simultaneously reasoning about the rest of the system. One such component might be a “sub-protocol” for forming a collection of processes into a ring, which is relevant to our running example: Derecho used a ring to define the round-robin order for message delivery.

It may be surprising to realize that a ring is an example of a system property that cannot be expressed in a first order logic. The central issue is that first-order logics are limited to boolean variables, relations that take boolean inputs and output a boolean result, logical conjunctions and (with significant limitations) existential quantifiers. This model is not strong enough to define the natural numbers, or to talk about the natural order on the natural numbers, and for the same reason, it is not strong enough to express some properties that depend on protocol traces that represent runs. And, to be very specific, first order logic cannot verify a protocol that organizes a set of nodes into a ring.

Yet this is simply a limitation of first-order logic. There are many logics within which we do have access to the natural numbers, can reason about orderings and other properties, and can define a ring. For example, on a ring every process has a predecessor and a successor. Call these pred(a) and succ(a) for process a. Both are unique, and moreover there exists some integer k such that \(pred^k(a)=a\) and \(succ^k(a)=a\). The issue is that to the extent that Dafny and Ivy proofs are checked by Z3, we accept that it will be infeasible to verify protocol modules that maintain properties such as the ring one. There would be no problem doing this in a higher-order logic such as the one used in Coq, but the task will be much less automated: a human would need to carry out the proof, and perform many steps by hand.
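
For a concrete, finite set of processes the ring property is easy to check by direct computation, which is a different matter from proving that a protocol always constructs one. The sketch below uses hypothetical names and represents succ as a dictionary. As an assumption that the text leaves implicit, it also requires the cycle to cover every process, so that two disjoint cycles are not accepted as a ring.

```python
def ring_period(succ, start):
    """Smallest k >= 1 with succ^k(start) == start, or None if following
    succ from `start` never returns to it."""
    seen = set()
    node = start
    for k in range(1, len(succ) + 1):
        node = succ.get(node)
        if node is None or node in seen:
            return None          # fell off the map, or entered a side cycle
        if node == start:
            return k
        seen.add(node)
    return None


def is_ring(succ):
    """True iff succ is a bijection on its keys forming one single cycle."""
    nodes = set(succ)
    if not nodes or set(succ.values()) != nodes:
        return False             # some process lacks a unique predecessor
    return ring_period(succ, next(iter(nodes))) == len(nodes)


def invert(succ):
    """pred is the inverse of succ (well defined when succ is a bijection)."""
    return {b: a for a, b in succ.items()}


succ = {"a": "b", "b": "c", "c": "a"}
pred = invert(succ)
k_succ = ring_period(succ, "a")
k_pred = ring_period(pred, "a")
```

For this three-process ring both `k_succ` and `k_pred` equal 3, matching the statement that some k satisfies \(pred^k(a)=a\) and \(succ^k(a)=a\).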

The usual work-around is to provide a second proof framework in which a human developer can express higher order questions and carry out higher order proofs of protocol fragments that rely on higher order logic. To integrate such proofs into the first-order layer, they then need a way to export artifacts from these proofs back into first-order logic (and keep in mind: this cannot involve extending first order logic, which is a fixed and unchangeable aspect of the methodology).

The solution leverages the fact that first order logic can express relations: functions on first-order variables that perform some kind of logical computation and return true or false. We simply treat the higher order protocol as an uninterpreted black box that outputs relations magically populated with the correct content. Our higher order protocol component can be proved to correctly construct these relations. Then, having completed this proof, we can simply declare that “there exists a relation with the following properties”, using first-order logic to define those properties. In this way, the higher-order artifact can be reasoned about rigorously, then used as a tool by the first-order relation. This is how first-order systems deal with properties such as the ordering on the natural numbers.

Thus, from the perspective of the first order logic, succ, \(succ^k\), pred, \(pred^k\) and k are relations, but uninterpreted ones populated “elsewhere”. To reason about how they are constructed we use the higher-order prover. But if we simply need to describe a step in which a protocol takes some action, such as a node a passing a message to its successor, we can use an existential quantifier to assert that there exists a node b such that \(succ(a)=b\), and this uses only first-order logic, because the verifier doesn’t actually need to compute a value for a or b: it treats the logic statement as a universal property. The same is true for the assertion that in a ring, \(\exists k: ~succ^k(a)=a\). This statement is true for all rings, and for all members, and hence the first-order prover can make use of it without needing specific values.

Our realization was that these higher order objects and properties are a bit like effectively-common knowledge: the first-order layer of the protocol simply trusts that they exist and were properly created. By packaging effectively-common knowledge as an annotation to the view, we simplify the use of this idea. The developer writes software that runs in the membership leader and computes any desired annotations for the next membership view. One would potentially need to prove that module correct, in the higher-order logic. Having done so, the output of the module becomes effectively-common knowledge and can be treated as a well-known fact by processes running during the epoch. In effect, we compartmentalize an otherwise complicated, error-prone task.

We are not claiming that such steps magically make proofs trivial. In the case of Derecho, we are still faced with doing manual higher-order proofs for many properties. As an example, the termination condition for Derecho’s virtually synchronous view update protocol is a fixed-point: eventually either the system shuts down, or reaches a point where (1) some process believes itself to be the leader, and (2) it suspects every higher-ranked process, and (3) it gains consent for some sequence of membership updates, (4) that consent is obtained from a majority of the most recently active view, and from a majority of members of each proposed view, and (5) no process in the last of these proposed views suspects the leader. This is clearly not expressible in first-order logic, nor is it a trivial proof goal even when expressed in higher-order logic. Yet it is a feasible proof goal, and yields a progress condition for Derecho. We can even express optimality assertions as higher-order statements.
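
The five conditions can be written down as a check over a single snapshot of the system. This is only a sketch: the data layout and names are hypothetical, we assume that a process's rank is its position in the active view (earlier means higher-ranked), and a snapshot check says nothing about the fixed-point or progress argument itself, which is the part that needs higher-order reasoning.

```python
from typing import Dict, List, Sequence, Set


def is_majority(consenting: Set[str], view: Sequence[str]) -> bool:
    return 2 * len(consenting & set(view)) > len(view)


def view_change_may_commit(leader: str,
                           active_view: List[str],
                           proposed_views: List[List[str]],
                           suspects: Dict[str, Set[str]],
                           consent: Set[str]) -> bool:
    """Evaluate conditions (1)-(5) on one snapshot of the system state."""
    # (1) and (2): the would-be leader suspects every higher-ranked process.
    if leader not in active_view:
        return False
    higher_ranked = active_view[:active_view.index(leader)]
    suspected = suspects.get(leader, set())
    if any(p not in suspected for p in higher_ranked):
        return False
    # (3): there is some sequence of proposed membership updates.
    if not proposed_views:
        return False
    # (4): consent from a majority of the most recently active view and
    #      from a majority of each proposed view.
    if not is_majority(consent, active_view):
        return False
    if not all(is_majority(consent, v) for v in proposed_views):
        return False
    # (5): nobody in the last proposed view suspects the leader.
    return all(leader not in suspects.get(p, set())
               for p in proposed_views[-1])


active = ["p0", "p1", "p2", "p3", "p4"]
proposed = [["p1", "p2", "p3", "p4"]]
can_commit = view_change_may_commit(
    "p1", active, proposed,
    suspects={"p1": {"p0"}},
    consent={"p1", "p2", "p3"},
)
```

In this snapshot `can_commit` is true; if p3 also suspected p1, condition (5) would fail and the leader could not yet commit the new view.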

About this paper

Cite this paper

Birman, K., Jha, S., Milano, M., Rosa, L., Song, W., Tremel, E. (2023). Invited Paper: Monotonicity and Opportunistically-Batched Actions in Derecho. In: Dolev, S., Schieber, B. (eds) Stabilization, Safety, and Security of Distributed Systems. SSS 2023. Lecture Notes in Computer Science, vol 14310. Springer, Cham. https://doi.org/10.1007/978-3-031-44274-2_14

DOI: https://doi.org/10.1007/978-3-031-44274-2_14
Published: 30 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44273-5
Online ISBN: 978-3-031-44274-2
eBook Packages: Computer Science, Computer Science (R0)
