ABSTRACT
Distributed systems are inherently complex as they need to address the interplay between features like communication, concurrency, and failure. Due to the inherent complexity of these interacting features, it is typically not possible to systematically test these kind of systems; yet, unexpected and unlikely combinations of events might cause corner cases that are hard to find. But since these systems are running typically for long durations, these events are likely to materialize eventually and must be handled correctly. Caught in such a dilemma, students are able to experience the benefits of applying verification tools to check their own algorithms and implementations. Having executable models with automatically generated executions allows them to experiment with different solutions by iteratively adapting and refining their algorithms.
In this experience report, we report on our experience of teaching verification in a (hands-on) distributed systems course. We argue that broadcast algorithms provide a sweet spot in design and verification complexity. To this end, we give an implementation of these algorithms in Erlang and derive a TLA+ specification. TLA+ is a formal language for describing and reasoning about distributed and concurrent systems and provides a model checker, TLC, among other things. Our study reveals interesting parallels between the Erlang and TLA+ code, while exposing the challenges of formally modeling communication and parallelism in distributed systems. Presenting selected aspects of our course design, we aim to motivate the feasibility and need for introducing verification in close correspondence to programming tasks.
- Alzahrani, N., Spichkova, M., Blech, J.O. : From temporal models to property-based testing. In: Damiani, E., Spanoudakis, G., Maciaszek, L.A. (eds.) ENASE 2017-Proceedings of the 12th International Conference on Evaluation of Novel Approaches to Software Engineering, Porto, Portugal, April 28-29, 2017. pp. 241-246. SciTePress ( 2017 ), htps://doi.org/10.5220/0006340302410246Google Scholar
- Armstrong, J. : Making reliable distributed systems in the presence of software errors. Ph.D. thesis, Royal Institute of Technology, Stockholm, Sweden ( 2003 ), htp://erlang.org/download/armstrong_thesis_2003. pdfGoogle Scholar
- Birman, K.P., Joseph, T.A. : Reliable communication in the presence of failures. ACM Trans. Comput. Syst. 5 ( 1 ), 47-76 ( 1987 ), htps://doi.org/ 10.1145/7351.7478Google Scholar
- Cachin, C., Guerraoui, R., Rodrigues, L.E.T. : Introduction to Reliable and Secure Distributed Programming (2. ed.). Springer ( 2011 ), htps: //doi.org/10.1007/978-3-642-15260-3Google Scholar
- Cassar, I., Francalanza, A. : On synchronous and asynchronous monitor instrumentation for actor-based systems. In: Cámara, J., Proença, J. (eds.) Proceedings 13th International Workshop on Foundations of Coordination Languages and Self-Adaptive Systems, FOCLASA 2014, Rome, Italy, 6th September 2014. EPTCS, vol. 175, pp. 54-68 ( 2014 ), htps://doi.org/10.4204/EPTCS.175.4Google Scholar
- Défago, X., Schiper, A., Urbán, P. : Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Comput. Surv. 36 ( 4 ), 372-421 ( 2004 ), htps://doi.org/10.1145/1041680.1041682Google ScholarDigital Library
- Demirbas, M. : My experience with using tla+ in distributed systems class. htp://muratbufalo.blogspot.com/ 2014 /08/using-tla-forteaching-distributed.html, accessed: 2019-06-14Google Scholar
- Gotovos, A., Christakis, M., Sagonas, K. : Test-driven development of concurrent programs using concuerror. In: Rikitake, K., Stenman, E. (eds.) Proceedings of the 10th ACM SIGPLAN workshop on Erlang, Tokyo, Japan, September 23, 2011. pp. 51-61. ACM ( 2011 ), htps://doi. org/10.1145/2034654.2034664Google Scholar
- Gotovos, A., Christakis, M., Sagonas, K. : Test-driven development of concurrent programs using concuerror. In: Proceedings of the 10th ACM SIGPLAN Workshop on Erlang. p. 51-61. Erlang ' 11, Association for Computing Machinery, New York, NY, USA ( 2011 ), htps://doi.org/ 10.1145/2034654.2034664Google ScholarDigital Library
- Hadzilacos, V., Toueg, S.: A modular approach to fault-tolerant broadcasts and related problems. Tech. rep., USA ( 1994 )Google Scholar
- Jin, X., Li, X., Zhang, H., Foster, N., Lee, J., Soulé, R., Kim, C., Stoica, I. : Netchain: Scale-free sub-rtt coordination. In: 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). pp. 35-49. USENIX Association, Renton, WA ( 2018 ), htps://www.usenix.org/ conference/nsdi18/presentation/jinGoogle Scholar
- Lamport, L. : The temporal logic of actions. ACM Trans. Program. Lang. Syst. 16 ( 3 ), 872-923 ( 1994 ), htps://doi.org/10.1145/177492.177726Google ScholarDigital Library
- Lamport, L. : Specifying Systems, The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley ( 2002 )Google Scholar
- Lamport, L. : Teaching concurrency. SIGACT News 40 ( 1 ), 58-62 ( 2009 ), htps://doi.org/10.1145/1515698.1515713Google ScholarDigital Library
- Michael, E., Woos, D., Anderson, T., Ernst, M.D., Tatlock, Z. : Teaching rigorous distributed systems with eficient model checking. In: Proceedings of the Fourteenth EuroSys Conference 2019. EuroSys '19, Association for Computing Machinery, New York, NY, USA ( 2019 ), htps://doi.org/10.1145/3302424.3303947Google ScholarDigital Library
- Mu, S., Nelson, L., Lloyd, W., Li, J. : Consolidating concurrency control and consensus for commits under conflicts. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16). pp. 517-532 ( 2016 )Google Scholar
- Newcombe, C. : Why Amazon chose TLA+. In: Abstract State Machines, Alloy, B, TLA, VDM, and Z. pp. 25-39. Springer ( 2014 ). https://doi.org/10.1007/978-3-662-43652-3 Google ScholarCross Ref
- Newcombe, C., Rath, T., Zhang, F., Munteanu, B., Brooker, M., Deardeuf, M. : How Amazon web services uses formal methods. Communications of the ACM 58 ( 4 ), 66-73 ( 2015 )Google ScholarDigital Library
- Raynal, M. : Fault-Tolerant Message-Passing Distributed Systems-An Algorithmic Approach. Springer ( 2018 ), htps://doi.org/10.1007/978-3-319-94141-7Google Scholar
- Schneider, F.B. : Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Comput. Surv. 22 ( 4 ), 299-319 ( 1990 ), htps://doi.org/10.1145/98163.98167Google ScholarDigital Library
- Svensson, H., Fredlund, L., Earle, C.B. : A unified semantics for future erlang. In: Fritchie, S.L., Sagonas, K. (eds.) Proceedings of the 9th ACM SIGPLAN workshop on Erlang, Baltimore, Maryland, USA, September 30, 2010. pp. 23-32. ACM ( 2010 ), htps://doi.org/10.1145/1863509. 1863514Google Scholar
Index Terms
- Teaching practical realistic verification of distributed algorithms in Erlang with TLA+
Recommendations
Formal verification of a distributed dynamic reconfiguration protocol
CPP 2022: Proceedings of the 11th ACM SIGPLAN International Conference on Certified Programs and ProofsWe present a formal, machine checked TLA+ safety proof of MongoRaftReconfig, a distributed dynamic reconfiguration protocol. MongoRaftReconfig was designed for and implemented in MongoDB, a distributed database whose replication protocol is derived from ...
Specifying and verifying PLC systems with TLA+: A case study
We report on a method for formally specifying and verifying programmable logic controllers (PLCs) in the specification language TLA^+. The specification framework is generic. It separates the description of the environment from that of the controller ...
Specification and Verification of the Zab Protocol with TLA+
AbstractZooKeeper Atomic Broadcast (Zab) is an atomic broadcast protocol specially designed for ZooKeeper, which supports additional crash recovery. This protocol actually has been widely adopted by famous Internet companies, but there are few studies on ...
Comments