DOI: 10.1145/2429069.2429100

Fault tolerance via idempotence

Published: 23 January 2013

Abstract

Building distributed services and applications is challenging due to the pitfalls of distribution such as process and communication failures. A natural solution to these problems is to detect potential failures, and retry the failed computation and/or resend messages. Ensuring correctness in such an environment requires distributed services and applications to be idempotent.
In this paper, we study the inter-related aspects of process failures, duplicate messages, and idempotence. We first introduce a simple core language (based on the lambda calculus) inspired by modern distributed computing platforms. This language formalizes the notions of a service, duplicate requests, process failures, data partitioning, and local atomic transactions that are restricted to a single store.
We then formalize a desired (generic) correctness criterion for applications written in this language, consisting of idempotence (which captures the desired safety properties) and failure-freedom (which captures the desired progress properties).
We then propose language support in the form of a monad that automatically ensures failfree idempotence. A key characteristic of our implementation is that it is decentralized and does not require distributed coordination. We show that the language support can be enriched with other useful constructs, such as compensations, while retaining the coordination-free decentralized nature of the implementation.
We have implemented the idempotence monad (and its variants) in F# and C# and used our implementation to build realistic applications on Windows Azure. We find that the monad has low runtime overheads and leads to more declarative applications.
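
As a rough illustration of the idea in the abstract, the sketch below shows one way a retried or duplicated request can be made harmless using only local atomic updates on individual stores, with no coordination between them. It is not the paper's F#/C# implementation or its monad: the names `Store`, `run_once`, `transfer`, and `Crash` are hypothetical, the stores are in-memory dictionaries, and the "transaction" is simulated by committing a staged copy together with a log entry keyed by request id and step index.

```python
from typing import Any, Callable, Dict, Tuple


class Crash(Exception):
    """Simulated process failure (hypothetical, for illustration only)."""


class Store:
    """A hypothetical single store whose updates are applied atomically."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}
        # (request id, step index) -> result of that step
        self.log: Dict[Tuple[str, int], Any] = {}

    def run_once(self, key: Tuple[str, int],
                 action: Callable[[Dict[str, int]], Any]) -> Any:
        """Run `action` unless `key` is already logged; otherwise replay its result.

        The effect and the log entry are committed together, standing in
        for a local atomic transaction restricted to this one store.
        """
        if key in self.log:
            return self.log[key]
        staged = dict(self.data)  # work on a copy so a failure leaves no partial effect
        result = action(staged)
        self.data, self.log[key] = staged, result
        return result


def transfer(request_id: str, src: Store, dst: Store, amount: int,
             crash_after_debit: bool = False) -> int:
    """A two-step workflow over two stores; each step is deduplicated locally."""

    def debit(d: Dict[str, int]) -> int:
        d["balance"] = d.get("balance", 0) - amount
        return d["balance"]

    def credit(d: Dict[str, int]) -> int:
        d["balance"] = d.get("balance", 0) + amount
        return d["balance"]

    src.run_once((request_id, 0), debit)
    if crash_after_debit:
        raise Crash()
    return dst.run_once((request_id, 1), credit)


a, b = Store(), Store()
a.data["balance"] = 100
try:
    transfer("req-1", a, b, 30, crash_after_debit=True)  # fails after the debit
except Crash:
    pass
transfer("req-1", a, b, 30)  # retry: the debit is replayed, the credit runs
transfer("req-1", a, b, 30)  # duplicate request: both steps are replayed
assert a.data["balance"] == 70 and b.data["balance"] == 30
```

In this sketch the retry and the duplicate leave the same final state as a single failure-free execution, which is the informal reading of idempotence plus failure-freedom given above. The step index is passed by hand here; a monadic interface could thread it through automatically.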

Supplementary Material

JPG File (r1d2_talk4.jpg)
MP4 File (r1d2_talk4.mp4)


Published In

POPL '13: Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
January 2013
586 pages
ISBN:9781450318327
DOI:10.1145/2429069
  • ACM SIGPLAN Notices, Volume 48, Issue 1
    POPL '13
    January 2013
    561 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2480359
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. fault tolerance
  2. idempotence
  3. monad
  4. transaction
  5. workflow

Qualifiers

  • Research-article

Conference

POPL '13
Acceptance Rates

Overall Acceptance Rate 824 of 4,130 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 33
  • Downloads (last 6 weeks): 1
Reflects downloads up to 30 Jan 2025

Cited By

  • (2024) Fault-tolerance approaches for distributed and cloud computing environments: A systematic review, taxonomy and future directions. Concurrency and Computation: Practice and Experience 36:13. DOI: 10.1002/cpe.8081. Online publication date: 18-Mar-2024
  • (2023) A Type System for Safe Intermittent Computing. Proceedings of the ACM on Programming Languages 7:PLDI (736-760). DOI: 10.1145/3591250. Online publication date: 6-Jun-2023
  • (2023) Memento: A Framework for Detectable Recoverability in Persistent Memory. Proceedings of the ACM on Programming Languages 7:PLDI (292-317). DOI: 10.1145/3591232. Online publication date: 6-Jun-2023
  • (2023) Executing Microservice Applications on Serverless, Correctly. Proceedings of the ACM on Programming Languages 7:POPL (367-395). DOI: 10.1145/3571206. Online publication date: 11-Jan-2023
  • (2023) Optimizing Data Stream Throughput for Real-Time Applications. Big Data Intelligence and Computing (410-417). DOI: 10.1007/978-981-99-2233-8_29. Online publication date: 1-May-2023
  • (2021) Much ADO about failures: a fault-aware model for compositional verification of strongly consistent distributed systems. Proceedings of the ACM on Programming Languages 5:OOPSLA (1-31). DOI: 10.1145/3485474. Online publication date: 15-Oct-2021
  • (2020) Towards a formal foundation of intermittent computing. Proceedings of the ACM on Programming Languages 4:OOPSLA (1-31). DOI: 10.1145/3428231. Online publication date: 13-Nov-2020
  • (2020) A Reactive Batching Strategy of Apache Kafka for Reliable Stream Processing in Real-time. 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE) (207-217). DOI: 10.1109/ISSRE5003.2020.00028. Online publication date: Oct-2020
  • (2020) Learning to Reliably Deliver Streaming Data with Apache Kafka. 2020 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (564-571). DOI: 10.1109/DSN48063.2020.00068. Online publication date: Jun-2020
  • (2019) Safer smart contract programming with Scilla. Proceedings of the ACM on Programming Languages 3:OOPSLA (1-30). DOI: 10.1145/3360611. Online publication date: 10-Oct-2019
