A survey on reliable distributed communication

https://doi.org/10.1016/j.jss.2017.03.028

Highlights

  • A body of knowledge regarding reliable distributed communication.

  • Synthesis of technical solutions for reliable distributed communication.

  • Analysis of applications, their reliability requirements, and solutions used.

  • Discussion of gaps between the state of the art and solutions used in the field.

  • Identification of open research lines in the field of reliable communication.

Abstract

From entertainment to personal communication, and from business to safety-critical applications, the world increasingly relies on distributed systems. Despite looking simple, distributed systems hide a major source of complexity: tolerating faults and component crashes is very difficult, due to the incompleteness of (remote) knowledge. The need to overcome this problem, and to provide different guarantees to applications, sparked a huge research effort and resulted in a large body of communication protocols and middleware. Thus, it is worthwhile to survey the state of the art in distributed systems, with a particular emphasis on reliable communication. We discuss key concepts in reliable communication, such as interaction patterns (e.g., one-way vs. request-response, synchronous vs. asynchronous), reliability semantics (e.g., at-least-once, at-most-once), and reliability targets (e.g., message, conversation), and we analyze a wide set of current communication solutions, which map to the different concepts. Building on these concepts, we analyze applications that have different reliable communication needs. As a result, we observe that, in most cases, elaborate communication solutions offering superior guarantees are purely academic efforts that cannot compete with the popularity and maturity of established, albeit poorer, solutions. Based on our analysis, we identify and discuss open research topics in this area.

Introduction

For many distributed applications supporting businesses and services, reliable communication, i.e., communication that can justifiably be trusted (Avizienis et al., 2004), is of vital importance. In general, two different communication models are used to accomplish communication between distributed application peers: point-to-point (also known as unicast), and multicast (including broadcast communication) (Tanenbaum and Steen, 2006). In the point-to-point communication model, a message is sent from one application peer to another peer, whereas in the multicast communication model, a message is sent from one application peer to several other peers. In most applications, including critical ones where reliable communication is a primary concern (Rushby, 1994), e.g., in healthcare, e-commerce, or banking, the point-to-point model is, by far, the most popular means of interaction (Zhang et al., 2010). Even when several peers are involved in a distributed communication, e.g., for sharing a file or exchanging emails, communication is still predominantly point-to-point, often through some intermediate server, which is responsible for properly handling data for the peers involved.

Depending on the application’s specific objectives, very different concerns may apply when the goal is to achieve reliable communication. Clearly, no application can deliver service that can justifiably be trusted if it is unreliable, or if it is supported by unreliable communication mechanisms. Disruption in services caused by unreliable components can not only result in huge direct losses, in the form of human lives, financial costs, or others, but also bring severe indirect costs, for example, in terms of reputation (Jones et al., 2000). However, ensuring the reliability of communication is a very difficult task, especially considering the unreliable nature of the Internet and of applications (Fekete et al., 1993; Halpern, 1987; Gray, 1979). These can exhibit a large spectrum of failures, originating in virtually any component. When the network or one of the peers crashes and restarts, client and server need to engage in a complex process of rolling back to some consistent state (Chandy and Lamport, 1985). This is a complex distributed process that almost always lacks any support from the communication stack. For instance, when using the Transmission Control Protocol (TCP) (Postel, 1981), which is usually considered reliable, peers have no mechanism to know which data to resend once a connection breaks.
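To make the TCP example concrete, the following is a minimal sketch of the kind of bookkeeping an application would need to add on top of a stream to know which data to resend after a reconnection. It is an illustration under stated assumptions, not a mechanism taken from the surveyed solutions: the class name `ResendBuffer`, its methods, and the cumulative acknowledgment reported by the peer application are all hypothetical, and the network itself is not modeled.

```python
from collections import OrderedDict


class ResendBuffer:
    """Sender-side log of messages not yet acknowledged by the peer application.

    Hypothetical helper: TCP itself exposes nothing like this to applications.
    """

    def __init__(self):
        self._next_seq = 0
        self._unacked = OrderedDict()  # seq -> payload, kept in send order

    def send(self, payload):
        """Assign a sequence number and keep the payload until it is acknowledged."""
        seq = self._next_seq
        self._next_seq += 1
        self._unacked[seq] = payload
        return seq

    def acknowledge(self, last_seq_received):
        """Drop every message the peer reports having received (cumulative ack)."""
        for seq in [s for s in self._unacked if s <= last_seq_received]:
            del self._unacked[seq]

    def pending(self):
        """Messages to resend after a reconnection, oldest first."""
        return list(self._unacked.items())


buf = ResendBuffer()
for chunk in (b"a", b"b", b"c", b"d"):
    buf.send(chunk)

# Assumed scenario: the connection breaks, and on reconnection the peer
# application reports that it processed everything up to sequence number 1.
buf.acknowledge(1)
to_resend = buf.pending()
print(to_resend)
```

In this sketch the sender can only discard a message once the peer application, rather than the transport layer, confirms it, which is the information a broken TCP connection does not provide.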

Our overview of reliability in distributed interactions, in Section 2, makes it very clear that the acknowledgments of a transport layer protocol, such as TCP, cannot solve all reliability problems, because applications display a large range of different interaction patterns. For instance, messages may or may not need a response; senders may need to wait for an acknowledgment of the application itself, or they may accept such acknowledgment at a later time; peers may need to be running at the same time, or they might be decoupled by persistent storage. Moreover, applications have different reliability semantics, depending on their characteristics and goals (Tanenbaum and Steen, 2006). For example, file sharing needs ordered and guaranteed delivery of messages — no gaps or byte swaps would be acceptable in a file; bank transfer orders need these properties and more, because payments should be retried in case they fail, but must not occur more than once. Furthermore, TCP only takes care of byte streams, but byte streams are only one of the targets to care for: different applications, such as publish-subscribe, may also require reliability for an entire message or an object, while banking applications may require an entire conversation to be reliable.
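The bank transfer requirement (retry on failure, but never execute more than once) can be sketched by combining sender-side retransmission with receiver-side duplicate filtering. The code below is a minimal in-memory illustration, assuming a hypothetical `Bank` receiver, a hypothetical `LossyChannel` that delivers the request but loses some replies, and a client-generated request identifier; a real system would also need the table of processed requests to survive crashes of the receiver.

```python
import uuid


class Bank:
    """Hypothetical receiver that filters duplicates by request identifier."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # request_id -> result of the first execution

    def transfer(self, request_id, amount):
        if request_id in self._results:
            # Duplicate request: replay the stored result, do not pay again.
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


class LossyChannel:
    """Hypothetical channel that loses the reply of the first `drops` calls."""

    def __init__(self, bank, drops):
        self.bank = bank
        self.drops = drops

    def call(self, request_id, amount):
        result = self.bank.transfer(request_id, amount)  # request is delivered
        if self.drops > 0:
            self.drops -= 1
            raise TimeoutError("reply lost")
        return result


def transfer_with_retry(channel, amount, max_attempts=5):
    """Retry until a reply arrives, reusing one identifier for all attempts."""
    request_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return channel.call(request_id, amount)
        except TimeoutError:
            continue
    raise RuntimeError("transfer not confirmed")


bank = Bank()
channel = LossyChannel(bank, drops=2)
final_balance = transfer_with_retry(channel, 100)
print(final_balance, bank.balance)
```

Retransmission alone would give at-least-once behavior and pay three times in this scenario; the identifier check on the receiver is what keeps the payment from occurring more than once.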

Each of the above-mentioned targets, along with the reliability semantics or interaction pattern, requires its own specific solution, such as logging, retransmission, or message filtering, to name just a few. In theory, no developer needs to implement such mechanisms from scratch: he or she should rely on available middleware to achieve (at least some of) the desired goals. As we see in Section 3, where we review a large number of protocols, libraries and Application Programming Interfaces (APIs) for stream, message, object and conversation-based applications, this middleware exists in vast amounts. In practice, some of these solutions are similar to each other, but target different operating systems and languages; some never gained traction; others are purely research works.

In Section 4, we provide evidence supporting the view that much of the above-mentioned work on middleware was, to some extent, a purely academic effort. We categorize distributed applications that require reliable communication and identify their reliability requirements. From this effort, it becomes very clear that only a few solutions have actually thrived. We can narrow down the successful options to TCP, the Hypertext Transfer Protocol (HTTP) (Krishnamurthy and Rexford, 2001), and a few more, including message-oriented middleware. The limited number of choices carries a clear penalty for developers. Depending on the application, they must manage most communication issues themselves: keeping track of all the peers involved in the interaction; opening and closing TCP connections; detecting faulty TCP connections and handling subsequent reconnections; or detecting and avoiding duplicate HTTP requests. This is far from ideal, because it is complex, error-prone, and requires a very high level of expertise.

In summary, in this paper we survey and synthesize the state of the art in distributed systems, with particular emphasis on reliable communication. To do so, we searched three information sources (Google Scholar, the IEEE Xplore digital library, and the ACM digital library) using key terms related to each of the topics discussed in this paper (e.g., reliable distributed communication, reliability mechanisms). We also tracked the citations to the papers previously identified, to find further relevant work. This endeavour aims to: (1) outline the body of knowledge on reliable communication, by collecting the main related concepts (Section 2); (2) review the most important reliable communication solutions and identify their characteristics according to key reliable communication aspects, such as reliability semantics (e.g., at-most-once, exactly-once) or interaction patterns (e.g., request-response, one-way) (Section 3); (3) categorize well-known applications requiring reliable distributed communication, according to key reliable communication aspects, and identify their requirements in terms of reliability (Section 4); and (4) discover the gaps between the applications' requirements and the existing solutions and, accordingly, provide insights into future research possibilities (Section 5).

The analysis carried out in this paper ended up being especially complex, not only given the huge number of combinations of concepts, configurations, and solutions, but also considering the multiple definitions, sometimes overlapping or contradictory, present in the literature (Birman, 1997; Avizienis et al., 2004; Elnozahy et al., 2002; Popescu et al., 2007; Tay and Ananda, 1990; Coulouris et al., 2005). Despite the ample influence that research has on the development of successful communication solutions (Emmerich et al., 2008), we observed that, in many cases, elaborate communication solutions offering a larger number of guarantees are purely academic efforts that can by no means compete with the popularity, maturity and importance of older, more established, albeit poorer, solutions. This suggests that research and development work in libraries, APIs, design solutions, and protocols is still necessary to build the reliable distributed systems of the future.

Section snippets

Reliability in distributed interactions

In this section, we review the main concepts concerning reliable distributed communication from the application layer perspective. In addition to essential definitions regarding reliable distributed systems, we discuss the following key aspects:

  • Interaction patterns (e.g., one-way or asynchronous) of distributed applications;

  • Types of failures (e.g., omission failures) that can threaten reliable communication;

  • Reliability semantics, which refer to the conceptual levels of reliability (e.g.,

Solutions for reliable communication

In this section, we overview several existing solutions for building reliable distributed applications. We divide these solutions depending on their reliability targets (i.e., stream, message, object, or conversation) and describe them according to the other reliability features, such as reliability mechanisms and reliability semantics.

Applications and reliability requirements

The number and diversity of distributed applications requiring reliable communication are quite large. In this section, we present a set of nine well-known groups of applications with distinctive communication features, and analyze their reliability requirements and the communication solutions commonly used to develop such applications. We organize applications according to the following features (please refer to Table 5): objectives (a), criticality (b), timeliness

Discussion and open research topics

In this section, we highlight and discuss the main findings that resulted from this survey work. We also identify, based on the analysis and discussion, what we believe are currently open research topics to pursue in the field of reliable distributed communication.

We initiated the paper by presenting the main concepts involved in distributed communication in Section 2, where we synthesized and discussed key aspects of reliable distributed communication, such as the main type of interactions,

Conclusion

Achieving reliable distributed communication can be a difficult task. This is especially true if we consider the variety of requirements that applications have and the lack of proper solutions that match those requirements. Application needs vary greatly in many dimensions, such as the semantics needed (e.g., at-most-once, ordered delivery), the types of failures to handle, or the target of the reliable communication (e.g., a byte stream, or an entire message). The available

Naghmeh Ivaki received the PhD degree in 2016 from the University of Coimbra, Portugal. She obtained her Master of Science in Information Technology Engineering at the Faculty of Engineering of the Tarbiat Modares University (TMU), Tehran, Iran, in December 2008. Her PhD work is in the domain of dependable distributed systems. She has authored more than 10 papers in international conferences and workshops.

References (138)

  • N. Aghdaie et al.

    Coral: a transparent fault-tolerant web service

    J. Syst. Software

    (2009)
  • M.N. Azaiez et al.

    Optimal resource allocation for security in reliability systems

    Eur. J. Oper. Res.

    (2007)
  • H. Abie et al.

    Robust, secure, self-adaptive and resilient messaging middleware for business critical systems

    Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns.

    (2009)
  • L. Alvisi et al.

    Wrapping server-side TCP to mask connection failures

    IEEE International Conference on Computer Communications (INFOCOM)

    (2001)
  • L. Alvisi et al.

    Trade-offs in implementing causal message logging protocols

    Proceedings of the Fifteenth Annual ACM Symposium on Principles of Distributed Computing

    (1996)
  • A. Avizienis et al.

    Basic concepts and taxonomy of dependable and secure computing

    IEEE Trans. Depend. Secure Comput.

    (2004)
  • Banks, A., Challenger, J., Clarke, P., Davis, D., King, R., Witting, K., Donoho, A., Holloway, T., Ibbotson, J., Todd,...
  • R. Barga et al.

    Persistent applications via automatic recovery

    Database Engineering and Applications Symposium, 2003. Proceedings. Seventh International

    (2003)
  • Bilorusets, R., Box, D., Cabrera, L. F., Davis, D., Ferguson, D., Ferris, C., Freund, T., Hondo, M. A., Ibbotson, J.,...
  • K.P. Birman

    Building Secure and Reliable Network Applications

    (1997)
  • A.D. Birrell et al.

    Implementing remote procedure calls

    ACM Trans. Comput. Syst. (TOCS)

    (1984)
  • A. Bouteiller et al.

    Coordinated checkpoint versus message log for fault tolerant MPI

    Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on

    (2003)
  • B.S. Boutros et al.

    A two-phase commit protocol and its performance

    Database and Expert Systems Applications, 1996. Proceedings., Seventh International Workshop on

    (1996)
  • R. Braden et al.

    Computing the Internet checksum

    ACM SIGCOMM Comput. Commun. Rev.

    (1989)
  • F. Brosch et al.

    Reliability prediction for fault-tolerant software architectures

    Proceedings of the Joint ACM SIGSOFT Conference–QoSA and ACM SIGSOFT Symposium–ISARCS on Quality of Software Architectures–QoSA and Architecting Critical Systems–ISARCS

    (2011)
  • A. Buchmann et al.

    Complex event processing

    it-Information Technology Methoden und innovative Anwendungen der Informatik und Informationstechnik

    (2009)
  • N. Burton-Krahn

    Hotswap-transparent server failover for Linux

    LISA

    (2002)
  • N. Carvalho et al.

    Scalable QoS-based event routing in publish-subscribe systems

    Fourth IEEE International Symposium on Network Computing and Applications

    (2005)
  • S.T. Chakradhar et al.

    Best-effort computing: re-thinking parallel software and hardware

    Proceedings of the 47th Design Automation Conference

    (2010)
  • S. Chakravorty et al.

    A fault tolerance protocol with fast fault recovery

    IEEE International Parallel and Distributed Processing Symposium (IPDPS)

    (2007)
  • S. Chakravorty et al.

    Proactive fault tolerance in MPI applications via task migration

    High Performance Computing-HiPC

    (2006)
  • K.M. Chandy et al.

    Distributed snapshots: determining global states of distributed systems

    ACM Trans. Comput. Syst. (TOCS)

    (1985)
  • Z. Chen et al.

    Algorithm-based fault tolerance for fail-stop failures

    IEEE Trans. Parallel Distrib. Syst.

    (2008)
  • G.F. Coulouris et al.

    Distributed Systems: Concepts and Design

    (2005)
  • Crispin, M. R., 2003. Internet message access protocol-version...
  • F. Cristian

    Understanding fault-tolerant distributed systems

    Commun. ACM

    (1991)
  • W.S. Dantas et al.

    Not quickly, just in time: improving the timeliness and reliability of control traffic in utility networks

    Networks

    (2009)
  • T.B. Downing

    Java RMI: Remote Method Invocation

    (1998)
  • K. Driscoll et al.

    Byzantine fault tolerance, from theory to reality

  • K. Dutta et al.

    User action recovery in internet sagas (isagas)

    Technologies for E-Services

    (2001)
  • C. Dwyer et al.

    Trust and privacy concern within social networking sites: a comparison of Facebook and MySpace

    AMCIS 2007 proceedings

    (2007)
  • I.P. Egwutuoha et al.

    A proactive fault tolerance approach to high performance computing (HPC) in the cloud

    Cloud and Green Computing (CGC), 2012 Second International Conference on

    (2012)
  • R. Ekwall et al.

    Robust TCP connections for fault tolerant computing

    Parallel and Distributed Systems, 2002. Proceedings. Ninth International Conference on

    (2002)
  • E.N. Elnozahy et al.

    A survey of rollback-recovery protocols in message-passing systems

    ACM Comput. Surv. (CSUR)

    (2002)
  • W. Emmerich et al.

    The impact of research on the development of middleware technology

    ACM Trans. Software Eng. Methodol. (TOSEM)

    (2008)
  • P.T. Eugster et al.

    The many faces of publish/subscribe

    ACM Comput. Surv. (CSUR)

    (2003)
  • Evans, C., Chappell, D., Bunting, D., Tharakan, G., Shimamura, H., Durand, J., Mischkinsky, J., Nihei, K., Iwasa, K.,...
  • Feather, C. D., 2006. Network news transfer protocol (NNTP), RFC 3977,...
  • A. Fekete et al.

    The impossibility of implementing reliable communication in the face of crashes

    J. ACM (JACM)

    (1993)
  • W.-c. Feng et al.

    Priority-based technique for the best-effort delivery of stored video

    Electronic Imaging’99

    (1998)
  • Ferrari, D., 1990. Client requirements for real-time communication...
  • Fette, I., Melnikov, A., 2011. The websocket...
  • M.J. Fischer et al.

    Impossibility of distributed consensus with one faulty process

    J. ACM (JACM)

    (1985)
  • R. Frei et al.

    Advances in complexity engineering

    Int. J. Bio-Inspired Comput.

    (2011)
  • J. Galdun et al.

    Distributed control systems reliability: consideration of multi-agent behavior

    6th International Symposium on Applied Machine Intelligence and Informatics

    (2008)
  • E. Gamma et al.

    Design Patterns: Elements of Reusable Object-Oriented Software

    (1994)
  • N. Garg

    Apache Kafka

    (2013)
  • D. Garlan et al.

    Rainbow: architecture-based self-adaptation with reusable infrastructure

    Computer

    (2004)
  • F.C. Gartner

    Fundamentals of fault-tolerant distributed computing in asynchronous environments

    ACM Comput. Surv. (CSUR)

    (1999)
  • D. Gawlick

    Messaging/queuing in Oracle8

    IEEE 29th International Conference on Data Engineering (ICDE)

    (1998)

    Nuno Laranjeiro received the PhD degree in 2012 from the University of Coimbra, Portugal, where he is currently an assistant professor. His research focuses on robust software services as well as experimental dependability evaluation, web services interoperability, services security, and enterprise application integration. He has authored more than 40 papers in refereed conferences and journals in the dependability and services computing areas. He has been involved in several international projects on these topics.

    Filipe Araujo is an Assistant Professor at the University of Coimbra, Portugal. He received his degree in Electrical Engineering in 1996 and his M.Sc. in Informatics Engineering in 1999, both from the University of Coimbra. He received his PhD in 2006 from the University of Lisbon. His current research interests are focused on parallel, grid and cloud computing. He has participated in several national and international projects.