ER-TCP: an efficient TCP fault-tolerance scheme for cluster computing

The Journal of Supercomputing

Abstract

This paper proposes a novel scheme, named ER-TCP, which transparently masks failures of server nodes in a cluster from clients at TCP connection granularity. In this scheme, TCP connections at the server side are actively and fully replicated so that they remain consistent and can be transplanted to the healthy part of the cluster when a failure occurs. A log mechanism is designed to cooperate with the replication so that the scheme incurs only a small penalty on communication performance and scales beyond a few nodes, even when the nodes have different processing capacities. We built a prototype system with ER-TCP on a four-node cluster and conducted a series of experiments on it. The experimental results show that ER-TCP imposes a relatively small penalty on communication performance, especially when it is used to synchronize multiple replicas. Results with real applications show that ER-TCP incurs a small performance cost for a web server under light load, and that it can be used to distribute files very efficiently and reliably.
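
The abstract does not give the details of the log mechanism, so the following is only a minimal sketch of the general idea, under stated assumptions: client bytes are buffered in a log, each replica consumes them at its own pace, and a byte is discarded only once every live replica has consumed it. The names `Replica`, `ReplicationLog`, `deliver`, and `fail` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical server-side replica of one TCP connection."""
    name: str
    consumed: int = 0      # stream offset this replica has processed so far
    alive: bool = True
    received: bytearray = field(default_factory=bytearray)


class ReplicationLog:
    """Illustrative log: keeps client bytes until all live replicas have them."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.base = 0              # stream offset of the first buffered byte
        self.buffer = bytearray()

    def append(self, data: bytes) -> None:
        """Record bytes received from the client."""
        self.buffer += data

    def deliver(self, replica: Replica, max_bytes: int) -> int:
        """Hand up to max_bytes of unread data to one replica."""
        if not replica.alive:
            return 0
        start = replica.consumed - self.base
        chunk = self.buffer[start:start + max_bytes]
        replica.received += chunk
        replica.consumed += len(chunk)
        self._trim()
        return len(chunk)

    def _trim(self) -> None:
        """Drop bytes that every live replica has already consumed."""
        live = [r.consumed for r in self.replicas if r.alive]
        if not live:
            return
        low = min(live)
        del self.buffer[:low - self.base]
        self.base = low

    def fail(self, replica: Replica) -> Replica:
        """Mark a replica as failed and return the most advanced survivor."""
        replica.alive = False
        survivors = [r for r in self.replicas if r.alive]
        if not survivors:
            raise RuntimeError("no healthy replica left")
        self._trim()
        return max(survivors, key=lambda r: r.consumed)


# Usage: a fast and a slow replica share one client stream.
fast, slow = Replica("fast"), Replica("slow")
log = ReplicationLog([fast, slow])
log.append(b"GET /index.html")
log.deliver(fast, 15)          # the fast replica reads everything at once
log.deliver(slow, 4)           # the slow replica lags behind
survivor = log.fail(fast)      # the fast node fails; the slow one takes over
log.deliver(survivor, 100)     # the log still holds what the survivor missed
```

Because the log retains data until the slowest live replica has consumed it, a fast node does not have to wait for a slow one, and a surviving replica can still catch up after a failure. Whether ER-TCP bounds or trims its log in this way is not stated in the abstract.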

Author information

Correspondence to Hai Jin.

About this article

Cite this article

Shao, Z., Jin, H., Cheng, B. et al. ER-TCP: an efficient TCP fault-tolerance scheme for cluster computing. J Supercomput 43, 127–145 (2008). https://doi.org/10.1007/s11227-007-0123-7

  • DOI: https://doi.org/10.1007/s11227-007-0123-7
