ER-TCP: an efficient TCP fault-tolerance scheme for cluster computing

Shao, Zhiyuan; Jin, Hai; Cheng, Bin; Jiang, Wenbin

doi:10.1007/s11227-007-0123-7

Published: 06 April 2007

Volume 43, pages 127–145, (2008)

The Journal of Supercomputing

Abstract

This paper proposes a novel scheme, named ER-TCP, which transparently masks failures of the server nodes of a cluster from clients at TCP connection granularity. In this scheme, TCP connections at the server side are actively and fully replicated and kept consistent, so that they can be transplanted to healthy parts of the cluster when a failure occurs. A log mechanism is designed to cooperate with the replication; it keeps the cost to communication performance small and lets the scheme scale beyond a few nodes, even when the nodes have different processing capacities. We built a prototype of ER-TCP on a four-node cluster and conducted a series of experiments on it. The experimental results show that ER-TCP imposes a relatively small penalty on communication performance, especially when it is used to synchronize multiple replicas. Results with real applications show that ER-TCP incurs a small performance penalty on a web server under light load, and that it can be used to distribute files very efficiently and reliably.
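
The abstract does not give the details of the log mechanism, so the following is only a minimal sketch of the general idea of pairing replication with a log, not the authors' design. It assumes a single ordered byte stream from the client, replicas that acknowledge how far they have consumed it, and a log that keeps only the bytes some replica still needs, so a slow replica does not hold back a fast one. The class `ReplicatedStreamLog` and all of its method names are hypothetical.

```python
class ReplicatedStreamLog:
    """Toy log of client bytes shared by several replicas (hypothetical names)."""

    def __init__(self, replicas):
        self.base = 0              # stream offset of the first buffered byte
        self.buffer = bytearray()  # bytes not yet consumed by every replica
        self.acked = {name: 0 for name in replicas}  # consumed offset per replica

    def append(self, data: bytes) -> int:
        """Log incoming client bytes once; return the new end-of-stream offset."""
        self.buffer += data
        return self.base + len(self.buffer)

    def pending(self, replica: str) -> bytes:
        """Bytes this replica has not consumed yet (what it must replay)."""
        return bytes(self.buffer[self.acked[replica] - self.base:])

    def ack(self, replica: str, offset: int) -> None:
        """Record that a replica has consumed the stream up to `offset`."""
        end = self.base + len(self.buffer)
        if not self.acked[replica] <= offset <= end:
            raise ValueError("acknowledged offset out of range")
        self.acked[replica] = offset
        self._trim()

    def remove(self, replica: str) -> None:
        """Drop a failed replica so it no longer pins old bytes in the log."""
        del self.acked[replica]
        if self.acked:
            self._trim()

    def most_current(self) -> str:
        """Replica that has consumed the most data, a natural takeover candidate."""
        return max(self.acked, key=self.acked.get)

    def _trim(self) -> None:
        # Discard bytes that every remaining replica has already consumed.
        low = min(self.acked.values())
        del self.buffer[: low - self.base]
        self.base = low


log = ReplicatedStreamLog(["fast", "slow"])
end = log.append(b"GET /index.html")
log.ack("fast", end)           # the fast replica is done immediately
backlog = log.pending("slow")  # the slow replica can still replay everything
```

In this sketch the fast replica never waits for the slow one; the only cost of a lagging replica is the memory held by the untrimmed part of the log.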



Author information

Authors and Affiliations

Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, 430074, Wuhan, China
Zhiyuan Shao, Hai Jin, Bin Cheng & Wenbin Jiang

Corresponding author

Correspondence to Hai Jin.

About this article

Cite this article

Shao, Z., Jin, H., Cheng, B. et al. ER-TCP: an efficient TCP fault-tolerance scheme for cluster computing. J Supercomput 43, 127–145 (2008). https://doi.org/10.1007/s11227-007-0123-7

Issue Date: February 2008