Abstract
This paper proposes a novel scheme, named ER-TCP, which transparently masks failures of server nodes in a cluster from clients at TCP connection granularity. In this scheme, TCP connections at the server side are actively and fully replicated to remain consistent, so that they can be transplanted to the healthy parts of the cluster when a failure occurs. A log mechanism is designed to cooperate with the replication so that the penalty on communication performance stays small and the scheme scales beyond a few nodes, even when the nodes have different processing capacities. We built a prototype system with ER-TCP on a four-node cluster and conducted a series of experiments on it. The experimental results show that ER-TCP imposes a relatively small penalty on communication performance, especially when it is used to synchronize multiple replicas. Results with real applications show that ER-TCP incurs a small performance cost on a web server at light load, and that it can be used to distribute files very efficiently and reliably.
Cite this article
Shao, Z., Jin, H., Cheng, B. et al. ER-TCP: an efficient TCP fault-tolerance scheme for cluster computing. J Supercomput 43, 127–145 (2008). https://doi.org/10.1007/s11227-007-0123-7
DOI: https://doi.org/10.1007/s11227-007-0123-7