Elsevier

Parallel Computing

Volume 30, Issues 9–10, September–October 2004, Pages 1151-1167

Fault-tolerant routing for complete Josephus Cubes

https://doi.org/10.1016/j.parco.2004.07.003

Abstract

This paper proposes and analyses a fault-tolerant routing system for the Complete Josephus Cube [Parallel Computing 26 (2000) 427]. For a Complete Josephus Cube of dimension r, the routing strategy tolerates at least (r + 1) and up to 2(r − 1) encountered component faults. Routes generated are free from cycles and guarantee message delivery. Within stated fault conditions, the message is routed within a maximum of 2(r − 1) hops. The message overhead incurred is low for the specified fault tolerance level. Finally, our router design shows that hardware support may be simply achieved with practical components, facilitating integration with the host network.

Introduction

The Josephus Cube is a recently proposed novel interconnection network that has improved topological properties and exhibits better embedding and communications performance than the Binary Hypercube and several of its variants [6], [13]. Its link-augmented form, Complete Josephus Cubes, can also be applied as node clusters in an optical-based architecture (Fig. 1) suitable for large-scale hierarchical networks [7], [12], [15], [16].

In recent years, the design of large-scale parallel processing systems has moved towards interconnecting larger and more complex node clusters, with each cluster containing several nodes interconnected in a specified topology. Several such clusters may be interconnected via inter-cluster channels in a similar or different topology [17], [18], [19]. Examples of such systems include the SGI-Cray Origin 2000 [7] and the Pleiades Alpha 4100 Compute Cluster [22] at the Massachusetts Institute of Technology.

There are several benefits to be derived from this design approach. Each hierarchical network layer can be constructed with the technology most suited to that layer’s requirements. Clustered networks can offer system upgrades on a node-cluster basis, improving overall network scalability. For implementing fault-tolerant, large-scale parallel processing systems, more fault-tolerant node clusters may be added to inter-cluster channels. Each node cluster is inherently responsible for its own fault tolerance. This has the advantage of maintaining the fault tolerance of the entire hierarchical network when the number of node clusters increases.

In this paper, we describe a cost-effective fault-tolerant strategy that comprises a fault-tolerant routing algorithm and supporting routing hardware. The fault-tolerant routing algorithm guarantees message delivery in the presence of non-functional network components (nodes and links). At the same time, it does not incur undue overheads under fault-free conditions. Routes generated are both deadlock-free and livelock-free. Finally, routing is adaptive and may be supported by off-the-shelf routing hardware requiring minimal CPU intervention.

The rest of the paper is organised as follows. Section 2 gives a brief introduction to the Complete Josephus Cube to provide continuity in comprehending this paper. Section 3 provides the terminology, definitions, assumptions for the routing sections as well as a survey of existing work. Section 4 develops the fault-tolerant routing algorithm and illustrates its application with some examples. Section 5 develops the proofs of path distance bounds, deadlock-freedom and livelock-freedom. Section 6 presents a router design that is simple and practical. Finally, Section 7 concludes the paper.

Section snippets

The complete Josephus Cube

Fig. 2 shows a Complete Josephus Cube with Complementary (C) links, Josephus (J) links, and Hamming (H) links. The Complete Josephus Cube (CJC) is a link-augmented Josephus Cube, admitting network sizes in powers of 2. Formally, a dimension r (for r > 2) CJC(N) of size N = 2^r is defined as an undirected graph G = (V, E), where the set of nodes is V = {u ∣ 0 ≤ u < 2^r} and the set of links is E = E_h ∪ E_j ∪ E_c. For any x, y ∈ V:

〈x, y〉 ∈ E_h (H links) iff H(x, y) = 1, where H(x, y) is the Hamming distance between x and y.

〈x, y〉 ∈ E_c (C
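
The H-link condition, H(x, y) = 1, can be illustrated with a minimal Python sketch. It covers H links only, since the J-link and C-link definitions are cut off in this snippet. The function names hamming_distance and h_neighbours are illustrative and do not come from the paper.

```python
def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which node labels x and y differ."""
    return bin(x ^ y).count("1")


def h_neighbours(u: int, r: int) -> list[int]:
    """Nodes joined to u by an H link in a dimension-r cube.

    Node labels are assumed to be the integers 0 .. 2**r - 1, so flipping
    each of the r label bits in turn yields every node at Hamming distance 1.
    """
    return [u ^ (1 << i) for i in range(r)]


if __name__ == "__main__":
    r = 3
    u = 0b101
    for v in h_neighbours(u, r):
        # Every H neighbour differs from u in exactly one bit.
        assert hamming_distance(u, v) == 1
    print(h_neighbours(u, r))  # [4, 7, 1]
```

Under this sketch each node has exactly r H links; the additional J and C links of the CJC are not modelled.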

Related work

Important objectives in the design of a fault-tolerant routing strategy include scalability, reliability and efficiency [1], [2], [3], [8], [9], [10], [11]. Achieving these objectives, however, requires some information about the network status to be known. The amount of information must be kept small and maintained efficiently so as not to appreciably degrade performance. With the advancements in reliability achieved in contemporary and next-generation chip technologies, the ability to

Fault-tolerant routing algorithm

In the fault-tolerant routing algorithm, FTROUTE( ), ∧ is the logical-AND operator and ∣V∣ returns the number of one bits in V. The function OneBitPos(a, F(w)) returns the first 1-bit position in a corresponding to the reliable node specified by F(w), where w is the current node. For example, OneBitPos(3, 10110100) returns the value 2 since the first 1-bit occurs in bit position 2. Recall that the input link vector, I, has all (r + 2) bits set at the source node. Note also that enabled fault-free J and C
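
The bit-vector primitives named above can be sketched in Python as follows. This is only an illustration under stated assumptions: bit positions are counted from 0 at the least significant bit, which agrees with the 10110100 example, and first_one_bit_pos takes a single vector rather than reproducing the two-argument OneBitPos(a, F(w)), whose full semantics are not given in this snippet. All function names are hypothetical.

```python
def count_ones(v: int) -> int:
    """The |V| operation: number of 1 bits in the bit vector v."""
    return bin(v).count("1")


def first_one_bit_pos(v: int) -> int:
    """Position of the lowest-order 1 bit of v, or -1 if v has no 1 bits.

    Assumption: positions are numbered from 0 at the least significant bit.
    """
    if v == 0:
        return -1
    # v & -v isolates the lowest set bit; bit_length() gives its index + 1.
    return (v & -v).bit_length() - 1


def initial_input_link_vector(r: int) -> int:
    """Input link vector I at the source node: all (r + 2) bits set."""
    return (1 << (r + 2)) - 1


if __name__ == "__main__":
    print(first_one_bit_pos(0b10110100))      # 2
    print(count_ones(0b10110100))             # 4
    print(bin(initial_input_link_vector(3)))  # 0b11111
```

Masking a candidate vector with the logical AND (for example, `candidates & I` in Python) would then leave only those links that are set in both vectors.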

Analysis of fault-tolerant routing algorithm

In this section, we analyse the fault-tolerant routing algorithm, FTROUTE( ), proposed in Section 4 and develop the proofs of fault tolerance limits, distance bounds, deadlock-freedom and livelock-freedom. Formally, we define the fault-tolerant routing algorithm as a function FTR(u, v) = {〈w_k, w_{k+1}〉 ∣ 〈w_k, w_{k+1}〉 ∈ E for all 0 ≤ k < p with w_0 = u and w_p = v}. That is, FTR(u, v) generates a p node fault-free path from u to v, inclusive. Let H(u, v) be the Hamming distance between u and v. Let ∣a∣ denote the number

Fault-tolerant routing hardware

In this section, we present a router design with standard gates. The router schematic (not to scale) is shown in Fig. 6. The simplicity of the design facilitates chip implementation. A similar 8-link router prototype, with traffic control, has been implemented on a single Xilinx 4000 series Field Programmable Gate Array (FPGA) for fault-tolerant routing on meshes [14].

During minimal or non-minimal routing, R_v(u) or R̄_v(u), respectively, is ANDed (denoted by gates marked ∧) with the input link

Conclusion

We have presented the design and analysis of a cost-effective fault-tolerant routing strategy for the Complete Josephus Cube. The fault-tolerant routing strategy can tolerate up to (r + 1) encountered faults in a dimension r CJC cluster, while remaining deadlock-free and livelock-free. It is guaranteed to deliver the message in not more than (2r + 1) hops under stated fault conditions. The message overhead incurred is very low: each message header comprises only a single r-bit dimensions-traversed

Acknowledgement

We would like to thank the unknown referees whose comments and suggestions have greatly improved the organization, clarity and content of this paper.

References (22)

  • P.K.K. Loh, Artificial intelligence search techniques as fault-tolerant routing strategies, Parallel Computing (1996)
  • P.K.K. Loh et al., The Josephus cube: a novel interconnection network, Parallel Computing (2000)
  • Y. Oyanagi, Development of supercomputers in Japan: hardware and software, Parallel Computing (1999)
  • M.-S. Chen et al., Depth-first search approach for fault-tolerant routing in hypercube multicomputers, IEEE Transactions on Parallel and Distributed Systems (1990)
  • M.-S. Chen et al., Adaptive fault-tolerant routing in hypercube multicomputers, IEEE Transactions on Computers (1990)
  • G.-M. Chiu et al., Use of routing capability for fault-tolerant routing in hypercube multicomputers, IEEE Transactions on Computers (1997)
  • L. Geppert, Technology 1999 analysis and forecast: solid state, IEEE Spectrum (1999)
  • J.M. Gordon, Q.F. Stout, Hypercube message routing in the presence of faults, in: Proceedings of the Second Symposium...
  • W.J. Hsu et al., Linear recursive networks and their applications in distributed systems, IEEE Transactions on Parallel and Distributed Systems (1997)
  • K. Kennedy et al., A nationwide parallel computing environment, Communications of the ACM (1997)
  • Y. Lan, An adaptive fault-tolerant routing algorithm for hypercube multicomputers, IEEE Transactions on Parallel and Distributed Systems (1995)