Fault-tolerant routing for complete Josephus Cubes
Introduction
The Josephus Cube is a recently proposed novel interconnection network that has improved topological properties and exhibits better embedding and communications performance than the Binary Hypercube and several of its variants [6], [13]. Its link-augmented form, Complete Josephus Cubes, can also be applied as node clusters in an optical-based architecture (Fig. 1) suitable for large-scale hierarchical networks [7], [12], [15], [16].
In recent years, the design of large-scale parallel processing systems has moved towards interconnecting larger and more complex node clusters, with each cluster containing several nodes interconnected in a specified topology. Several such clusters may be interconnected with each other via inter-cluster channels in a similar or different topology [17], [18], [19]. Examples of such systems include the SGI-Cray Origin 2000 [7] and the Pleiades Alpha 4100 Compute Cluster [22] at the Massachusetts Institute of Technology.
There are several benefits to be derived from this design approach. Each hierarchical network layer can be constructed with the technology most suited to that layer’s requirements. Clustered networks can offer system upgrades on a node-cluster basis, improving overall network scalability. For implementing fault-tolerant, large-scale parallel processing systems, more fault-tolerant node clusters may be added to inter-cluster channels. Each node cluster is inherently responsible for its own fault tolerance. This has the advantage of maintaining the fault tolerance of the entire hierarchical network when the number of node clusters increases.
In this paper, we describe a cost-effective fault-tolerant strategy that includes a fault-tolerant routing algorithm with supporting routing hardware. The fault-tolerant routing algorithm guarantees message delivery in the presence of non-functional network components (nodes and links). At the same time, it does not incur undue overheads under fault-free conditions. Routes generated are both deadlock-free and livelock-free. Finally, routing is adaptive and may be supported by off-the-shelf routing hardware requiring minimal CPU intervention.
The rest of the paper is organised as follows. Section 2 gives a brief introduction to the Complete Josephus Cube to provide continuity in comprehending this paper. Section 3 provides the terminology, definitions, assumptions for the routing sections as well as a survey of existing work. Section 4 develops the fault-tolerant routing algorithm and illustrates its application with some examples. Section 5 develops the proofs of path distance bounds, deadlock-freedom and livelock-freedom. Section 6 presents a router design that is simple and practical. Finally, Section 7 concludes the paper.
Section snippets
The complete Josephus Cube
Fig. 2 shows a Complete Josephus Cube with Complementary (C) links, Josephus (J) links, and Hamming (H) links. The Complete Josephus Cube (CJC) is a link-augmented Josephus Cube, admitting network sizes in powers of 2. Formally, a dimension r (for r > 2) CJC(N) of size N = 2^r is defined as an undirected graph G = (V, E), where the set of nodes is V = {u ∣ 0 ⩽ u < 2^r} and the set of links is E = E_h ∪ E_j ∪ E_c. For any x, y ∈ V:
〈x, y〉 ∈ E_h (H links) iff H(x, y) = 1, where H(x, y) is the Hamming distance between x and y.
〈x, y〉 ∈ E_c (C
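The H-link rule stated above can be illustrated with a short Python sketch. It covers only the Hamming links, because the J-link and C-link definitions are cut off in this excerpt. The function names `hamming_distance` and `h_neighbours` are illustrative and do not come from the paper.

```python
def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which the labels x and y differ."""
    return bin(x ^ y).count("1")


def h_neighbours(u: int, r: int) -> list[int]:
    """Nodes joined to u by an H link in a dimension-r cube with 2**r nodes.

    Only Hamming links are modelled here; J and C links are not.
    """
    if not 0 <= u < 2 ** r:
        raise ValueError("node label out of range")
    # Flipping exactly one of the r label bits gives Hamming distance 1.
    return [u ^ (1 << i) for i in range(r)]


r = 3
for u in range(2 ** r):
    assert all(hamming_distance(u, v) == 1 for v in h_neighbours(u, r))
print(h_neighbours(0b101, r))  # [4, 7, 1]
```

Each node therefore has exactly r Hamming neighbours; the full CJC degree is higher once the J and C links are added.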
Related work
Important objectives in the design of a fault-tolerant routing strategy include scalability, reliability and efficiency [1], [2], [3], [8], [9], [10], [11]. Achieving these objectives, however, requires some information about the network status. The amount of information must be kept small and maintained efficiently so as not to appreciably degrade performance. With the advancements in reliability achieved in contemporary and next generation chip technologies, the ability to
Fault-tolerant routing algorithm
In the fault-tolerant routing algorithm, FTROUTE( ), ∧ is the logical-AND operator and ∣V∣ returns the number of 1-bits in V. The function OneBitPos(a, F(w)) returns the first 1-bit position in a corresponding to the reliable node specified by F(w), where w is the current node. For example, OneBitPos(3, 10110100) returns the value 2 since the first 1-bit occurs in bit position 2. Recall that the input link vector, I, has all (r + 2) bits set at the source node. Note also that enabled fault-free J and C
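The two bit-vector primitives mentioned above can be sketched in Python as follows. This is a hedged illustration only: it assumes bit positions are counted from 0 at the least significant bit, which is consistent with the worked example (10110100 gives 2), and it omits the first argument of OneBitPos because its role is not recoverable from this excerpt. The names `count_one_bits` and `first_one_bit_pos`, and the -1 return value for an all-zero vector, are assumptions rather than the paper's definitions.

```python
def count_one_bits(v: int) -> int:
    """Counterpart of |V|: the number of 1-bits in the bit vector v."""
    return bin(v).count("1")


def first_one_bit_pos(v: int) -> int:
    """Position of the lowest-order 1-bit in v, counting from 0 at the LSB.

    Returns -1 for an all-zero vector (an assumption of this sketch).
    """
    if v == 0:
        return -1
    # v & -v isolates the lowest set bit; its bit length minus 1 is the index.
    return (v & -v).bit_length() - 1


vector = 0b10110100
print(first_one_bit_pos(vector))  # 2
print(count_one_bits(vector))     # 4
```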
Analysis of fault-tolerant routing algorithm
In this section, we analyse the fault-tolerant routing algorithm, FTROUTE( ), proposed in Section 4 and develop the proofs of fault tolerance limits, distance bounds, deadlock-freedom and livelock-freedom. Formally, we define the fault-tolerant routing algorithm as a function FTR(u, v) = {〈w_k, w_{k+1}〉 ∣ 〈w_k, w_{k+1}〉 ∈ E for all 0 ⩽ k < p with w_0 = u and w_p = v}. That is, FTR(u, v) generates a p node fault-free path from u to v, inclusive. Let H(u, v) be the Hamming distance between u and v. Let ∣a∣ denote the number
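The path condition in the definition of FTR(u, v) can be checked mechanically. The sketch below is not the paper's algorithm: it only validates a candidate node sequence against the stated conditions (endpoints u and v, every consecutive pair is a link, no faulty component is used). The link test is passed in as a predicate because only H links are defined in this excerpt; `is_fault_free_path` and `h_link` are hypothetical names.

```python
from typing import Callable, Collection, FrozenSet, Sequence


def is_fault_free_path(
    path: Sequence[int],
    u: int,
    v: int,
    is_link: Callable[[int, int], bool],
    faulty_nodes: Collection[int] = (),
    faulty_links: Collection[FrozenSet[int]] = (),
) -> bool:
    """True if path runs from u to v over existing, fault-free nodes and links."""
    if not path or path[0] != u or path[-1] != v:
        return False
    if any(w in faulty_nodes for w in path):
        return False
    for a, b in zip(path, path[1:]):
        # Each hop must be a real link and must not be marked as faulty.
        if not is_link(a, b) or frozenset((a, b)) in faulty_links:
            return False
    return True


def h_link(x: int, y: int) -> bool:
    """Hamming-link predicate: labels differ in exactly one bit."""
    return bin(x ^ y).count("1") == 1


print(is_fault_free_path([0, 1, 3, 7], 0, 7, h_link))                    # True
print(is_fault_free_path([0, 1, 3, 7], 0, 7, h_link, faulty_nodes={3}))  # False
```

A predicate covering J and C links could be substituted for `h_link` once their definitions are available.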
Fault-tolerant routing hardware
In this section, we present a router design with standard gates. The router schematic (not to scale) is shown in Fig. 6. The simplicity of the design facilitates chip implementation. A similar 8-link router prototype, with traffic control, has been implemented on a single Xilinx 4000 series Field Programmable Gate Array (FPGA) for fault-tolerant routing on meshes [14].
During minimal or non-minimal routing, Rv(u) or respectively, is ANDed (denoted by gates marked ∧) with the input link
Conclusion
We have presented the design and analysis of a cost-effective fault-tolerant routing strategy for the Complete Josephus Cube. The fault-tolerant routing strategy can tolerate up to (r + 1) encountered faults in a dimension r CJC cluster, while remaining deadlock-free and livelock-free. It is guaranteed to deliver the message in not more than (2r + 1) hops under stated fault conditions. The message overhead incurred is very low: each message header comprises only a single r-bit dimensions-traversed
Acknowledgement
We would like to thank the anonymous referees whose comments and suggestions have greatly improved the organization, clarity and content of this paper.
References (22)
- Artificial intelligence search techniques as fault-tolerant routing strategies, Parallel Computing (1996)
- et al., The Josephus cube: a novel interconnection network, Parallel Computing (2000)
- Development of supercomputers in Japan: hardware and software, Parallel Computing (1999)
- et al., Depth-first search approach for fault-tolerant routing in hypercube multicomputers, IEEE Transactions on Parallel and Distributed Systems (1990)
- et al., Adaptive fault-tolerant routing in hypercube multicomputers, IEEE Transactions on Computers (1990)
- et al., Use of routing capability for fault-tolerant routing in hypercube multicomputers, IEEE Transactions on Computers (1997)
- Technology 1999 analysis and forecast: solid state, IEEE Spectrum (1999)
- J.M. Gordon, Q.F. Stout, Hypercube message routing in the presence of faults, in: Proceedings of the Second Symposium...
- et al., Linear recursive networks and their applications in distributed systems, IEEE Transactions on Parallel and Distributed Systems (1997)
- et al., A nationwide parallel computing environment, Communications of the ACM (1997)
- An adaptive fault-tolerant routing algorithm for hypercube multicomputers, IEEE Transactions on Parallel and Distributed Systems