Adaptive wormhole routing in tori with faults: A mathematical approach

doi:10.1016/j.simpat.2009.06.005

Simulation Modelling Practice and Theory

Volume 17, Issue 9, October 2009, Pages 1468-1484

https://doi.org/10.1016/j.simpat.2009.06.005

Abstract

Fault-tolerance in a communication network is defined as the ability of the network to effectively utilize its redundancy in the presence of faulty components (i.e., nodes or links). New integration technologies now enable the design of computing systems with hundreds and even thousands of independent processing elements which can cooperate on the solution of the same problem for a corresponding improvement in the execution time. However, as the number of processing units increases, concerns for reliability and continued operation of the system in the presence of failures must be addressed. Adaptive routing algorithms have been frequently suggested as a means of improving communication performance in large-scale massively parallel computers, Multiprocessor Systems-on-Chip (MP-SoCs), and peer-to-peer communication networks. Before such schemes can be successfully incorporated in networks, it is necessary to have a clear understanding of the factors which affect their performance potential. This paper proposes a novel analytical model to investigate the performance of five prominent adaptive routing algorithms in wormhole-switched 2-D tori fortified with an effective scheme suggested by Chalasani and Boppana [S. Chalasani, R.V. Boppana, Adaptive wormhole routing in tori with faults, IEE Proc. Comput. Digit. Tech. 42(6) (1995) 386–394], as an instance of a fault-tolerant method widely used in the literature to achieve high adaptivity and support inter-processor communications in parallel computers. The analytical approximations of the model are validated by comparing them with results obtained through simulation experiments.

Introduction

Computer networks comprise a large number of technologies, ranging from on-chip networks that span millimeters, such as processor-cache communication, to the world-spanning Internet. Interconnection networks traditionally belong to the smaller part of the range, from chip-to-chip communication to system area networks, and in particular serve as the communication medium for multiprocessors. Interconnection networks offer communication with high reliability, high throughput, and low latency, all being vital factors for closely cooperating units. Interconnection networks are represented by technologies such as AutoNet [2], ServerNet [3], Myrinet [4], InfiniBand [5], RapidIO [6], PCI-Express AS [7], and HyperTransport [8]. Ten Gigabit Ethernet [9] with link-level backpressure is also emerging; however, this technology implements a “soft” backpressure which does not guarantee the absence of packet loss. An interconnection network is defined by its topology, flow control, and routing. The topology is the pattern of network node interconnection via physical communication channels. The torus topology has become a popular interconnection architecture for constructing massively parallel computers. Many parallel systems adopt low-dimensional torus networks due to their low communication latency and high bandwidth [9], [10].
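To make the torus topology concrete, the following minimal sketch computes the neighbors of a node and the minimal hop count between two nodes in a k x k (2-D) torus. The function names and the coordinate convention (nodes addressed as (x, y) with 0 <= x, y < k) are assumptions made for illustration; they are not notation taken from the paper.

```python
# Illustrative sketch only: node addressing and function names are assumptions.

def torus_neighbors(x, y, k):
    """Return the four neighbors of node (x, y) in a k x k torus.

    Wrap-around links connect the edges, so every node has degree 4
    (for k >= 3).
    """
    return [
        ((x + 1) % k, y),  # +x direction
        ((x - 1) % k, y),  # -x direction
        (x, (y + 1) % k),  # +y direction
        (x, (y - 1) % k),  # -y direction
    ]


def ring_distance(a, b, k):
    """Minimal number of hops between positions a and b on a ring of k nodes."""
    d = abs(a - b)
    return min(d, k - d)


def torus_distance(src, dst, k):
    """Minimal hop count between two nodes: the sum of per-dimension ring distances."""
    return ring_distance(src[0], dst[0], k) + ring_distance(src[1], dst[1], k)
```

For example, in a 4 x 4 torus the corner nodes (0, 0) and (3, 3) are only two hops apart, because each dimension can be crossed through its wrap-around link.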

Flow control deals with the allocation of channel and buffer resources to packets as they proceed through the network. Since processors in a parallel computer network need to communicate with one another, efficient communication is essential to enhance the performance of the system. Wormhole switching (also widely known as wormhole routing [9], [10], [11]) has been dominant because of its low-latency communication, and it has been adopted by most contemporary massively parallel machines. In wormhole switching, a message is divided into a sequence of fixed-size units of data, called flits. If a communication channel transmits the first flit of a message, it must transmit all the remaining flits of the same message before transmitting flits of another message. Wormhole switching requires only small buffers in the routers through which messages are routed. It also makes message latency largely insensitive to the message distance in the network. The main drawback of wormhole switching is that blocked messages remain in the network, wasting channel bandwidth and blocking other messages. In order to reduce the impact of message blocking, physical channels may be split into virtual channels by providing a separate buffer for each virtual channel and by multiplexing the physical channel bandwidth. The use of virtual channels can increase throughput considerably by dynamically sharing the physical bandwidth among several messages [9], [10], [12], [13].
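The claim that wormhole switching makes latency largely insensitive to distance can be illustrated with a simple contention-free calculation. The sketch below is a textbook-style approximation, not the analytical model developed in the paper: it assumes one cycle per flit per hop, no blocking, and no routing delay, and the function names are hypothetical. Store-and-forward switching is included only as a point of comparison.

```python
# Contention-free approximation under stated assumptions:
# one cycle per flit per hop, no blocking, no routing-decision delay.

def wormhole_latency(message_flits, hops):
    """Cycles until the last flit arrives under wormhole switching.

    The header flit needs `hops` cycles to reach the destination; the
    remaining flits follow in a pipeline, one per cycle.
    """
    return hops + message_flits - 1


def store_and_forward_latency(message_flits, hops):
    """Cycles until the last flit arrives under store-and-forward switching.

    Every intermediate node receives the whole message before forwarding it.
    """
    return hops * message_flits


if __name__ == "__main__":
    for hops in (1, 4, 8):
        print(hops,
              wormhole_latency(32, hops),
              store_and_forward_latency(32, hops))
```

Under these assumptions a 32-flit message needs 32 cycles over one hop and 39 cycles over eight hops with wormhole switching, whereas the store-and-forward figure grows from 32 to 256 cycles. Blocking, which the paragraph above identifies as the main drawback of wormhole switching, is deliberately left out of this sketch.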

Routing and fault-tolerance in interconnection networks are issues belonging to the network layer of the OSI model [14]. The network layer deals with the end-to-end problem of moving packets from the source node to the final destination. The network layer has knowledge of the network topology and knows how to route the packets through the network. Routing algorithms are generally classified as being either deterministic or adaptive [9], [10], [11]. Deterministic routing is used in a variety of parallel computers because it is exceedingly simple and provides low latency and high bandwidth. However, deterministic routing has two significant disadvantages: poor performance under non-uniform traffic loads and poor fault-tolerance. Adaptive routing, on the other hand, provides alternative paths to route messages, thus avoiding congested regions in the network and increasing throughput.
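The difference between the two classes can be sketched for a 2-D torus as follows. This is a generic illustration of dimension-order (deterministic) routing versus minimal adaptive routing; it is not the Chalasani–Boppana scheme or any of the five algorithms studied in the paper, and it ignores virtual channels, deadlock avoidance, and faults. All names, the (dimension, direction) hop encoding, and the tie-breaking rule are assumptions.

```python
# Generic illustration only: no deadlock avoidance, no fault handling.

def signed_offset(a, b, k):
    """Shortest signed offset from a to b on a ring of k nodes.

    Assumption: when both directions are equally short, the + direction wins.
    """
    d = (b - a) % k
    return d if d <= k // 2 else d - k


def productive_hops(cur, dst, k):
    """All (dimension, direction) hops that bring the message closer to dst."""
    hops = []
    for dim in (0, 1):
        off = signed_offset(cur[dim], dst[dim], k)
        if off != 0:
            hops.append((dim, 1 if off > 0 else -1))
    return hops


def dimension_order_hop(cur, dst, k):
    """Deterministic choice: always correct dimension 0 first, then dimension 1."""
    hops = productive_hops(cur, dst, k)
    return hops[0] if hops else None


def adaptive_hop(cur, dst, k, is_busy):
    """Adaptive choice: take the first productive hop whose channel is not busy.

    `is_busy` is a caller-supplied predicate on (dimension, direction).
    Returns None if the message has arrived or every productive hop is busy.
    """
    for hop in productive_hops(cur, dst, k):
        if not is_busy(hop):
            return hop
    return None
```

With a congested +x channel at node (0, 0) and destination (1, 3) in a 4 x 4 torus, `dimension_order_hop` still returns the +x hop, while `adaptive_hop` can fall back to the -y hop, which is the behavior the paragraph above describes as avoiding congested regions.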

Fault-tolerance is important to keep the interconnection network functional in the presence of faults. Preferably, faults and changes should be tolerated at run-time without disrupting network operations. If one routing node or one link fails, the rest of the network should not be forced to halt. Preferably, the remaining nodes should be allowed continuous use of the interconnection network, suffering only the performance degradation caused by the fault. If a routing node or a link is inserted into the network, the network should incorporate the extra resource without requiring a stop in the communication between the communicating devices. Such hot-insertion of components simplifies incremental growth, which can also be considered a scalability issue. In a large network, it is therefore essential to design a fault-tolerant routing algorithm that can route messages in the presence of faults. Fault-tolerant routing for large-scale parallel computers has been the subject of extensive research in recent studies [1], [14], [15], [16], [17], [18], [19], [20], [21], [22].

Most network performance evaluation studies have been conducted by means of software simulation [1], [14], [15], [16], [17], [18], [19], [20], [21], [22]. Studying the relative performance merits of routing algorithms using simulation techniques is, however, limited by the excessive computation time required to run large simulations. Analytical modeling, in contrast, offers a cost-effective and versatile tool to carry out such a study, typically requiring a far lower computational load. This paper proposes a novel analytical model to assess the performance behavior of a number of prominent adaptive routing algorithms in wormhole-switched 2-D tori fortified with an efficient scheme suggested by Chalasani and Boppana [1], as an instance of routing methodology widely used in the literature to achieve high adaptivity and fault-tolerance capability in communication networks.

The remainder of the paper is organized as follows. Section 2 describes the context of this work, the torus topology, and the prominent adaptive wormhole routing algorithms used throughout the paper. Our model assumptions and the performance model are presented in Section 3. Section 4 compares the message latency predicted by the analytical model with that obtained through simulation experiments. Finally, Section 5 summarizes the results presented in the paper.

Section snippets

Preliminaries

This section explores the basis on which our analytical model is founded including the structure of a node in the underlying network, and routing schemes used in this study. The definitions in this section adhere to standard notation and definitions in wormhole-switched networks.

The analytic model

In this section, we derive an analytical model for fully adaptive wormhole-switched routing algorithms in a torus network. Our analysis focuses on the PHop, Pbc, Nbc, Duato-Pbc, and Duato-Nbc routing algorithms. However, the proposed model can be applied to other routing algorithms with slight alterations. The most important performance measure in our model is the average message latency in the network.

Simulation experiments

To further understand and evaluate the performance issues of the routing algorithms, we have developed an event-driven simulator at the flit-level. The simulator mimics the behavior of Chalasani–Boppana’s scheme in k-ary n-cube networks (multidimensional tori) with and without faults. The crossbar switch in the router allows multiple messages to traverse a node simultaneously. Virtual channels that have messages to transmit use the physical channel in a round robin manner. Each virtual channel

Conclusions

In this paper, we have presented a novel analytical model to capture the mean message latency of several wormhole-switched adaptive routing algorithms using an efficient scheme suggested by Chalasani and Boppana in 2-D tori with faults. We have proposed relevant mathematical models for five adaptive routing algorithms, namely PHop, Pbc, Nbc, Duato-Pbc, and Duato-Nbc. Simulation experiments have revealed that the message latencies predicted by the analytical model are in good agreement with

References (38)

J.T. Draper et al.
A comprehensive analytical model for wormhole routing in multicomputer systems
Journal of Parallel and Distributed Computing
(1994)
S. Chalasani et al.
Adaptive wormhole routing in tori with faults
IEE Proceedings-Computers and Digital Techniques
(1995)
M.D. Schroeder, et al., Autonet: a high-speed, self-configuring local area network using point-to-point links. SRC...
R. Horst, D. Garcia, Servernet SAN I/O architecture, in: Hot Interconnects V,...
N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, Wen-King Su, Myrinet: a...
InfiniBand architecture specification, InfiniBand Trade...
RapidIO, Trade Association RapidIO specifications, 2002....
PCI-SIG, PCI-Express, 2003....
HyperTransport Technology Consortium, HyperTransport I/O link specification, 2003....
W.J. Dally et al.
Principles and practices of interconnection networks
(2004)
J. Duato et al.
Interconnection Networks: An Engineering Approach
(2003)
P. Mohapatra
Wormhole routing techniques for directly connected multicomputer systems
ACM Computing Surveys
(1998)
W.J. Dally et al.
Deadlock-free message routing in multiprocessor interconnection networks
IEEE Transactions on Computers
(1987)
W.J. Dally
Virtual channel flow control
IEEE Transactions on Parallel and Distributed Systems
(1992)
I. Theiss, Modularity, routing and fault tolerance in interconnection networks, Ph.D. Thesis, Faculty of Mathematics...
C.L. Chen et al.
A fault-tolerant routing scheme for meshes with nonconvex faults
IEEE Transactions on Parallel and Distributed Systems
(2001)
J.-D. Shih
Fault-tolerant wormhole routing in torus networks with overlapped block faults
IEE Proceedings-Computers and Digital Techniques
(2003)
J. Zhou, F.C.M. Lau, Adaptive fault-tolerant wormhole routing with two virtual channels in 2D meshes, in: Proceedings...
J. Wu, Z. Jiang, On constructing the minimum orthogonal convex polygon in 2-D faulty meshes, IPDPS,...

Cited by (1)

The effects of traffic patterns on power consumption of torus-connected NoCs with faults
2009, International Conference on Scalable Computing and Communications - The 8th International Conference on Embedded Computing, ScalCom-EmbeddedCom 2009
