Upper bounds on the connection probability for 2-D meshes and tori

doi:10.1016/j.jpdc.2011.11.006

Journal of Parallel and Distributed Computing

Volume 72, Issue 2, February 2012, Pages 185-194

Abstract

The mesh is an important and popular interconnection network topology for large parallel computer systems. A mesh can be divided into submeshes to obtain upper bounds on its connection probability. Combinatorial techniques are used to obtain upper bounds on the connection probability for 2-D meshes that are tighter than the previously known ones. Simulation results for meshes of various sizes show that our upper bounds are close to the exact connection probability. The combinatorial methods and tools used in this paper can also be applied to study the connection probabilities of other networks.

Highlights

► We give a method to obtain upper bounds on the connection probability for meshes and tori.
► A mesh is divided into submeshes to reduce the complexity of the analysis.
► Combinatorial techniques are used, focusing mainly on single-node disconnections.
► Our new upper bounds for meshes are much tighter than the known ones.
► Simulation results show that our upper bounds are close to the exact probabilities.

Introduction

Many different topologies are used to model the architectures of large parallel computers, meshes and tori being among the most important and popular. A $d$-dimensional mesh has $k_{0} \times k_{1} \times \dots \times k_{d - 1}$ nodes, with $k_{i}$ nodes along dimension $i$ ($k_{i} \geq 2$). Two-dimensional and three-dimensional meshes, abbreviated as 2-D and 3-D meshes, respectively, have been studied by many researchers. Tori can be considered as meshes with wraparound connections that achieve vertex and edge symmetry. Being simple and scalable, meshes and tori have not only become the basic topologies for much theoretical research [19], but have also been adopted in many well-known commercial multicomputer systems. For example, 2-D mesh computers include the Intel Touchstone DELTA [27], Stanford DASH [16], MIT Alewife [1], and Goodyear Aerospace MPP [5]. 3-D meshes include the Blue Gene supercomputer [2], the Tera Computer System [3], and the MIT J-machine [21]. The Alpha 21364 is a 2-D torus and the Cray T3D is a 3-D torus [6].

A system is said to be fault tolerant if it can remain functional in the presence of faults [13]. A massively parallel computer system is complex, which increases the possibility of having some faulty components (e.g., processors/nodes or communication links/edges). Hence, the fault tolerance of a system plays a crucial role in its actual performance. One functionality criterion states that a system is functional if and only if there is a nonfaulty communication path between each pair of nonfaulty processors [23]. That is, if all nonfaulty nodes and links in the system form a connected graph, which can be used to simulate the entire system and mask the effects of faults, then the system is fault tolerant.

Some graph-theoretic concepts are used to develop deterministic measures of fault tolerance. Conventionally, network fault tolerance is defined as the maximum number of nodes that can fail without possibly disconnecting the network [22], a number that is one less than the connectivity of the network. Accordingly, the network fault tolerance of a regular graph topology with degree $n$ is at most $n - 1$. For a 2-D mesh, the network fault tolerance is only 1, because the failure of the two neighbors of a nonfaulty corner node disconnects that corner node from every other nonfaulty node (if any) in the system. This traditional measure of fault tolerance is quite conservative. Consequently, much effort has been devoted to introducing more realistic measures, and many measures, either deterministic or probabilistic, have been proposed over the years. Esfahanian [13] introduced the concept of forbidden faulty sets, that is, sets of components that cannot all be faulty at the same time. Esfahanian's results showed that an $n$-cube can tolerate up to $2 n - 3$ node failures and remain connected provided that, for each node $v$ in the network, the nodes directly connected to $v$ do not all fail at the same time. Latifi et al. [15] proposed the concept of $k$-safeness, based on the idea of forbidden faulty sets, and showed that a $k$-safe $n$-cube can tolerate up to $2^{k} (n - k) - 1$ faulty nodes. All of these results demonstrate that the traditional fault tolerance measure tends to underestimate the fault tolerance of large networks. Although many deterministic fault-tolerant routing algorithms for meshes and tori have been proposed [14], [7], [28], [12], little research has focused on deterministic fault tolerance measures of meshes or tori.
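As a quick worked comparison of the three figures quoted above, take a 10-cube. The values $n = 10$ and $k = 2$ are chosen here only for illustration and do not come from the cited papers:

$$
\begin{aligned}
\text{traditional measure:} \quad & n - 1 = 9, \\
\text{forbidden faulty sets [13]:} \quad & 2 n - 3 = 17, \\
\text{2-safe 10-cube [15]:} \quad & 2^{k} (n - k) - 1 = 2^{2} (10 - 2) - 1 = 31.
\end{aligned}
$$

Setting $k = 1$ in the $k$-safe formula gives $2 (n - 1) - 1 = 2 n - 3$, which agrees with the forbidden-faulty-set figure.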

Probabilistic measures use a random graph model to characterize a system. They evaluate the probability that the system is functional, assuming that nodes or links fail independently with known probabilities. Probabilistic measures of fault tolerance can often model the real world better than deterministic ones because the real world is too complex to be studied in a deterministic way. Several studies have been conducted on probabilistic measures. Najjar and Gaudiot [20] introduced network resilience, a probabilistic measure of network fault tolerance expressed in terms of the probability of a disconnection. However, the problem of computing the connection/disconnection probability of a general network has been proved to be NP-complete [11], [24]. In contrast, upper and lower bounds on the connection/disconnection probability can be computed more efficiently. For example, [26], [9], [8] studied lower bounds on the reliability of hypercubes, while Angel et al. [4], Chen et al. [10], and Liang et al. [17] investigated upper and lower bounds on the connection probability of meshes.

While the lower bound on the connection probability seems to be of more interest, the upper bound is also of great importance. First, to achieve a desired network connection probability, the individual node failure probability cannot exceed a threshold determined by the upper bound. Second, although the gap between the known lower and upper bounds is very large for large meshes, the upper bound still provides a reference for the corresponding lower bound. Finally, although simulations give good estimates of the connection probability, theoretical analysis gives guaranteed results.

Chen et al. [10] gave upper bounds on the connection probability for 2-D meshes based on rhombuses. A rhombus of a 2-D mesh is defined as a connected subgraph consisting of five nodes: one node at the center and four nodes adjacent to it, one in each direction. However, two important factors weaken the upper bounds determined in this fashion. First, Chen et al. [10] derived the upper bounds by considering disjoint rhombuses, and a rhombus is too small to be a good tool for studying the connection probability of meshes. Second, because the degree of a node on the boundary of a grid-based mesh is smaller than 4, such a node is more likely to become isolated than a node of degree 4 in the middle of the mesh. This factor is not considered in [10].

Motivated by these deficiencies, in this paper we study upper bounds on the connection probability for 2-D meshes and 2-D tori and obtain tighter upper bounds. To achieve better upper bounds, we treat nodes on the boundary and in the middle of a mesh separately, using combinatorial tools such as the Principle of Inclusion and Exclusion, and we divide the meshes into small submeshes instead of rhombuses. Simulation results show that the upper bounds we have derived are much closer to the exact connection probability than previously determined bounds.
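To illustrate how the Principle of Inclusion and Exclusion can yield an upper bound, consider the following simplified sketch. It is not the bound derived in this paper: it uses only the four corner nodes, and it assumes that each node fails independently with probability $p$ and that $m, n \geq 4$, so that the corners together with their neighbors form four disjoint node sets. Let $A_{i}$ be the event that corner $i$ is nonfaulty while both of its neighbors are faulty, and write $q = (1 - p) p^{2} = \Pr[A_{i}]$. Under these assumptions the four events are independent, so

$$
\Pr\Big[\bigcup_{i = 1}^{4} A_{i}\Big] = \binom{4}{1} q - \binom{4}{2} q^{2} + \binom{4}{3} q^{3} - \binom{4}{4} q^{4} = 1 - (1 - q)^{4}.
$$

If some $A_{i}$ occurs, the nonfaulty nodes can form a connected graph only when that corner is the sole nonfaulty node, which happens with probability at most $4 (1 - p) p^{m n - 1}$ over the four corners. Hence, in this simplified setting,

$$
\Pr[\text{the nonfaulty nodes of } M_{m \times n} \text{ are connected}] \leq (1 - q)^{4} + 4 (1 - p) p^{m n - 1}.
$$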

The main contribution of this paper is a set of combinatorial methods and tools for obtaining tighter upper bounds on the connection probability without overly complicated computation; the resulting bounds are much better than those in [10]. For most of the large meshes studied in [10], the improvements are large. The methods and tools are more important than the results themselves. When a new network is designed, one often wants to know its connection probability in order to judge whether it is a good design. Although such a network may be much more complicated than a mesh, the tools and methods used here to bound the connection probability of meshes may also apply to it, and may further be used to study the reliability of other objects.

To simplify our discussion, here we only consider node faults rather than link faults.

The rest of this paper is organized as follows. Section 2 overviews related work. Section 3 introduces the notations and the methods used in this paper. In Section 4, we give the main results of this paper, including some novel combinatorial approaches and new upper bounds on the connection probability for 2-D meshes. The extensive simulation results are presented in Section 5. Section 6 concludes the paper.

Related work

In this section, we survey related results and methods on the connection probability for meshes and other networks.

Najjar and Gaudiot [20] addressed the issue of network fault tolerance by looking at the probability of disconnection in a family of regular graph network topologies. A $k$-cluster in a regular network $G$ is defined to be a connected branch of $k$ nodes in $G$. A disconnection is caused by a $k$-cluster $C$ if all nodes in $C$ are nonfaulty but all other nodes adjacent to $C$ are faulty. In [20],

Notations and definitions

A mesh or a torus considered in this paper has $m n$ nodes, with $m$ nodes along the $x$ dimension and $n$ nodes along the $y$ dimension. Each node is identified by a coordinate pair $(x, y)$, where $x$ and $y$ are integers such that $1 \leq x \leq m$ and $1 \leq y \leq n$. Such a mesh or torus is denoted as $M_{m \times n}$ or $T_{m \times n}$, respectively.

Definition 3.1 [10]

In $M_{m \times n}$, two nodes $v = (x, y)$ and $v' = (x', y')$ are neighbors if and only if $| x - x' | + | y - y' | = 1$.
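The neighbor relation of Definition 3.1 can be sketched in a few lines of Python. The function names `mesh_neighbors` and `torus_neighbors` are hypothetical helpers introduced only for this illustration, and the torus version simply adds wraparound links to the mesh, as described in the Introduction.

```python
def mesh_neighbors(x, y, m, n):
    """Neighbors of node (x, y) in the mesh M_{m x n}, using 1-based coordinates."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    # Keep only the candidates that lie inside the m x n grid.
    return [(a, b) for a, b in candidates if 1 <= a <= m and 1 <= b <= n]


def torus_neighbors(x, y, m, n):
    """Neighbors of (x, y) in the torus T_{m x n}: the mesh plus wraparound links."""
    wrapped = {
        ((x - 2) % m + 1, y),  # step -1 along x, wrapping 1 -> m
        (x % m + 1, y),        # step +1 along x, wrapping m -> 1
        (x, (y - 2) % n + 1),  # step -1 along y, wrapping 1 -> n
        (x, y % n + 1),        # step +1 along y, wrapping n -> 1
    }
    wrapped.discard((x, y))    # guards against degenerate sizes such as m = 1
    return sorted(wrapped)


if __name__ == "__main__":
    print(len(mesh_neighbors(1, 1, 4, 5)))   # corner node
    print(len(mesh_neighbors(1, 3, 4, 5)))   # boundary, non-corner node
    print(len(mesh_neighbors(2, 3, 4, 5)))   # interior node
    print(torus_neighbors(1, 1, 4, 5))
```

In the mesh, the number of neighbors depends on whether the node is a corner, another boundary node, or an interior node, whereas every node of a torus with $m, n \geq 3$ has four neighbors.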

In $M_{m \times n}$ , nodes at the four corners are of degree 2, other nodes on the boundary are of degree 3, and the rest of the

Upper bounds on the connection probability for meshes and tori

One of the important ideas used in this paper is to analyze the connection probability for a mesh by dividing the mesh into some small blocks, which follows similar ideas in [9], [8], [10].

The following discussion is based on the analysis of a single block. Therefore, when analyzing a specific submesh $M_{s, t}$, each node is located by a local coordinate pair $(x, y)$, where $1 \leq x \leq s$ and $1 \leq y \leq t$.
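The relation between global and local coordinates can be sketched as follows. This is only an illustration under a simplifying assumption that is not stated in the text, namely that $s$ divides $m$ and $t$ divides $n$ so that the blocks tile the mesh exactly; the names `to_local` and `split_into_blocks` are hypothetical.

```python
def to_local(x, y, s, t):
    """Map a global 1-based node (x, y) to (block index, local 1-based coordinates).

    The block index is 0-based; the local pair satisfies 1 <= x <= s, 1 <= y <= t.
    """
    block = ((x - 1) // s, (y - 1) // t)
    local = ((x - 1) % s + 1, (y - 1) % t + 1)
    return block, local


def split_into_blocks(m, n, s, t):
    """Partition M_{m x n} into disjoint s x t submeshes.

    Assumes s divides m and t divides n (an illustrative simplification).
    Returns a dict mapping each block index to the list of its global nodes.
    """
    if m % s != 0 or n % t != 0:
        raise ValueError("this sketch requires s to divide m and t to divide n")
    blocks = {}
    for x in range(1, m + 1):
        for y in range(1, n + 1):
            block, _ = to_local(x, y, s, t)
            blocks.setdefault(block, []).append((x, y))
    return blocks


if __name__ == "__main__":
    blocks = split_into_blocks(8, 6, 4, 3)
    print(len(blocks), {len(nodes) for nodes in blocks.values()})
    print(to_local(5, 3, 4, 3))
```

Once a node has been mapped to its block, the per-block analysis can work entirely in local coordinates, independently of where the block sits in the full mesh.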

Simulation results and discussions

Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results [25]. Since each node in a 2-D mesh $M_{m \times n}$ fails independently with probability $p$, Monte Carlo methods are used in this section to estimate the connection probability of mesh networks, and the simulation results are used to evaluate the upper bounds obtained in Section 4.
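A minimal Monte Carlo sketch of this kind of estimate is given below. It is not the simulator used in the paper; the function names, the number of trials, the fixed seed, and the convention that zero or one nonfaulty node counts as connected are all assumptions made for this illustration.

```python
import random
from collections import deque


def is_connected(alive):
    """True if the nonfaulty nodes in `alive` induce a connected subgraph.

    Assumption: zero or one nonfaulty node counts as connected.
    """
    if len(alive) <= 1:
        return True
    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over nonfaulty nodes only
        x, y = queue.popleft()
        for nb in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if nb in alive and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(alive)


def estimate_connection_probability(m, n, p, trials=10000, seed=0):
    """Estimate the connection probability of M_{m x n} with node failure probability p."""
    rng = random.Random(seed)
    connected = 0
    for _ in range(trials):
        # Each node fails independently with probability p.
        alive = {
            (x, y)
            for x in range(1, m + 1)
            for y in range(1, n + 1)
            if rng.random() >= p
        }
        connected += is_connected(alive)
    return connected / trials


if __name__ == "__main__":
    print(estimate_connection_probability(10, 10, 0.05, trials=2000))
```

The estimate is the fraction of sampled fault patterns whose nonfaulty nodes are connected, so its accuracy improves as the number of trials grows.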

A simulator was designed to generate meshes $M_{m \times n}$ with node failure probability $p$, and then judge

Conclusions and future work

Meshes and tori are among the most important and popular interconnection topologies for modeling the architectures of multicomputer systems. To understand the performance of these systems, fault tolerance should be taken into account. However, the fault tolerance analysis of mesh networks is complicated, especially for large-scale networks. Fortunately, dividing a mesh into several submeshes and analyzing the connection probability of these submeshes significantly reduces the complexity.

Acknowledgments

The authors thank the reviewers for their instructive comments. This work is partially supported by the NSF of China (61064002), by Guangxi Natural Science Foundation (2011GXNSFA018142, 0991027, 0991061), by the Scientific Research Foundation of GuangXi University (X081059), the Sichuan Youth Science & Technology Foundation (2010JQ0032), and the Chengdu University School Foundation (2010XJZ27).

Meilian Liang received her BSc degree and MSc degree in computer science from Guangxi University in 2002 and 2005, respectively. She is currently a lecturer of Information Science at Guangxi University, Nanning. Her research interests include data mining and algorithm design.

References (28)

J. Chen et al., Probabilistic analysis on mesh network fault tolerance, Journal of Parallel and Distributed Computing (2007)

S. Kim et al., Fault-tolerant wormhole routing in mesh with overlapped solid fault regions, Parallel Computing (1997)

A. Agarwal et al., The MIT Alewife machine: architecture and performance, ISCA (1995)

F.E. Allen et al., Blue Gene: a vision for protein science using a petaflop supercomputer, IBM Systems Journal (2001)

R. Alverson et al., The Tera Computer System, ICS (1990)

O. Angel et al., Routing complexity of faulty networks, Random Structures and Algorithms (2008)

K. Batcher, Design of a massively parallel processor, IEEE Transactions on Computers (1980)

C. Chen et al., A fault-tolerant routing scheme for meshes with nonconvex faults, IEEE Transactions on Parallel and Distributed Systems (2001)

J. Chen et al., Hypercube network fault tolerance: a probabilistic approach, Proceedings of the International Conference on Parallel Processing (ICPP'02) (2002)

J. Chen et al., Locally subcube-connected hypercube networks: theoretical analysis and experimental results, IEEE Transactions on Computers (2002)

C. Colbourn, The Combinatorics of Network Reliability (1987)

X. Dong et al., Practical deadlock-free fault-tolerant routing in meshes based on the planar network fault model, IEEE Transactions on Computers (2009)

A. Esfahanian, Generalized measures of fault tolerance with application to $n$-cube networks, IEEE Transactions on Computers (1989)

Xiaodong Xu received his BSc degree in mathematics from Harbin Normal University in 1995, and his MSc degree in applied mathematics from National Defense Science and Technique University in 2002. He is currently a research professor of Information Science at Guangxi Academy of Science, Nanning. His research interests include combinatorics, information theory, and number theory.

Jiarong Liang received his Doctorate in Engineering from South China University of Technology in 1998. He is currently a professor of Computer Science at Guangxi University, Nanning. His research interests include control theory and computer science.

Zehui Shao received his BSc degree and MSc degree in Engineering from Chengdu University of Technique in 2002 and 2005, respectively. He received his Doctorate in Science from Huazhong University of Science and Technology in 2009. He is currently a lecturer of Computer Science in Chengdu University, Chengdu. His research interests include combinatorics and algorithm design.
