Upper bounds on the connection probability for 2-D meshes and tori

https://doi.org/10.1016/j.jpdc.2011.11.006

Abstract

The mesh is an important and popular interconnection network topology for large parallel computer systems. A mesh can be divided into submeshes to obtain upper bounds on the connection probability for the mesh. Combinatorial techniques are used to obtain upper bounds on the connection probability for 2-D meshes that are tighter than the previously known upper bounds. Simulation results for meshes of various sizes show that our upper bounds are close to the exact connection probability. The combinatorial methods and tools used in this paper can also be used to study the connection probabilities of other networks.

Highlights

► We give a method for obtaining upper bounds on the connection probability for meshes and tori.
► A mesh is divided into submeshes to reduce the complexity of the analysis.
► Combinatorial techniques are used, focusing mainly on single-node disconnections.
► Our new upper bounds for meshes are much tighter than the known ones.
► Simulation results show that our upper bounds are close to the exact probabilities.

Introduction

Many different topologies are used to model the architectures of large parallel computers, the most important and popular being meshes and tori. A $d$-dimensional mesh has $k_0 \times k_1 \times \cdots \times k_{d-1}$ nodes, with $k_i$ nodes along dimension $i$ ($k_i \ge 2$). Two-dimensional and three-dimensional meshes, abbreviated as 2-D and 3-D meshes, respectively, have been studied by many researchers. Tori can be considered as meshes with wraparound connections that achieve vertex and edge symmetry. Being simple and scalable, meshes and tori have not only become the basic topologies for much theoretical research [19], but have also been adopted in many well-known commercial multicomputer systems. For example, 2-D mesh computers include the Intel Touchstone DELTA [27], Stanford DASH [16], MIT Alewife [1], and Goodyear Aerospace MPP [5]. 3-D meshes include the Blue Gene supercomputer [2], the Tera Computer System [3], and the MIT J-Machine [21]. The Alpha 21364 is a 2-D torus and the Cray T3D is a 3-D torus [6].

A system is said to be fault tolerant if it can remain functional in the presence of faults [13]. A massively parallel computer system is complex, which increases the likelihood that some components (e.g., processors/nodes or communication links/edges) are faulty. Hence, the fault tolerance of a system plays a crucial role in its actual performance. One functionality criterion states that a system is functional if and only if there is a nonfaulty communication path between each pair of nonfaulty processors [23]. That is, if all nonfaulty nodes and links in the system form a connected graph, which can be used to simulate the entire system and mask the effects of faults, then the system is fault tolerant.

Some graph-theoretic concepts are used to develop deterministic measures of fault tolerance. Conventionally, network fault tolerance is defined as the maximum number of nodes that can fail without possibly disconnecting the network [22], a number that is one less than the connectivity of the network. Accordingly, the network fault tolerance of a regular graph topology with degree $n$ is at most $n-1$. For a 2-D mesh, the network fault tolerance is only 1, because the failure of the two neighboring nodes of a nonfaulty corner node would disconnect the corner node from every other nonfaulty node (if any) of the system. This traditional measure of fault tolerance is quite conservative. Consequently, much effort has been devoted to introducing more realistic measures, and many measures, either deterministic or probabilistic, have been proposed over the years. Esfahanian [13] introduced the concept of forbidden faulty sets, that is, sets of components that cannot all be faulty at the same time. Esfahanian's results showed that an $n$-cube can tolerate up to $2n-3$ node failures and remain connected provided that, for each node $v$ in the network, the nodes directly connected to $v$ do not all fail at the same time. Latifi et al. [15] proposed the concept of $k$-safeness, based on the idea of forbidden faulty sets, and showed that a $k$-safe $n$-cube can tolerate up to $2^k(n-k)-1$ faulty nodes. All of these results demonstrate that the traditional fault tolerance measure tends to underestimate the fault tolerance of large networks. Although many deterministic fault-tolerant routing algorithms for meshes and tori have been proposed [14], [7], [28], [12], little research has focused on deterministic fault tolerance measures for meshes or tori.
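The claim above that the network fault tolerance of a 2-D mesh is only 1 can be checked by brute force on a small mesh. The following is a minimal sketch and is not taken from the paper: the function names `is_connected` and `tolerates` are hypothetical, nodes use 1-based coordinates, and a network with zero or one nonfaulty node is assumed to count as connected.

```python
from itertools import combinations


def is_connected(m, n, faulty):
    """True if the nonfaulty nodes of the m x n mesh form a connected graph."""
    all_nodes = {(x, y) for x in range(1, m + 1) for y in range(1, n + 1)}
    alive = all_nodes - set(faulty)
    if len(alive) <= 1:
        return True  # assumed convention: zero or one node counts as connected
    start = next(iter(alive))
    seen, stack = {start}, [start]
    while stack:  # depth-first search restricted to nonfaulty nodes
        x, y = stack.pop()
        for nb in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if nb in alive and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(alive)


def tolerates(m, n, f):
    """True if no choice of exactly f faulty nodes disconnects the mesh."""
    nodes = [(x, y) for x in range(1, m + 1) for y in range(1, n + 1)]
    return all(is_connected(m, n, fs) for fs in combinations(nodes, f))


if __name__ == "__main__":
    print(tolerates(4, 4, 1))                     # every single failure is tolerated
    print(is_connected(4, 4, [(1, 2), (2, 1)]))   # corner (1, 1) is cut off
```

The exhaustive search grows combinatorially with the mesh size, so it is only meant to illustrate the definition on small instances.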

Probabilistic measures use a random graph model to characterize a system. These measures evaluate the probability that the system is functional, assuming that nodes or links fail independently with known probabilities. Probabilistic measures of fault tolerance can often model the real world better than deterministic ones because the real world is too complex to be studied in a deterministic way. Several studies have been conducted on probabilistic measures. Najjar and Gaudiot [20] introduced network resilience, a probabilistic measure of network fault tolerance expressed in terms of the probability of a disconnection. However, the problem of computing the connection/disconnection probability of a general network has been proved to be NP-complete [11], [24]. In contrast, computing upper and lower bounds on the connection/disconnection probability is more efficient. For example, [26], [9], [8] studied lower bounds on the reliability of hypercubes, while Angel et al. [4], Chen et al. [10], and Liang et al. [17] investigated upper and lower bounds on the connection probability of meshes.

While the lower bound on the connection probability seems to be of more interest, the upper bound is also of great importance. First of all, to achieve a desired network connection probability, the individual node failure probability cannot exceed a certain threshold, and an upper bound on the connection probability determines such a threshold. Admittedly, the gap between the known lower and upper bounds is very large for large meshes. Nevertheless, the upper bound can still serve as a reference for the corresponding lower bound. Moreover, although simulations give good estimates of the connection probability, theoretical analysis gives rigorous results.

Chen et al. [10] gave upper bounds on the connection probability for 2-D meshes based on rhombuses. A rhombus of a 2-D mesh is defined as a connected subgraph consisting of five nodes: one node at the center and one neighboring node in each of the four directions. However, two factors weaken the upper bounds determined in this fashion. First, Chen et al. [10] derived the upper bounds by considering disjoint rhombuses, and a rhombus is too small to be a good tool for studying the connection probability of meshes. Second, because a node on the boundary of a mesh has degree smaller than 4, it is more likely to become isolated than a node of degree 4 in the interior of the mesh. This factor is not considered in [10].
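The second factor can be made concrete with a short worked calculation. It assumes the failure model used in this paper (each node fails independently with probability $p$) and calls a node isolated when it is nonfaulty while all of its $d$ neighbors are faulty. The symbol $q_d$ is introduced here only for illustration and does not appear in the original analysis.

```latex
\[
  q_d = (1-p)\,p^{d}, \qquad
  q_2 = (1-p)\,p^{2} \;>\; q_3 = (1-p)\,p^{3} \;>\; q_4 = (1-p)\,p^{4}
  \qquad (0 < p < 1).
\]
% Example with p = 0.1:
%   q_2 = 9 \times 10^{-3}, \quad q_3 = 9 \times 10^{-4}, \quad q_4 = 9 \times 10^{-5}.
```

Under these assumptions a given corner node is $1/p^2$ times as likely to be isolated as a given interior node, which is a factor of 100 when $p = 0.1$.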

Motivated by these deficiencies, in this paper we study upper bounds on the connection probability for 2-D meshes and 2-D tori and obtain tighter bounds. To achieve better upper bounds, we treat nodes on the boundary and nodes in the interior of a mesh separately, using combinatorial tools such as the principle of inclusion and exclusion, and we divide the meshes into small submeshes instead of rhombuses. Simulation results show that the upper bounds we derive are much closer to the exact connection probability than previously determined bounds.

The main contribution of this paper is a set of combinatorial methods and tools for obtaining tighter upper bounds on the connection probability without overly complicated computation; the resulting bounds are much better than those in [10]. For most of the large meshes studied in [10], the improvements are large. The methods and tools are more important than the results themselves. When a new network is designed, one often wants to know its connection probability in order to judge whether the network is a good one. Although such a newly designed network may be much more complicated than a mesh, the tools and methods used here to study bounds on the connection probability for meshes may also be applicable to it, and to the reliability of other objects.

To simplify our discussion, here we only consider node faults rather than link faults.

The rest of this paper is organized as follows. Section 2 gives an overview of related work. Section 3 introduces the notation and the methods used in this paper. Section 4 gives the main results of this paper, including some novel combinatorial approaches and new upper bounds on the connection probability for 2-D meshes. Extensive simulation results are presented in Section 5. Section 6 concludes the paper.

Related work

In this section, we survey related results and methods on the connection probability for meshes and other networks.

Najjar and Gaudiot [20] addressed the issue of network fault tolerance by looking at the probability of disconnection in a family of regular graph network topologies. A k-cluster in a regular network G is defined to be a connected branch of k nodes in G. A disconnection is caused by a k-cluster C if all nodes in C are nonfaulty but all other nodes adjacent to C are faulty. In [20],

Notations and definitions

A mesh or a torus considered in this paper has $mn$ nodes, with $m$ nodes along the $x$ dimension and $n$ nodes along the $y$ dimension. Each node is identified by a coordinate pair $(x, y)$, where $x$ and $y$ are integers such that $1 \le x \le m$ and $1 \le y \le n$. Such a mesh or torus is denoted by $M_{m \times n}$ or $T_{m \times n}$, respectively.

Definition 3.1

[10]

In $M_{m \times n}$, two nodes $v = (x, y)$ and $v' = (x', y')$ are neighbors if and only if $|x - x'| + |y - y'| = 1$.
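The neighbor relation of Definition 3.1 translates directly into code. The sketch below uses the 1-based coordinates of this section; the function names `mesh_neighbors` and `torus_neighbors` are hypothetical, and the torus version assumes the usual wraparound rule (coordinate $m$ is adjacent to coordinate 1, and likewise for $n$), which the text describes only informally.

```python
def mesh_neighbors(x, y, m, n):
    """Neighbors of (x, y) in the mesh M_{m x n}, with 1 <= x <= m and 1 <= y <= n."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    # Keep only candidates inside the mesh; these are exactly the nodes
    # (x', y') with |x - x'| + |y - y'| = 1.
    return [(a, b) for a, b in candidates if 1 <= a <= m and 1 <= b <= n]


def torus_neighbors(x, y, m, n):
    """Neighbors of (x, y) in the torus T_{m x n} (assumed wraparound rule)."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    # Wrap 1-based coordinates; a set removes duplicates when m or n is 2.
    wrapped = {((a - 1) % m + 1, (b - 1) % n + 1) for a, b in candidates}
    wrapped.discard((x, y))  # only matters in the degenerate case m == 1 or n == 1
    return sorted(wrapped)


if __name__ == "__main__":
    print(mesh_neighbors(1, 1, 4, 5))   # a corner node
    print(torus_neighbors(1, 1, 4, 5))  # the same node with wraparound links
```

In the mesh the length of the returned list is the degree of the node, so boundary nodes return fewer neighbors than interior ones, whereas every node of a torus with $m, n \ge 3$ has four neighbors.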

In $M_{m \times n}$, nodes at the four corners are of degree 2, other nodes on the boundary are of degree 3, and the rest of the

Upper bounds on the connection probability for meshes and tori

One of the important ideas used in this paper is to analyze the connection probability for a mesh by dividing the mesh into some small blocks, which follows similar ideas in [9], [8], [10].

The following discussion is based on the analysis of a single block. Therefore, when analyzing a specific submesh $M_{s,t}$, each node is located by a local coordinate pair $(x, y)$, where $1 \le x \le s$ and $1 \le y \le t$.
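One simple way to realize the block decomposition and the local coordinates described above is sketched below. The tiling scheme is an assumption made for illustration: it requires that $s$ divides $m$ and $t$ divides $n$, and the excerpt does not say how the paper handles meshes whose sides are not multiples of the block size. The names `to_local` and `blocks` are hypothetical.

```python
def to_local(x, y, s, t):
    """Map global (x, y) to ((block_x, block_y), (local_x, local_y)), all 1-based."""
    bx, lx = divmod(x - 1, s)
    by, ly = divmod(y - 1, t)
    return (bx + 1, by + 1), (lx + 1, ly + 1)


def blocks(m, n, s, t):
    """Yield the node lists of the s x t submeshes that tile the m x n mesh."""
    if m % s or n % t:
        # Assumption of this sketch: only exact tilings are supported.
        raise ValueError("s must divide m and t must divide n")
    for bx in range(m // s):
        for by in range(n // t):
            yield [(bx * s + i, by * t + j)
                   for i in range(1, s + 1)
                   for j in range(1, t + 1)]


if __name__ == "__main__":
    print(to_local(5, 3, 4, 3))           # block index and local coordinates
    print(len(list(blocks(8, 6, 4, 3))))  # number of 4 x 3 blocks in an 8 x 6 mesh
```

Every local coordinate pair returned by `to_local` satisfies $1 \le x \le s$ and $1 \le y \le t$, matching the convention used for a single block.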

Simulation results and discussions

Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results [25]. Each node in a 2-D mesh $M_{m \times n}$ is assumed to fail independently with probability $p$. In this section, Monte Carlo methods are used to estimate the connection probability for mesh networks, and the simulation results are used to evaluate the upper bounds obtained in Section 4.
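A Monte Carlo estimate of this kind can be sketched in a few lines. This is not the authors' simulator, whose details are not given in the excerpt: the function names, the default number of trials, the fixed seed, and the convention that a mesh with zero or one nonfaulty node counts as connected are all assumptions made for illustration.

```python
import random
from collections import deque


def is_connected(alive):
    """True if the set of nonfaulty nodes induces a connected subgraph of the mesh."""
    if len(alive) <= 1:
        return True  # assumed convention: an empty or single-node network is connected
    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over nonfaulty nodes only
        x, y = queue.popleft()
        for nb in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if nb in alive and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(alive)


def estimate_connection_probability(m, n, p, trials=10_000, seed=0):
    """Fraction of random fault patterns that leave the m x n mesh connected."""
    rng = random.Random(seed)
    connected = 0
    for _ in range(trials):
        # Each node independently survives with probability 1 - p.
        alive = {(x, y)
                 for x in range(1, m + 1)
                 for y in range(1, n + 1)
                 if rng.random() >= p}
        connected += is_connected(alive)
    return connected / trials


if __name__ == "__main__":
    for p in (0.01, 0.05, 0.1):
        print(p, estimate_connection_probability(20, 20, p, trials=2000))
```

The standard error of such an estimate shrinks roughly as the inverse square root of the number of trials, so comparing the estimate against a tight upper bound at small $p$ may need many more trials than the defaults used here.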

A simulator was designed to generate meshes $M_{m \times n}$ with node failure probability $p$, and then judge

Conclusions and future work

Meshes and tori are among the most important and popular interconnection topologies used to model the architectures of multicomputer systems. To understand the performance of these systems, fault tolerance should be taken into account. However, the fault tolerance analysis of mesh networks is complicated, especially for large-scale networks. Fortunately, dividing a mesh into several submeshes and analyzing the connection probability for these submeshes significantly reduces the complexity.

Acknowledgments

The authors thank the reviewers for their instructive comments. This work is partially supported by the NSF of China (61064002), by Guangxi Natural Science Foundation (2011GXNSFA018142, 0991027, 0991061), by the Scientific Research Foundation of GuangXi University (X081059), the Sichuan Youth Science & Technology Foundation (2010JQ0032), and the Chengdu University School Foundation (2010XJZ27).

Meilian Liang received her BSc degree and MSc degree in computer science from Guangxi University in 2002 and 2005, respectively. She is currently a lecturer of Information Science at Guangxi University, Nanning. Her research interests include data mining and algorithm design.

References (28)

  • J. Chen et al., Probabilistic analysis on mesh network fault tolerance, Journal of Parallel and Distributed Computing (2007)
  • S. Kim et al., Fault-tolerant wormhole routing in mesh with overlapped solid fault regions, Parallel Computing (1997)
  • A. Agarwal et al., The MIT Alewife machine: architecture and performance, ISCA (1995)
  • F.E. Allen et al., Blue Gene: a vision for protein science using a petaflop supercomputer, IBM Systems Journal (2001)
  • R. Alverson et al., The Tera Computer System, ICS (1990)
  • O. Angel et al., Routing complexity of faulty networks, Random Structures and Algorithms (2008)
  • K. Batcher, Design of a massively parallel processor, IEEE Transactions on Computers (1980)
  • C. Chen et al., A fault-tolerant routing scheme for meshes with nonconvex faults, IEEE Transactions on Parallel and Distributed Systems (2001)
  • J. Chen et al., Hypercube network fault tolerance: a probabilistic approach, Proceedings of the International Conference on Parallel Processing (ICPP'02) (2002)
  • J. Chen et al., Locally subcube-connected hypercube networks: theoretical analysis and experimental results, IEEE Transactions on Computers (2002)
  • C. Colbourn, The Combinatorics of Network Reliability (1987)
  • X. Dong et al., Practical deadlock-free fault-tolerant routing in meshes based on the planar network fault model, IEEE Transactions on Computers (2009)
  • A. Esfahanian, Generalized measures of fault tolerance with application to n-cube networks, IEEE Transactions on Computers (1989)
Xiaodong Xu received his BSc degree in mathematics from Harbin Normal University in 1995 and his MSc degree in applied mathematics from National Defense Science and Technique University in 2002. He is currently a research professor of Information Science at Guangxi Academy of Science, Nanning. His research interests include combinatorics, information theory, and number theory.

Jiarong Liang received his Doctorate in Engineering from South China University of Technology in 1998. He is currently a professor of Computer Science at Guangxi University, Nanning. His research interests include control theory and computer science.

Zehui Shao received his BSc degree and MSc degree in Engineering from Chengdu University of Technique in 2002 and 2005, respectively. He received his Doctorate in Science from Huazhong University of Science and Technology in 2009. He is currently a lecturer of Computer Science at Chengdu University, Chengdu. His research interests include combinatorics and algorithm design.
