Upper bounds on the connection probability for 2-D meshes and tori
Highlights
► We give a method to obtain upper bounds on the connection probability for meshes and tori.
► A mesh is divided into submeshes to reduce the complexity of the analysis.
► Combinatorial techniques are used, focusing mainly on single-node disconnections.
► Our new upper bounds for meshes are much tighter than the known ones.
► Simulation results show that our upper bounds are close to the exact probabilities.
Introduction
Many different topologies are used to model the architectures of large parallel computers, among which meshes and tori are important and popular. Basically a -dimensional mesh has nodes, with nodes along dimension . 2-dimensional and 3-dimensional meshes, abbreviated as 2-D and 3-D meshes, respectively, have been studied by many researchers. Tori can be considered as meshes with wraparound connections that achieve vertex and edge symmetry. Being simple and scalable, meshes and tori have not only become the basic topologies for much theoretical research [19], but have also been adopted in many well-known commercial multicomputer systems. For example, 2-D mesh computers include the Intel Touchstone DELTA [27], Stanford DASH [16], MIT Alewife [1], and Goodyear Aerospace MPP [5]. 3-D meshes include the Blue Gene supercomputer [2], Tera Computer System [3], and MIT J-machine [21]. The Alpha 21364 is a 2-D torus and the Cray T3D is a 3-D torus [6].
A system is said to be fault tolerant if it can remain functional in the presence of faults [13]. A massively parallel computer system is complex, which increases the likelihood of having some faulty components (e.g., processors/nodes or communication links/edges). Hence, the fault tolerance of a system plays a crucial role in its actual performance. One functionality criterion states that a system is functional if and only if there is a nonfaulty communication path between each pair of nonfaulty processors [23]. That is, if all nonfaulty nodes and links in the system form a connected graph (which can be used to simulate the entire system and mask the effects of faults), then the system is fault tolerant.
Some graph-theoretic concepts are used to develop deterministic measures of fault tolerance. Conventionally, network fault tolerance is defined as the maximum number of nodes that can fail without possibly disconnecting the network [22], a number which is smaller than the connectivity of the network by one. Accordingly, the network fault tolerance of a regular graph topology with degree is at most . For a 2-D mesh, the network fault tolerance is only 1, because the failure of the two neighboring nodes of a nonfaulty corner node would disconnect the corner node from any other nonfaulty node (if any) of the system. As a matter of fact, this traditional measure of fault tolerance is quite conservative. Consequently, much effort has been devoted to introducing more realistic measures. Many measures, either deterministic or probabilistic, have been proposed over the years. Esfahanian [13] introduced the concept of forbidden faulty sets, whose components cannot all be faulty at the same time. Esfahanian’s results showed that an -cube can tolerate up to node failures and remain connected provided that, for each node in the network, all the nodes which are directly connected to do not fail at the same time. Latifi et al. [15] proposed the concept of k-safeness, based on the idea of forbidden faulty sets, and showed that a -safe -cube can tolerate up to faulty nodes. All of these results demonstrate that the traditional fault tolerance measure tends to underestimate the fault tolerance of large networks. In fact, although many deterministic fault-tolerant routing algorithms for meshes and tori have been proposed [14], [7], [28], [12], little research has focused on deterministic fault tolerance measures of meshes or tori.
Probabilistic measures use a random graph model to characterize a system. These measures evaluate the probability of the system being functional by assuming that nodes or links fail independently with known probabilities. Probabilistic measures of fault tolerance can often model the real world better than deterministic ones because the real world is too complex to be studied in a deterministic way. Several studies have been conducted on probabilistic measures. Najjar and Gaudiot [20] introduced network resilience, a probabilistic measure of network fault tolerance expressed as the probability of a disconnection. However, the problem of computing the connection/disconnection probability of a general network has been proved to be NP-complete [11], [24]. In contrast, computing upper and lower bounds on the connection/disconnection probability is more efficient. For example, [26], [9], [8] studied lower bounds on the reliability of hypercubes, while Angel et al. [4], Chen et al. [10], and Liang et al. [17] investigated upper and lower bounds on the connection probability of meshes.
While the lower bound on the connection probability seems to be of more interest, the upper bound is also of great importance. First of all, to achieve a desired network connection probability, the individual node failure probability cannot exceed a threshold that the upper bound determines. In fact, the gap between the known lower and upper bounds is very large for meshes of large sizes. Nevertheless, the upper bound can still provide a reference for the corresponding lower bound. Although simulations give good estimates of the connection probability, theoretical analysis gives rigorous results.
Chen et al. [10] gave upper bounds on the connection probability for 2-D meshes based on rhombuses. A rhombus of a 2-D mesh is defined as a connected subgraph consisting of five nodes: one node located at the center and each of the other four nodes adjacent to it in one of the four directions. However, two important factors weaken the upper bounds determined in this fashion. First, Chen et al. [10] derived the upper bounds by considering disjoint rhombuses, and a rhombus is too small to be a good tool for studying the connection probability of meshes. Second, because the degree of a node on the boundary of a grid-based mesh is smaller than 4, such a node is more likely to be isolated than a node of degree 4 in the middle of the mesh. This factor is not considered in [10].
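The second factor can be made concrete with a small calculation. The sketch below assumes the independent-failure model described earlier, with every node failing with the same probability; the symbol `p` and the function name `isolation_probability` are illustrative choices, not notation from the paper. Under that assumption, a fixed node of degree d is nonfaulty while all of its neighbors are faulty with probability (1 - p) * p^d, so lower-degree boundary nodes are more likely to be cut off.

```python
def isolation_probability(p: float, degree: int) -> float:
    """Probability that one fixed node is nonfaulty while all of its
    `degree` neighbors are faulty, assuming independent node failures
    with common probability p (an assumption of this sketch)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if degree < 0:
        raise ValueError("degree must be nonnegative")
    return (1.0 - p) * p ** degree


if __name__ == "__main__":
    p = 0.1  # illustrative failure probability
    # Node degrees in a 2-D mesh: corner 2, other boundary nodes 3, interior 4.
    for name, d in (("corner", 2), ("boundary", 3), ("interior", 4)):
        print(f"{name:8s} degree {d}: {isolation_probability(p, d):.6f}")
```

For any p strictly between 0 and 1, the value decreases as the degree grows, which is why ignoring boundary nodes discards the most likely single-node disconnections.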
Motivated by the aforementioned deficiencies, in this paper we study upper bounds on the connection probability for 2-D meshes and 2-D tori, and obtain tighter upper bounds. To achieve better upper bounds, we handle both the nodes on the boundary and those in the middle of a mesh using combinatorial tools such as the principle of inclusion and exclusion, and divide the meshes into small submeshes instead of rhombuses. Simulation results show that the upper bounds we have derived are much closer to the exact connection probability than previously determined bounds.
The main contribution of this paper is a set of combinatorial methods and tools that yield tighter upper bounds on the connection probability than those in [10] without overly complicated computation. For most of the large meshes studied in [10], the improvements are large. The methods and tools are more important than the results themselves. When a new network is designed, one often wants to know its connection probability in order to judge its quality. Although such a newly designed network may be much more complicated than a mesh, the tools and methods used here to bound the connection probability for meshes may also be applicable to it, and more generally to reliability studies of other objects.
To simplify our discussion, here we only consider node faults rather than link faults.
The rest of this paper is organized as follows. Section 2 overviews related work. Section 3 introduces the notations and the methods used in this paper. In Section 4, we give the main results of this paper, including some novel combinatorial approaches and new upper bounds on the connection probability for 2-D meshes. The extensive simulation results are presented in Section 5. Section 6 concludes the paper.
Related work
In this section, we survey related results and methods on the connection probability for meshes and other networks.
Najjar and Gaudiot [20] addressed the issue of network fault tolerance by looking at the probability of disconnection in a family of regular graph network topologies. A k-cluster in a regular network is defined to be a connected branch of nodes in . A disconnection is caused by a -cluster if all nodes in are nonfaulty but all other nodes adjacent to are faulty. In [20],
Notations and definitions
A mesh or a torus considered in this paper has nodes, with nodes along the dimension and nodes along the dimension. Each node is identified by a coordinate pair , where , are integers such that and . Such a mesh or a torus is denoted as or , respectively.
Definition 3.1 In , two nodes and are neighbors if and only if .[10]
In , nodes at the four corners are of degree 2, other nodes on the boundary are of degree 3, and the rest of the
Upper bounds on the connection probability for meshes and tori
One of the important ideas used in this paper is to analyze the connection probability for a mesh by dividing the mesh into some small blocks, which follows similar ideas in [9], [8], [10].
The following discussion is based on the analysis of a single block. Therefore, when analyzing a specific submesh , each node is located by a local coordinate pair , where and .
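The block decomposition and the switch to local coordinates can be sketched as follows. This is only an illustration under stated assumptions: the block height and width (`bh`, `bw`), the 0-based indexing, and the function names are hypothetical, since the paper's actual block dimensions and coordinate ranges are not shown in this excerpt.

```python
from typing import List, Tuple

# A block is described as (row_start, col_start, height, width).
Block = Tuple[int, int, int, int]


def split_into_blocks(m: int, n: int, bh: int, bw: int) -> List[Block]:
    """Tile an m x n mesh with disjoint blocks of at most bh x bw nodes.
    Blocks on the last row/column of the tiling may be smaller."""
    if min(m, n, bh, bw) <= 0:
        raise ValueError("all sizes must be positive")
    blocks = []
    for r in range(0, m, bh):
        for c in range(0, n, bw):
            blocks.append((r, c, min(bh, m - r), min(bw, n - c)))
    return blocks


def to_local(i: int, j: int, block: Block) -> Tuple[int, int]:
    """Convert global mesh coordinates (i, j) into coordinates
    relative to the top-left node of the given block."""
    r, c, h, w = block
    if not (r <= i < r + h and c <= j < c + w):
        raise ValueError("node is outside this block")
    return (i - r, j - c)


if __name__ == "__main__":
    for block in split_into_blocks(5, 7, 2, 3):
        print(block)
```

Because the blocks are disjoint and cover every node exactly once, events that depend only on the nodes inside different blocks are independent under the independent-failure model, which is what makes a block-by-block analysis tractable.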
Simulation results and discussions
Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results [25]. Since each node in a 2-D mesh fails independently with probability , in this section Monte Carlo methods are used to estimate the connection probability for mesh networks, and the simulation results can be used to evaluate those upper bounds obtained in Section 4.
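A Monte Carlo estimate of this kind can be sketched as below. It is a minimal illustration, not the authors' simulator: the function names, the default trial count, the fixed seed, and the convention that a mesh with no nonfaulty nodes counts as connected are all assumptions of this sketch. Each trial marks every node faulty independently with probability `p`, then uses breadth-first search to check whether the nonfaulty nodes form a connected graph.

```python
import random
from collections import deque


def is_connected(alive, m, n, torus=False):
    """Return True if the nonfaulty nodes in `alive` (a set of (i, j)
    pairs in an m x n mesh or torus) form a connected graph.
    Assumption of this sketch: an empty set counts as connected."""
    if not alive:
        return True
    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if torus:
                a, b = a % m, b % n  # wraparound links
            elif not (0 <= a < m and 0 <= b < n):
                continue  # outside the mesh boundary
            if (a, b) in alive and (a, b) not in seen:
                seen.add((a, b))
                queue.append((a, b))
    return len(seen) == len(alive)


def estimate_connection_probability(m, n, p, trials=10000, torus=False, seed=0):
    """Estimate the connection probability by repeated random sampling,
    with each node failing independently with probability p."""
    rng = random.Random(seed)
    connected = 0
    for _ in range(trials):
        alive = {(i, j) for i in range(m) for j in range(n)
                 if rng.random() >= p}
        if is_connected(alive, m, n, torus):
            connected += 1
    return connected / trials


if __name__ == "__main__":
    print(estimate_connection_probability(10, 10, 0.05, trials=2000))
```

The estimate converges slowly (the standard error shrinks with the square root of the number of trials), so comparing it against an analytical upper bound requires enough trials for the sampling error to be small relative to the gap being examined.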
A simulator was designed to generate meshes with node faulty probability , and then judge
Conclusions and future work
Meshes and tori are among the most important and popular interconnection topologies used to model the architectures of multicomputer systems. To understand the performance of these systems, fault tolerance should be taken into account. However, the fault tolerance analysis of mesh networks is complicated, especially for large-scale networks. Fortunately, dividing a mesh into several submeshes and analyzing the connection probability for these submeshes significantly reduces the complexity.
Acknowledgments
The authors thank the reviewers for their instructive comments. This work is partially supported by the NSF of China (61064002), by Guangxi Natural Science Foundation (2011GXNSFA018142, 0991027, 0991061), by the Scientific Research Foundation of GuangXi University (X081059), the Sichuan Youth Science & Technology Foundation (2010JQ0032), and the Chengdu University School Foundation (2010XJZ27).
Meilian Liang received her BSc degree and MSc degree in computer science from Guangxi University in 2002 and 2005, respectively. She is currently a lecturer of Information Science at Guangxi University, Nanning. Her research interests include data mining and algorithm design.
References (28)
- et al. Probabilistic analysis on mesh network fault tolerance. Journal of Parallel and Distributed Computing (2007)
- et al. Fault-tolerant wormhole routing in mesh with overlapped solid fault regions. Parallel Computing (1997)
- et al. The MIT Alewife machine: architecture and performance. ISCA (1995)
- et al. Blue Gene: a vision for protein science using a petaflop supercomputer. IBM Systems Journal (2001)
- et al. The Tera Computer System. ICS (1990)
- et al. Routing complexity of faulty networks. Random Structures and Algorithms (2008)
- Design of a massively parallel processor. IEEE Transactions on Computers (1980)
- et al. A fault-tolerant routing scheme for meshes with nonconvex faults. IEEE Transactions on Parallel and Distributed Systems (2001)
- et al. Hypercube network fault tolerance: a probabilistic approach. Proceedings of the International Conference on Parallel Processing (ICPP’02) (2002)
- Locally subcube-connected hypercube networks: theoretical analysis and experimental results. IEEE Transactions on Computers
- The Combinatorics of Network Reliability
- Practical deadlock-free fault-tolerant routing in meshes based on the planar network fault model. IEEE Transactions on Computers
- Generalized measures of fault tolerance with application to -cube networks. IEEE Transactions on Computers
Xiaodong Xu received his BSc degree in mathematics from Harbin Normal University in 1995, and his MSc degree in applied mathematics from National Defense Science and Technique University in 2002. He is currently a research professor of Information Science at Guangxi Academy of Science, Nanning. His research interests include combinatorics, information theory, and number theory.
Jiarong Liang received his Doctorate in Engineering from South China University of Technology in 1998. He is currently a professor of Computer Science at Guangxi University, Nanning. His research interests include control theory and computer science.
Zehui Shao received his BSc degree and MSc degree in Engineering from Chengdu University of Technique in 2002 and 2005, respectively. He received his Doctorate in Science from Huazhong University of Science and Technology in 2009. He is currently a lecturer of Computer Science in Chengdu University, Chengdu. His research interests include combinatorics and algorithm design.