Characterization, testing and reconfiguration of faults in mesh networks
Introduction
VLSI systems have been widely used as parallel models of computation [1], [2]. These systems consist of a large number of identical, elementary processing elements locally connected in a regular fashion. Each element receives data from its neighbors, computes, and sends the results back to its neighbors. A few particular elements located at the extremes of the system (which extremes depends on the particular system) are allowed to communicate with the external world. In this paper, we focus on VLSI mesh networks.
Recently, a number of two-dimensional mesh-based networks have been proposed, owing to their advantages of scalability, modularity, expandability, and degree boundedness. Commercial multiprocessor products based on the mesh have been announced by Ametek and Intel Scientific Computers. Mesh-based designs have been used in the ILLIAC IV computer, the Intel Paragon, the Cray T3D, and the Goodyear MPP massively parallel computer.
Fault-tolerant techniques are very important to VLSI systems. Here we assume that only processors can fail. The likelihood of failure increases with the number of processing elements. Without fault-tolerance capabilities, the yield of VLSI chips for such an architecture would be unacceptably poor. Thus, fault-tolerant mechanisms must be provided to prevent faulty processing elements from taking part in the computation. A widely used technique to achieve fault tolerance consists of adding redundancy to the desired architecture [3], [4]. In VLSI systems the redundancy consists of additional processing elements, called spares, and additional connections, called bypass links. Bypass links connect each processor with another processor at a fixed distance greater than 1. The redundant processing elements are used to replace any faulty processing element; the redundant links are used to bypass the faulty processing elements and reach others. The effectiveness of using redundancy to increase fault tolerance clearly depends on both the amount of redundancy and the reconfiguration capability of the system. It also depends, however, on the distribution of faults in the system. There are sets of faulty processing elements for which no reconfiguration strategy is possible. Such sets are called catastrophic fault patterns (CFPs). From a network perspective, such fault patterns can cause network disconnection.
If we have to reconfigure a system when a fault pattern occurs, it is necessary to know whether the fault pattern is catastrophic. Therefore, it is important to study the properties of catastrophic fault patterns. To date, the characterization of CFPs is known for linear arrays, with the following results. The characterization has been used to obtain efficient testing algorithms for both the unidirectional and bidirectional cases [5], with an order-of-magnitude improvement over [6], [7]. Efficient techniques have been obtained for constructing CFPs [8]. Using random walks as a tool, a closed-form solution for the number of CFPs for uni- and bidirectional links has been provided in [9], an improvement over [10]. Recently, Maity et al. [11] characterized catastrophic fault patterns for two-dimensional arrays.
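To make the notion of a catastrophic fault pattern concrete, the following is a minimal sketch of a naive reachability test for a unidirectional linear array. It is not the efficient testing algorithm of [5]. The function name `is_catastrophic_linear` and the boundary model are assumptions made for illustration: PEs are numbered 0 to n-1, each PE i has a regular link to i+1 and one bypass link to i+g, the input feeds the first g PEs, and the last g PEs feed the output.

```python
def is_catastrophic_linear(n, faults, g):
    """Return True if no path of working PEs leads from input to output.

    Assumed model (illustrative only): unidirectional links i -> i+1 and
    i -> i+g, input attached to PEs 0..g-1, output to PEs n-g..n-1, n >= g.
    """
    faults = set(faults)
    reachable = [False] * n
    for i in range(n):
        if i in faults:
            continue  # a faulty PE is never reachable
        if i < g:
            reachable[i] = True  # fed directly by the input
        else:
            # reached over a regular link or over the bypass link
            reachable[i] = reachable[i - 1] or reachable[i - g]
    return not any(reachable[n - g:])


if __name__ == "__main__":
    print(is_catastrophic_linear(10, {4, 5, 6}, 3))  # block of g faults
    print(is_catastrophic_linear(10, {4, 5}, 3))     # block of g-1 faults
```

Under this model a block of g consecutive faults cannot be jumped by a bypass link of length g, while a block of g-1 consecutive faults can.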
The main contribution of this paper is a complete characterization of CFPs for mesh networks. We determine the minimum number of faults required for a fault pattern to be catastrophic. From a practical viewpoint, this result helps answer the question of the guaranteed level of fault tolerance of a design, that is: will the system always withstand up to k faults, regardless of how and where they occur? We analyze catastrophic sets having the minimal number of faults. The paper also describes an algorithm for testing whether a set of faults is catastrophic. In addition, when a fault pattern is not catastrophic, we consider the problem of finding optimal reconfiguration strategies for both unidirectional and bidirectional networks, where optimality is with respect to either the number of processors in the reconfigured network or the number of bypass links used. The reconfiguration is optimal if the number of processing elements is maximized in the former case, or if the number of bypass links is minimized in the latter case.
The results for arrays apply to a large variety of commercially available array processors such as the geometric arithmetic parallel processor (GAPP) [12] of NCR, the distributed array processor (DAP) [13] of ICL, England, NASA's massively parallel processor (MPP) [14], and the connection machines [15] of Thinking Machines Corporation. The results presented in this paper also apply to a large number of processor arrays, including systolic arrays [1], the reconfigurable array of processors ELSA (European Large SIMD Array) [16], and a variety of special-purpose VLSI and experimental WSI devices for applications such as signal processing, image processing, and numerical computations. Furthermore, the results are equally applicable to memory chips, which are the most obvious candidates since the underlying architecture is highly regular and has a large number of identical cells.
Preliminaries
In this paper, we will focus on mesh networks. The basic components of such a network are the processing elements (PEs), indicated by circles in Fig. 1. There are two kinds of links: regular and bypass. Regular links connect neighboring (either horizontal or vertical) PEs, while bypass links connect non-neighbors. The bypass links are used strictly for reconfiguration purposes when a fault is detected; otherwise they are considered redundant links. We now introduce the following
Characterization of catastrophic fault patterns
In this section, we will characterize the catastrophic fault patterns for mesh networks and prove that the minimum number of faults in a catastrophic fault pattern is a function of , , the length of the longest horizontal bypass link and the length of the longest vertical bypass link. Theorem 3.1 Suppose divides and divides , then F is catastrophic with respect to implies that the cardinality of F, . Proof Suppose to the contrary that . Then partition
Necessary and sufficient conditions for a pattern to be catastrophic
In this section, we consider the particular case . Suppose we are given a fault pattern F with faults in a mesh network with link redundancy . We now consider the Tower-Bridge representation of used in the proof of Theorem 3.1. We label the rows of a sub-block or floor with and the columns with . For example consider the sub-block or floor with PEs , , , , in Fig. 5. We label the two rows [,
Bidirectional mesh
Let be a bidirectional mesh network of processors with link redundancy , and let F be a fault pattern with m faults. A simple way to test if F is catastrophic for is to consider a graph whose set of vertices is given by the chunks of working processors. More formally, we construct a graph as follows: The set V of vertices is , where 's represent chunks of F and if and only if there are two processors, and such that
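As a rough illustration of testing for disconnection in a bidirectional mesh, the sketch below runs a breadth-first search directly over working PEs rather than over the chunk graph described above. Everything here is an assumption made for illustration: the function name `working_components`, the representation of link redundancy as two tuples of horizontal and vertical link lengths (length 1 being the regular links), and the use of "more than one connected component of working PEs" as a stand-in for the paper's precise definition of catastrophic.

```python
from collections import deque


def working_components(rows, cols, faults, h_lengths=(1,), v_lengths=(1,)):
    """Count connected components of working PEs in a bidirectional mesh.

    h_lengths / v_lengths are the assumed horizontal and vertical link
    lengths; 1 is a regular link, anything larger is a bypass link.
    """
    faults = set(faults)
    steps = [(0, d) for d in h_lengths] + [(d, 0) for d in v_lengths]
    seen = set()
    components = 0
    for r in range(rows):
        for c in range(cols):
            if (r, c) in faults or (r, c) in seen:
                continue
            components += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in steps:
                    for sign in (1, -1):  # links are bidirectional
                        ny, nx = y + sign * dy, x + sign * dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and (ny, nx) not in faults and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
    return components


if __name__ == "__main__":
    one_column = {(r, 1) for r in range(4)}
    two_columns = one_column | {(r, 2) for r in range(4)}
    print(working_components(4, 4, one_column, (1, 2), (1, 2)))
    print(working_components(4, 4, two_columns, (1, 2), (1, 2)))
```

In this toy model a single faulty column is bridged by horizontal bypass links of length 2, whereas two adjacent faulty columns split the working PEs into two components.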
Maximum escape paths
In this section we consider the problem of finding maximum escape paths. We prove that the problem is NP-complete for a bidirectional mesh network, while for unidirectional mesh we provide an algorithm that finds a maximum escape path in time.
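The unidirectional case is tractable because the network is acyclic when every link points in a forward direction. The sketch below is a generic longest-path dynamic program over such a mesh, not necessarily the algorithm of the paper. It assumes (for illustration only) that links go rightward and downward, that an escape path runs between a given `start` and `goal` PE, and that path length is measured by the number of working PEs visited; the name `max_escape_path` and its parameters are hypothetical.

```python
def max_escape_path(rows, cols, faults, start, goal,
                    h_lengths=(1,), v_lengths=(1,)):
    """Longest start-to-goal path (in PEs) in a right/down unidirectional mesh.

    Row-major order is a topological order here, because every assumed
    link increases the row or the column index.
    Returns the path as a list of (row, col), or None if goal is unreachable.
    """
    faults = set(faults)
    if start in faults or goal in faults:
        return None
    best = {start: 1}  # best[p] = most PEs on any path from start to p
    prev = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) not in best:
                continue  # faulty or not reachable from start
            successors = [(r, c + d) for d in h_lengths]
            successors += [(r + d, c) for d in v_lengths]
            for nr, nc in successors:
                if nr >= rows or nc >= cols or (nr, nc) in faults:
                    continue
                if best[(r, c)] + 1 > best.get((nr, nc), 0):
                    best[(nr, nc)] = best[(r, c)] + 1
                    prev[(nr, nc)] = (r, c)
    if goal not in best:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(max_escape_path(3, 3, {(0, 1), (1, 0)}, (0, 0), (2, 2),
                          h_lengths=(1, 2), v_lengths=(1, 2)))
```

Each PE is expanded once, so the running time of this sketch is proportional to the number of PEs times the number of link lengths.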
Minimum escape paths
In this section we consider the problem of finding minimum escape paths. We prove that the problem can be solved in time if the bypass links are bidirectional, and in time if the bypass links are unidirectional. First we consider the case of bidirectional links.
Given a fault pattern F in a bidirectional mesh network, we construct the auxiliary graph and assign weight 1 to all the edges representing the bypass links (horizontal or vertical) in
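A weighting of this kind can be explored with a 0-1 breadth-first search. The sketch below is an assumption-laden illustration rather than the paper's construction: it works directly on the mesh instead of the auxiliary graph, it assumes regular links carry weight 0 (the text above only states weight 1 for bypass links), and the name `min_bypass_links`, the `start`/`goal` endpoints, and the bypass-length parameters are hypothetical.

```python
from collections import deque


def min_bypass_links(rows, cols, faults, start, goal,
                     h_bypass=(), v_bypass=()):
    """Fewest bypass links on any start-to-goal path in a bidirectional mesh.

    Regular links cost 0 (assumed), bypass links cost 1.
    Returns None when goal cannot be reached through working PEs.
    """
    faults = set(faults)
    if start in faults or goal in faults:
        return None
    # (row step, column step, weight) for every assumed link type
    moves = [(0, 1, 0), (1, 0, 0)]
    moves += [(0, d, 1) for d in h_bypass]
    moves += [(d, 0, 1) for d in v_bypass]
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc, w in moves:
            for sign in (1, -1):  # links are bidirectional
                nxt = (r + sign * dr, c + sign * dc)
                if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                    continue
                if nxt in faults:
                    continue
                cand = dist[(r, c)] + w
                if cand < dist.get(nxt, float("inf")):
                    dist[nxt] = cand
                    # weight-0 edges go to the front, weight-1 to the back
                    if w == 0:
                        queue.appendleft(nxt)
                    else:
                        queue.append(nxt)
    return dist.get(goal)


if __name__ == "__main__":
    column = {(r, 1) for r in range(3)}
    print(min_bypass_links(3, 3, column, (0, 0), (2, 2), h_bypass=(2,)))
```

Using a deque in place of a priority queue keeps the search linear in the number of links, which is the usual reason to prefer 0-1 BFS when edge weights are restricted to 0 and 1.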
Conclusions and open problems
In this paper we have completely characterized catastrophic fault patterns for mesh networks. We proved that, F is catastrophic with respect to implies that the cardinality of F, when divides and divides . Necessary and sufficient conditions are given for a fault pattern to be catastrophic with respect to link redundancy . Given a mesh network with a link redundancy G, an important problem is to count the number of catastrophic fault
Soumen Maity is currently an Assistant Professor in the Department of Mathematics of Indian Institute of Technology, Guwahati, India. He received his Ph.D. degree in Combinatorics from Indian Statistical Institute, Kolkata, India, in 2002. His research interests include Combinatorial methods in Statistics and VLSI Designs, and Cryptography.
References (18)
- et al., A fault tolerant massively parallel processing architecture, J. Parallel Distributed Comput. (1987)
- et al., On testing of catastrophic faults in reconfigurable arrays with arbitrary link redundancy, Integration VLSI J. (1996)
- et al., Catastrophic faults in reconfigurable systolic linear arrays, Discrete Appl. Math. (1997)
- et al., Efficient construction of catastrophic patterns for VLSI reconfigurable arrays, Integration VLSI J. (1993)
- et al., Enumerating catastrophic fault patterns in VLSI arrays with both uni- and bidirectional links, Integration VLSI J. (2001)
- et al., Counting the number of fault patterns in redundant VLSI arrays, Inf. Process. Lett. (1994)
- et al., On characterization of catastrophic faults in two-dimensional VLSI arrays, Integration VLSI J. (2004)
- Why systolic architecture?, IEEE Comput. (1982)
- et al., Wafer-scale integration of systolic arrays, IEEE Trans. Comput. (1985)
Amiya Nayak received his B.Math. degree in Computer Science and Combinatorics & Optimization from the University of Waterloo in 1981, and his Ph.D. in Systems and Computer Engineering from Carleton University in 1991. He has over 17 years of industrial experience, working at CMC Electronics (formerly known as Canadian Marconi Company), Defence Research Establishment Ottawa (DREO), EER Systems and Nortel Networks, in software engineering, avionics and navigation systems, simulation and system-level performance analysis. He has been an Adjunct Research Professor in the School of Computer Science at Carleton University since 1994. He was the Book Review and Canadian Editor of VLSI Design from 1996 to 2002. He is on the Editorial Board of the International Journal of Parallel, Emergent and Distributed Systems and the International Journal of Computers, Information Technology & Engineering, and is the Associate Editor of the International Journal of Computing and Information Science. Currently, he is a Full Professor at the School of Information Technology and Engineering (SITE) at the University of Ottawa. His research interests are in the areas of fault tolerance, distributed systems/algorithms, and mobile ad hoc networks, with over 100 publications in refereed journals and conference proceedings.
Sundarkumar Ramsundar is currently pursuing his B.Tech. degree in Computer Science and Engineering at the Indian Institute of Technology, Guwahati, India. His research interests include Fault-Tolerant Computing, Robust System Design for Emerging Nanotechnologies, and Design and Analysis of Algorithms.