Elsevier

Integration

Volume 40, Issue 4, July 2007, Pages 525-535

Characterization, testing and reconfiguration of faults in mesh networks

https://doi.org/10.1016/j.vlsi.2006.11.002

Abstract

Achieving fault-tolerance through incorporation of redundancy and reconfiguration is quite common. The distribution of faults can have several impacts on the effectiveness of any reconfiguration scheme; in fact, patterns of faults occurring at strategic locations may render an entire VLSI system unusable regardless of its component redundancy and its reconfiguration capabilities. Such fault patterns are called catastrophic fault patterns (CFPs). In this paper, we characterize catastrophic fault patterns in mesh networks when the links are bidirectional or unidirectional. We determine the minimum number of faults required for a fault pattern to be catastrophic. We consider the problem of testing whether a fault pattern is catastrophic. When a fault pattern is not catastrophic we study the problem of finding optimal reconfiguration strategies, where optimality is with respect to either the number of processing elements in the reconfigured network (the reconfiguration is optimal if such a number is maximized) or the number of bypass links to activate in order to reconfigure the array (the reconfiguration is optimal if such a number is minimized). The problem of finding a reconfiguration strategy that is optimal with respect to the size of the reconfigured network is NP-complete, when the links are bidirectional, while it can be solved in polynomial time, when the links are unidirectional. Considering optimality with respect to the number of bypass links to activate, we provide algorithms which efficiently find an optimal reconfiguration.

Introduction

VLSI systems have been widely used as parallel models of computation [1], [2]. These systems consist of a large number of identical, elementary processing elements locally connected in a regular fashion. Each element receives data from its neighbors, computes, and sends the results on to its neighbors. A few particular elements located at the extremes of the system (which extremes depend on the particular system) are allowed to communicate with the external world. In this paper, we focus on VLSI mesh networks.

Recently, a number of two-dimensional mesh-based networks have been proposed, owing to their advantages of scalability, modularity, expandability, and degree boundedness. Commercial multiprocessor products based on the mesh have been announced by Ametek and Intel Scientific Computers. Mesh-based designs have been used in the ILLIAC IV computer, Intel Paragon, Cray T3D, and the Goodyear MPP massively parallel computer.

Fault-tolerant techniques are very important to VLSI systems. Here we assume that only processors can fail. The likelihood of failure increases with the number of processing elements. Without fault-tolerance capabilities, the yield of VLSI chips for such an architecture would be unacceptably poor. Thus, fault-tolerant mechanisms must be provided to keep faulty processing elements from taking part in the computation. A widely used technique to achieve fault tolerance consists of adding redundancy to the desired architecture [3], [4]. In VLSI systems the redundancy consists of additional processing elements, called spares, and additional connections, called bypass links. Bypass links are links that connect each processor with another processor at a fixed distance greater than 1. The redundant processing elements are used to replace any faulty processing element; the redundant links are used to bypass the faulty processing elements and reach others. The effectiveness of using redundancy to increase fault tolerance clearly depends on both the amount of redundancy and the reconfiguration capability of the system. It also depends, however, on the distribution of faults in the system. There are sets of faulty processing elements for which no reconfiguration strategy is possible. Such sets are called catastrophic fault patterns (CFPs). From a network perspective, such fault patterns can cause network disconnection.

To reconfigure a system when a fault pattern occurs, it is necessary to know whether the fault pattern is catastrophic. Therefore, it is important to study the properties of catastrophic fault patterns. To date, CFPs have been characterized for linear arrays, with the following results. The characterization has been used to obtain efficient testing algorithms for both the unidirectional and bidirectional cases [5], with an order-of-magnitude improvement over [6], [7]. Efficient techniques have been obtained for constructing CFPs [8]. Using random walks as a tool, a closed-form solution for the number of CFPs for unidirectional and bidirectional links was provided in [9], an improvement over [10]. Recently, Maity et al. [11] characterized catastrophic fault patterns for two-dimensional arrays.

The main contribution of this paper is a complete characterization of CFPs for mesh networks. We determine the minimum number of faults required for a fault pattern to be catastrophic. From a practical viewpoint, this result helps answer the question of the guaranteed level of fault tolerance of a design, that is: will the system always withstand up to k faults, regardless of how and where they occur? We analyze catastrophic sets having the minimal number of faults. The paper also describes an algorithm for testing whether a set of faults is catastrophic. In addition, when a fault pattern is not catastrophic, we consider the problem of finding optimal reconfiguration strategies for both unidirectional and bidirectional networks, where optimality is with respect to either the number of processors in the reconfigured network or the number of bypass links. The reconfiguration is optimal if the number of processing elements is maximized in the former case, or if the number of bypass links is minimized in the latter.

The results for arrays apply to a large variety of commercially available array processors such as the geometric arithmetic parallel processor (GAPP) [12] of NCR, the distributed array processor (DAP) [13] of ICL, England, NASA's massively parallel processor (MPP) [14], and the connection machines [15] of Thinking Machines Corporation. The results presented in this paper also apply to a large number of processor arrays, including systolic arrays [1], the reconfigurable array of processors ELSA (European Large SIMD Array) [16], and a variety of special-purpose VLSI and experimental WSI devices for applications such as signal processing, image processing, and numerical computations. Furthermore, the results are equally applicable to memory chips, which are the most obvious candidates since the underlying architecture is highly regular and has a large number of identical cells.

Preliminaries

In this paper, we will focus on mesh networks. The basic components of such a network are the processing elements (PEs), indicated by circles in Fig. 1. There are two kinds of links: regular and bypass. Regular links connect neighboring (either horizontal or vertical) PEs, while bypass links connect non-neighbors. The bypass links are used strictly for reconfiguration purposes when a fault is detected; otherwise they are considered redundant links. We now introduce the following

Characterization of catastrophic fault patterns

In this section, we will characterize the catastrophic fault patterns for mesh networks and prove that the minimum number of faults in a catastrophic fault pattern is a function of N1, N2, the length of the longest horizontal bypass link and the length of the longest vertical bypass link.

Theorem 3.1

Suppose $v_l$ divides $N_1$ and $g_k$ divides $N_2$. If $F$ is catastrophic with respect to $M$, then the cardinality of $F$ satisfies $|F| \ge \max\{N_1/v_l,\, N_2/g_k\}\, v_l g_k$.
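
As a quick numerical illustration of Theorem 3.1, the following Python sketch evaluates the lower bound, read here as max{N1/vl, N2/gk}·vl·gk (this reading is an assumption; it agrees with the N1·gk fault count used later for the case N1/vl = N2/gk). The function and argument names are hypothetical and do not come from the paper.

```python
def min_catastrophic_faults(n1: int, n2: int, v_l: int, g_k: int) -> int:
    """Evaluate the Theorem 3.1 lower bound max{N1/v_l, N2/g_k} * v_l * g_k.

    n1, n2 : mesh dimensions N1 and N2
    v_l    : length of the longest vertical bypass link
    g_k    : length of the longest horizontal bypass link
    """
    # The theorem is stated only under these divisibility hypotheses.
    if n1 % v_l != 0 or n2 % g_k != 0:
        raise ValueError("Theorem 3.1 assumes v_l divides N1 and g_k divides N2")
    return max(n1 // v_l, n2 // g_k) * v_l * g_k


# Balanced case N1/v_l == N2/g_k: the bound equals N1 * g_k.
print(min_catastrophic_faults(6, 9, 2, 3))
# Unbalanced case: the larger of the two ratios dominates.
print(min_catastrophic_faults(8, 6, 2, 3))
```

Under this reading, any fault pattern with fewer faults than the returned value cannot be catastrophic, which is the guaranteed level of fault tolerance discussed in the introduction.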

Proof

Suppose to the contrary that $|F| < \max\{N_1/v_l,\, N_2/g_k\}\, v_l g_k$. Then partition

Necessary and sufficient conditions for a pattern to be catastrophic

In this section, we consider the particular case $N_1/v_l = N_2/g_k$. Suppose we are given a fault pattern $F$ with $N_1 g_k$ faults in a mesh network with link redundancy $G=(g_1,g_2,\ldots,g_k \mid v_1,v_2,\ldots,v_l)$. We now consider the Tower-Bridge representation of $M$ used in the proof of Theorem 3.1. We label the $v_l$ rows of a sub-block or floor with $0,1,\ldots,v_l-1$ and the $g_k$ columns with $0,1,\ldots,g_k-1$. For example, consider the sub-block or floor with PEs (4,1), (4,2), (4,3), (3,1), (3,2), (3,3) in Fig. 5. We label the two rows [(4,1),

Bidirectional mesh

Let $M$ be a bidirectional mesh network of $N_1 \times N_2$ processors with link redundancy $G=(1,g_2,\ldots,g_k \mid 1,v_2,\ldots,v_l)$, and let $F$ be a fault pattern with $m$ faults. A simple way to test whether $F$ is catastrophic for $M$ is to consider a graph whose set of vertices is given by the chunks of working processors. More formally, we construct a graph $H=(V,E)$ as follows: the set $V$ of vertices is $\{C_0,C_1,\ldots,C_n\}$, where the $C_i$ represent the chunks of $F$, and $(C_i,C_j) \in E$ if and only if there are two processors $p_{xy} \in C_i$ and $p_{x'y'} \in C_j$ such that $y$

Maximum escape paths

In this section we consider the problem of finding maximum escape paths. We prove that the problem is NP-complete for a bidirectional mesh network, while for unidirectional mesh we provide an algorithm that finds a maximum escape path in O(N1+N2+w|G|) time.

Minimum escape paths

In this section we consider the problem of finding minimum escape paths. We prove that the problem can be solved in O(w|G|+(N1+N2+w)log(N1+N2+w)) time if the bypass links are bidirectional, and in O(N1+N2+w|G|) time if the bypass links are unidirectional. First we consider the case of bidirectional links.

Given a fault pattern F in a bidirectional mesh network, we construct the auxiliary graph G0=(V,E) and assign weight 1 to all the edges representing the bypass links (horizontal or vertical) in
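
The weighting idea above can be illustrated with a small shortest-path sketch. This is not the paper's auxiliary-graph construction: it works directly on individual PEs, and it makes several assumptions that the snippet does not confirm. Regular links are given weight 0 (only the weight 1 for bypass links is stated above), an escape path is taken to run from any working PE in the leftmost column to any working PE in the rightmost column, and the function and parameter names are hypothetical.

```python
import heapq


def min_bypass_escape_path(n1, n2, faults, horiz, vert):
    """Fewest bypass links on a left-to-right path of working PEs, or None.

    Assumptions (not from the paper): PEs are (row, col) pairs with
    0 <= row < n1 and 0 <= col < n2; links are bidirectional; `horiz` and
    `vert` list link lengths, where length 1 is a regular link (weight 0)
    and any longer length is a bypass link (weight 1).
    """
    faults = set(faults)
    # All horizontal and vertical moves allowed by the link lengths.
    steps = [(0, s * g) for g in horiz for s in (1, -1)]
    steps += [(s * v, 0) for v in vert for s in (1, -1)]

    # Multi-source Dijkstra from every working PE in column 0.
    dist = {(r, 0): 0 for r in range(n1) if (r, 0) not in faults}
    heap = [(0, r, c) for (r, c) in dist]
    heapq.heapify(heap)
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale queue entry
        if c == n2 - 1:
            return d  # first rightmost-column PE popped is optimal
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            if not (0 <= nr < n1 and 0 <= nc < n2) or (nr, nc) in faults:
                continue
            nd = d + (0 if abs(dr) + abs(dc) == 1 else 1)
            if nd < dist.get((nr, nc), float("inf")):
                dist[(nr, nc)] = nd
                heapq.heappush(heap, (nd, nr, nc))
    return None  # no escape path exists under these assumptions


# One faulty PE in a single row is skipped with one bypass link of length 2.
print(min_bypass_escape_path(1, 5, {(0, 2)}, horiz=(1, 2), vert=(1,)))
# Two adjacent faults block a row whose longest bypass link has length 2.
print(min_bypass_escape_path(1, 5, {(0, 2), (0, 3)}, horiz=(1, 2), vert=(1,)))
```

A binary heap is used here for simplicity, so the sketch does not attempt to match the running time stated above.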

Conclusions and open problems

In this paper we have completely characterized catastrophic fault patterns for mesh networks. We proved that if $F$ is catastrophic with respect to $M$, then the cardinality of $F$ satisfies $|F| \ge \max\{N_1/v_l,\, N_2/g_k\}\, v_l g_k$ when $v_l$ divides $N_1$ and $g_k$ divides $N_2$. Necessary and sufficient conditions are given for a fault pattern to be catastrophic with respect to link redundancy $G=(g_1,g_2,\ldots,g_k \mid v_1,v_2,\ldots,v_l)$. Given a mesh network with a link redundancy $G$, an important problem is to count the number of catastrophic fault

Soumen Maity is currently an Assistant Professor in the Department of Mathematics of the Indian Institute of Technology, Guwahati, India. He received his Ph.D. degree in Combinatorics from the Indian Statistical Institute, Kolkata, India, in 2002. His research interests include Combinatorial methods in Statistics and VLSI Designs, and Cryptography.

Amiya Nayak received his B.Math. degree in Computer Science and Combinatorics & Optimization from the University of Waterloo in 1981, and his Ph.D. in Systems and Computer Engineering from Carleton University in 1991. He has over 17 years of industrial experience, working at CMC Electronics (formerly known as Canadian Marconi Company), Defence Research Establishment Ottawa (DREO), EER Systems and Nortel Networks, in software engineering, avionics and navigation systems, simulation and system level performance analysis. He has been an Adjunct Research Professor in the School of Computer Science at Carleton University since 1994. He was the Book Review and Canadian Editor of VLSI Design from 1996 to 2002. He is on the Editorial Board of International Journal of Parallel, Emergent and Distributed Systems and International Journal of Computers, Information Technology & Engineering, and is the Associate Editor of International Journal of Computing and Information Science. Currently, he is a Full Professor at the School of Information Technology and Engineering (SITE) at the University of Ottawa. His research interests are in the areas of fault tolerance, distributed systems/algorithms, and mobile ad hoc networks, with over 100 publications in refereed journals and conference proceedings.

Sundarkumar Ramsundar is currently pursuing his B.Tech. degree in Computer Science and Engineering at the Indian Institute of Technology, Guwahati, India. His research interests include Fault-Tolerant Computing, Robust System Design for Emerging Nanotechnologies, and Design and Analysis of Algorithms.
