An Architecture for High Availability Multi-user Systems

https://doi.org/10.1016/S0140-3664(97)00007-8

Abstract

This paper describes the fault tolerance features of a multiprocessor system called SPAX (Scalable Parallel Architecture based on X-bar network). SPAX is intended as a cost-effective, reliable multiprocessor system for both scientific and business applications. The system can be composed of up to sixteen clusters. Each cluster consists of eight nodes, which can be any combination of processing nodes, input/output nodes and communication nodes. The system is designed to eliminate potential single points of failure such as loss of a processor, loss of a network, or loss of a disk drive. Xcent-Net, a duplicated hierarchical crossbar interconnection network built into the system, supports dual paths to every node with high bandwidth and low latency. Each node is designed to support multi-level fault tolerance, enabling a user to choose the level of fault tolerance at a possible cost in resources or performance. The system has been implemented at ETRI.

Introduction

High-performance parallel computing machines have started taking the role of servers in distributed computing environments. In the modern parallel computing industry, clustered shared memory multiprocessing has become more popular than massively parallel processing [1]. This is due to the advantages that clustered shared memory processing can offer when supported by reliable high-speed communication. One of the major issues in designing such a system is how to provide fault tolerance features with an acceptable increase in cost and with minimal performance degradation.

Since 1994, a multiprocessor system called SPAX (Scalable Parallel Architecture based on X-bar network) has been developed at ETRI. It is intended as a cost-effective, reliable multiprocessor system for both scientific and business applications, especially on-line transaction processing (OLTP) applications. In such applications, reliability is a key requirement that may affect the entire system design [2, 3, 4]. SPAX can be composed of up to sixteen clusters. Each cluster consists of eight nodes, which can be any combination of processing nodes (PNs), input/output nodes (IONs) and communication connection nodes (CCNs). Each node contains four Intel Pentium™ processors and a locally shared memory. A crossbar network interconnects the PNs, IONs and CCNs within a cluster, and the clusters are interconnected by a crossbar network as well. In a multiprocessor system consisting of multiple processors connected by an interconnection network, the choice of fault detection mechanism is critical to achieving the high fault coverage required for reliable diagnosis and recovery procedures [5, 6, 7]. Because the memory in a processing node is locally shared, early detection and minimizing error propagation are major concerns in the design for fault tolerance. The interconnection network must also be fault-tolerant; that is, the nodes must remain connected in the presence of a fault in a network component. The interconnection network affects not only the performance but also the reliability of the entire multiprocessor system. Fig. 1 shows the structure of SPAX; the cluster shown in the figure is a typical example.
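The configuration limits described above imply at most 16 × 8 = 128 nodes and, with four processors in each node, at most 512 processors. The following minimal Python sketch works through that arithmetic and illustrates the requirement that nodes remain connected after a fault in a network component. The constant and function names, and the simplified model of the network as two independent copies, are illustrative assumptions and are not taken from the SPAX design.

```python
# Illustrative sketch only; names and the two-copy network model are assumptions.
MAX_CLUSTERS = 16        # "up to sixteen clusters"
NODES_PER_CLUSTER = 8    # "each cluster consists of eight nodes"
CPUS_PER_NODE = 4        # four processors in each node

max_nodes = MAX_CLUSTERS * NODES_PER_CLUSTER
max_cpus = max_nodes * CPUS_PER_NODE


def nodes_connected(network_copy_ok):
    """Return True if at least one copy of the duplicated network is working.

    network_copy_ok is a list of booleans, one per network copy.
    """
    return any(network_copy_ok)


print(max_nodes)                        # 128
print(max_cpus)                         # 512
print(nodes_connected([True, False]))   # True: one copy failed, the other still works
print(nodes_connected([False, False]))  # False: both copies failed
```

This deliberately ignores partial failures inside a network copy; it only shows why duplication removes the network as a single point of failure.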

In this paper, practical hardware approaches to fault tolerance adopted in the design of SPAX are presented. SPAX is designed to eliminate potential single points of failure such as loss of a processor, loss of a network or loss of a disk drive. It allows the user to configure a system in which critical components are duplicated within a cluster. The fault tolerance design concept of the system architecture is presented in Section 2. Several interconnection structures are known to be good candidates for a reliable interconnection network. One example is the fat tree network, which provides inherent redundant paths between nodes [8]. In Section 3, we analyze the reliability of various tree-type interconnection networks based on crossbar switches under the physical packaging constraint and present a duplicated hierarchical crossbar interconnection network called Xcent-Net. Another important component affecting the system reliability is the computing node. In order to support multi-level fault tolerance according to the user's requirements, three different levels of fault tolerance capabilities are provided in each node. The design and implementation of the node are presented in Section 4. Other fault tolerance features, such as diagnosis and redundant paths to disks, are discussed in Section 5. Finally, in Section 6 we present our conclusions.

Section snippets

Fault tolerance design

SPAX is designed to be a flexible and cost-effective product which can be used for multi-user systems requiring high availability. It is a scalable parallel processing machine based on UNIX. Objective features of the system include:

  • 1. High-speed processing: the transaction processing speed is 10 000 tpmC (transactions per minute under the Transaction Processing Performance Council benchmark C).

  • 2. Flexible modular structure: processing power, memory and peripherals can be added without affecting applications software. A

Duplicated interconnection network

An efficient message passing interface between nodes is essential to system performance. High latency for memory references causes drastic deterioration in the performance of a multiprocessor system. A failure of the message passing interface may lead to system partitioning, which makes all or part of the system inaccessible to users. In this section, we present a reliable low-latency interconnection network based on crossbar switches and discuss the trade-off between reliability and

Reliability-enhanced processing node

In order to support multi-level fault tolerance according to the user's requirements, the following three different levels of fault tolerance capabilities are provided.

  • 1. Fault avoidance: coding is one of the most important techniques for supporting fault tolerance in this level. Basic mechanisms such as ECC, parity and instruction retry are extensively adopted. The system design includes error detection in some areas but not in others. As a result, error recovery may not be possible in all cases.

  • 2.
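The parity mechanism named under fault avoidance can be illustrated with a minimal Python sketch. It uses even parity over an 8-bit value. The function names are hypothetical and the code is not the SPAX hardware logic. Parity on its own detects a single-bit error but cannot correct it, which is one reason detection does not always allow recovery.

```python
# Illustrative sketch only: even parity over an 8-bit value; not the SPAX hardware logic.
def parity_bit(byte):
    """Return the even-parity bit for an 8-bit value (1 if the count of 1 bits is odd)."""
    return bin(byte & 0xFF).count("1") % 2


def check(byte, stored_parity):
    """Return True if the byte still matches the parity bit stored with it."""
    return parity_bit(byte) == stored_parity


word = 0b10110010              # four 1 bits, so the parity bit is 0
p = parity_bit(word)
corrupted = word ^ 0b00000100  # flip a single bit

print(p)                       # 0
print(check(word, p))          # True
print(check(corrupted, p))     # False: the single-bit error is detected
```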

Other fault tolerance features

All disks are equipped with dual ports. To increase the reliability of data on a disk, a redundant array of inexpensive disks (RAID) with dual controllers is used. The system maintains two independent access paths to each disk and supports disk mirroring, as shown in Fig. 8. In case of a disk failure, the data can be retrieved by accessing the mirrored disk.
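The mirrored-read behaviour described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Disk` class and `mirrored_read` function are hypothetical names, and the sketch models only the fallback to the mirrored copy, not the dual ports or dual controllers.

```python
# Illustrative sketch only: class and function names are hypothetical.
class Disk:
    def __init__(self, blocks, failed=False):
        self.blocks = blocks    # maps block number -> data
        self.failed = failed

    def read(self, block):
        if self.failed:
            raise IOError("disk failure")
        return self.blocks[block]


def mirrored_read(primary, mirror, block):
    """Read a block from the primary disk, falling back to its mirror."""
    try:
        return primary.read(block)
    except IOError:
        # The primary copy is unavailable, so retrieve the data from the mirror.
        return mirror.read(block)


data = {0: b"record"}
primary = Disk(dict(data), failed=True)   # simulate a failed disk
mirror = Disk(dict(data))
print(mirrored_read(primary, mirror, 0))  # b'record'
```

If both copies have failed, the exception from the mirror propagates to the caller, which matches the fact that mirroring tolerates the loss of one disk, not both.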

The diagnosis subsystem block, which locates faulty components before boot-up and during operation, can be divided into the following

Conclusions

A cost-effective solution for high-performance multi-user systems requiring high-availability was presented. We described hardware approaches to fault tolerance in major components of a multiprocessor system called SPAX.

The system is designed to eliminate potential single points of failure and to support multi-level fault tolerance, enabling a user to choose the level of fault tolerance. Based on the design concept presented in this paper, Xcent-Net, which is a duplicated hierarchical crossbar

Acknowledgements

This work was performed as part of the 'Highly Parallel Computer Development Project' funded by MIC, Korea. The authors would like to thank the many researchers involved in this project, especially the members of the Processor Section, for their work on Xcent-Net and MPE development.

References (16)

  • J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, pp. 635–693,...
  • J. Gray and D.P. Siewiorek, High-Availability Computer Systems, IEEE Computer, pp. 39–48,...
  • J. Gray, Why Do Computers Stop and What Can Be Done About It?, 3rd Symp. on Reliability in Distributed S/W and DB, pp....
  • J.-C. Laprie et al., Definition and Analysis of Hardware and Software Fault-Tolerant Architectures, IEEE Computer...
  • P.A. Bernstein, Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing, IEEE Computer...
  • V. Nelson, Reliability Modeling and General Redundancy Techniques, Chapter 2 of IEEE Computer Society Tutorial Notes,...
  • D.P. Siewiorek, Fault Tolerance in Commercial Computers, IEEE Computer (1990)...
  • C.E. Leiserson, Fat-trees: universal networks for hardware-efficient supercomputing, IEEE Trans. Comput., 34 (1985)...
There are more references available in the full text version of this article.
