Design of a scalable multiprocessor architecture and its simulation

doi:10.1016/S0164-1212(01)00034-6

Journal of Systems and Software

Volume 58, Issue 2, 1 September 2001, Pages 135-152

Abstract

Performance enhancement and system scalability are two of the most important issues in the design of multiprocessor systems. A scalable cluster-based multiprocessor architecture and its simulation environment, called SEECMA, are proposed. Several new issues are considered in our architecture, including scalable cache coherence protocols, relaxed memory consistency models, memory optimization techniques and several types of processors. The architecture has been developed to meet current trends in clustering architecture design. Additionally, the SEECMA environment is presented as a helpful investigation tool for both education and research. In addition to many simulation options, it provides a user-friendly graphic interface. SEECMA can automatically collect data from several simulation runs and display the results for comparison. So far, we have evaluated several vital issues of cluster-based multiprocessors on SEECMA, including effective prefetching and replacement policies, and optimization of migratory sharing using both hardware and software mechanisms. On average, these enhance system performance by about 8%, 9% and 7%, respectively. Our cluster-based multiprocessor architecture also scales more readily than the current general, or cluster-based, multiprocessor environments.

Introduction

The gap between the computing power of microprocessors and that of the largest supercomputers is shrinking, while the price/performance advantage of microprocessors is increasing. This clearly points to using microprocessors as the computation engines in large multiprocessor systems. The challenge lies in building a machine that can scale up its performance while maintaining the initial price/performance advantage of the individual processors. Scalability allows a parallel architecture to leverage commodity microprocessors and small-scale multiprocessors to build larger-scale machines. These larger machines offer substantially higher performance, which provides the impetus for programmers to port their sequential applications to parallel architectures instead of waiting for the next higher-performance uniprocessor. Rapid progress in VLSI and packaging technologies has become a driving force in the design and development of highly scalable parallel systems that use a cluster-based architecture and simultaneously exploit local communication. Commercial supercomputers are already built with a cluster-based architecture, such as the CONVEX SPP series (CONVEX, 1994). Meanwhile, system designers must rely upon a convenient and accurate simulation environment to verify their designs or determine the most cost-effective strategies. Our cluster multiprocessor and its simulation environment, called simulation and evaluation environment for cluster-based multiprocessor architecture (SEECMA), are designed to address the above issues. SEECMA is the outcome of follow-up work on simulation and evaluation environment for shared-memory multiprocessor architecture (SEESMA) (Wu et al., 1998c).

Our architecture is designed for high performance and scalability. Our main purpose is to construct effective memory and interconnection architectures. After our evaluation, we found that our architecture has the following desirable attributes.

1. A multithreaded processor architecture in a cluster-based system is possible, which is more effective when combined with our proposed mechanisms.
2. Several memory consistency models (Wu, 1998a, Wu and Chen, 1998b) to improve the parallelism of the system were constructed, all resulting in improved system performance.
3. We propose several new cache coherence protocols and directory structures in the linked-base type of cluster-based architecture, and these also perform better than existing protocols. We also demonstrate that some cache coherence protocols that perform well in non-cluster-based systems perform poorly in cluster-based systems.
4. Several new prefetching and replacement policies are presented to improve the efficiency of the cache system, and simulation results prove these to be effective.
5. With the assistance of these effective mechanisms, we find our architecture has high scalability when evaluating performance.

As SEECMA is extended from SEESMA (Wu et al., 1998c), the kernel simulator is the MINT package (Veenstra and Fowler, 1994), which was originally developed at the University of Rochester. SEECMA aims to provide a simulation and evaluation environment for cluster-based multiprocessor systems. It supports the following simulation functions:

1. two types of processor architectures;
2. two-level caches with write caches (WCs);
3. a message-passing-based interconnection network;
4. four types of memory consistency models;
5. five types of cache coherence protocols;
6. three types of cache coherence directory structures;
7. effective replacement policies; and
8. effective prefetching schemes.

SEECMA is equipped with versatile simulation options and users may customize the target memory architecture by clicking on the appropriate buttons in a user-friendly X-windows interface.

Users are provided with either a menu-based or a graph-based input environment through a Graphical User Interface (GUI). The menu-based interface is similar to conventional window functions, while the graph-based interface is more convenient for beginners. Users move the cursor around the architectural components and click on them to select the required simulation options. Each time a simulation option is selected, the graph is updated. On-line help is also supported. In addition, SEECMA automatically collects numerical data from several simulation runs and displays the corresponding statistical graphs. So far, both bar charts and line charts are available. Using these graphs, users can compare the performance of different architectural parameters.

SEECMA, with its abundant simulation features, serves as a valuable research environment for system designers who are interested in cluster-based multiprocessor architecture. We have used SEECMA on many research issues related to memory subsystems, including cache coherence protocols, memory consistency models, interconnection networks, cache hierarchies, migratory sharing (Su et al., 1996), as well as prefetching and replacement policies (Pean et al., 1998a).

As a result of simulation, we found that our new mechanisms, developed in our cluster-based architecture, performed much better than current methods. We illustrate some of them in the following. Note that all the comparisons, including non-cluster system architecture and DASH (Lenoski and Laudon, 1989) system architecture, are made based on the simulations performed using SEECMA. Our inter-clustered cache prefetching mechanism performs, on average, about 8% better than the original non-inter-clustered cache system for benchmarks in the SPLASH benchmark suite (Singh et al., 1992; Woo et al., 1995). Our effective replacement policy performs, on average, about 9% better than the original replacement policy. We propose a migratory-sharing optimization mechanism; the total system performs 7% better, on average, than the system without our optimization. Other effective mechanisms also perform well, and we describe them in detail in the following sections.

The rest of the paper is organized as follows. Section 2 gives an overview of the overall system architecture. Section 3 describes the main system features and design issue considerations. An overview of the SEECMA simulation and evaluation environment is given in Section 4. Section 5 evaluates the performance of our architecture on SEECMA, and Section 6 gives our conclusions and outlines future work.

Section snippets

Overall system architecture

Our system architecture is a cluster-based distributed shared-memory multiprocessor system. The general architecture of our target machine is shown in Fig. 1. Within our clustering architecture are multiple cluster nodes interconnected by a k-ary n-cube network. Each cluster node contains local shared-memory, an inter-cluster cache, a few processor environments (PEs), and a local bus.
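The k-ary n-cube topology mentioned above can be illustrated with a minimal Python sketch. It is not taken from SEECMA: the function names `kary_ncube_nodes` and `kary_ncube_neighbors` are hypothetical, and the sketch assumes the usual definition in which each cluster node is labelled by n coordinates in the range 0 to k-1 and is linked to the nodes that differ by one step, with wrap-around, in exactly one dimension.

```python
from itertools import product


def kary_ncube_nodes(k, n):
    """Return all node labels of a k-ary n-cube: n coordinates, each in range(k)."""
    return list(product(range(k), repeat=n))


def kary_ncube_neighbors(node, k):
    """Return the nodes that differ from `node` by +/-1 (mod k) in one dimension."""
    neighbors = set()
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            other = list(node)
            other[dim] = (coord + step) % k  # wrap-around link in this dimension
            neighbors.add(tuple(other))
    neighbors.discard(node)  # a node is never its own neighbor (matters only for k == 1)
    return sorted(neighbors)


if __name__ == "__main__":
    # Hypothetical example: 16 cluster nodes arranged as a 4-ary 2-cube.
    nodes = kary_ncube_nodes(4, 2)
    print(len(nodes), kary_ncube_neighbors((0, 0), 4))
```

Under this definition a k-ary n-cube has k^n cluster nodes, and for k greater than 2 every node has 2n neighbors, so the number of links per node grows with the dimension n rather than with the total number of clusters.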

The inner structure of a cluster node is shown in Fig. 1(b), and either MIPS R3000 processors or PMPs (Hirata et

Cluster-based multiprocessor and interconnection network

A cluster-based multiprocessor system has multiple cluster nodes interconnected through a k-ary n-cube network, where each cluster node is assembled by several PEs linked by a local bus. Each cluster node is a small-scale multiprocessor system and multiple clusters form a large-scale system. This clustering architecture benefits from both small- and large-scale architectures, and hence is extremely attractive. The number of cluster nodes and PEs per node is specified by the users. If there is a

Basic Structure of SEECMA

SEECMA is programmed in C and implemented on a SUN workstation running under UNIX System V. To facilitate future extensions, the whole program is developed in a modular structure. The front-end of SEECMA is a memory reference generator supported by MINT, and the back-end is a memory subsystem simulator. As a general tool, SEECMA allows the user to specify and simulate his/her own designs by linking in his/her specific modules. Fig. 9 gives an overview of SEECMA.

The memory reference generator

Preliminary performance evaluations of our architecture on SEECMA

In this section, we give a complete illustration of how to use SEECMA to evaluate several design issues of cluster-based multiprocessor systems, including ICP, effective replacement schemes and migratory sharing with both software and hardware approaches. Some reasonable assumptions about the evaluation environment are summarized in Table 2. The memory page size is 4 Kbytes and is mapped to the local memories in a round-robin fashion.
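The round-robin page placement can be sketched as follows. This is only an illustration under stated assumptions: the text gives the 4 Kbyte page size and the round-robin policy, while the function name `home_cluster`, the use of the page number modulo the number of cluster nodes, and the cluster count in the example are assumptions made here.

```python
PAGE_SIZE = 4 * 1024  # 4 Kbytes, as stated in the text


def home_cluster(address, num_clusters):
    """Return the cluster node whose local memory holds `address`.

    Assumption: consecutive pages are assigned to consecutive cluster
    nodes, wrapping around after the last node (round-robin).
    """
    page_number = address // PAGE_SIZE
    return page_number % num_clusters


if __name__ == "__main__":
    # Hypothetical system with 4 cluster nodes.
    for addr in (0x0000, 0x0FFF, 0x1000, 0x4000, 0x5000):
        print(hex(addr), "->", home_cluster(addr, 4))
```

With this placement, neighbouring pages fall in different local memories, which spreads accesses to a large contiguous data structure across the cluster nodes.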

Some benchmarks were chosen from SPLASH for experimental

Conclusion and future work

System designers usually rely on simulation to verify their conceptual designs and to understand the interaction among the system components. We have constructed a simulation and evaluation environment called SEECMA, intended for research and education on cluster-based multiprocessor systems. It provides versatile simulation functions in an integrated environment with a user-friendly graphical interface. The primary simulation options include:

(1) two CPU types – RISC and PMP architecture;
(2)

Acknowledgements

This research was supported by the National Science Council of the Republic of China under contract number NSC 87-2213-E009-049.

References (30)

F Dahlgren et al.
Using write caches to improve performance of cache coherence protocols in shared-memory multiprocessors
Journal of Parallel and Distributed Computing
(1995)
H Grahn et al.
Implementation and evaluation of update-based cache protocols under relaxed memory consistency models
Future Generation Computer Systems
(1995)
J Boyle et al.
Portable Programs for Parallel Processors
(1987)
CONVEX, 1994. Computer Corporation, CONVEX Exemplar Architecture, 2nd ed. CONVEX Press, Texas,...
D.E Culler et al.
Parallel Computer Architecture: A Hardware/Software Approach
(1999)
W.J Dally
Performance analysis of k-ary n-cube interconnection networks
IEEE Transactions on Computers
(1990)
David, B.G., 1995. Design and analysis of updated-based cache coherence protocols for scalable shared-memory...
Dubois, M., Scheurich, C., Briggs, F., 1986. Memory access buffering in multiprocessors. In: Proceedings of the 13th...
Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., Hennessy, J., 1990. Memory Consistency and event...
S Gjessing et al.
The SCI cache coherence protocol
Goodman, J.R., 1989. Cache consistency and sequential consistency. Computer Sciences Technical Report #1006, Computer...
A Gupta et al.
Cache invalidation patterns in shared-memory multiprocessors
IEEE Transactions on Computers
(1992)
Hirata, H., Kimura, K., Nagamine, S., Mochizuki, Y., 1992. An elementary processor architecture with simultaneous...
IEEE SCI, 1992. IEEE SCI draft 2.00: SCI Scalable Coherence Interface, Draft Document for the IEEE SCI...
L Lamport
How to make a multiprocessor computer that correctly executes multiprocessor programs
IEEE Transactions on Computers C
(1979)

Cheng Chen is a professor in the Department of Computer Science and Information Engineering at National Chiao Tung University, Taiwan, ROC. He received his B.S. degree from the Tatung Institute of Technology, Taiwan, ROC in 1969 and M.S. degree from the National Chiao Tung University, Taiwan, ROC in 1971, both in electrical engineering. Since 1972, he has been on the faculty of National Chiao Tung University, Taiwan, ROC. From 1980 to 1987, he was a visiting scholar at the University of Illinois at Urbana Champaign. During 1987 and 1988, he served as the chairman of the Department of Computer Science and Information Engineering at the National Chiao Tung University. From 1988 to 1989, he was a visiting scholar of the Carnegie Mellon University (CMU). Between 1990 and 1994, he served as the deputy director of the Microelectronics and Information Systems Research Center (MIRC) in National Chiao Tung University. His current research interests include computer architecture, parallel processing system design, parallelizing compiler techniques, and high performance video server design.

Der-Lin Pean is a Ph.D. candidate in Computer Science and Information Engineering at the National Chiao Tung University, Taiwan, ROC. He received his B.S. degree in Information and Computer Engineering at Chung Yuan Christian University, Taiwan, ROC. He served as a lecturer in the Department of Computer and Information Engineering, as well as Employment and Vocational Training Administration, Council of Labor Ministry, Taiwan, ROC. His current research interests include computer architecture, personal computer system architecture design, parallel processing system design, parallelizing compiler techniques, and microprocessor system design.

Chao-Chin Wu was born on 26 February 1968 in Taichung County, Taiwan, Republic of China. He received the B.S. degree in Computer Science and Engineering from Tatung Institute of Technology, Taiwan, in 1990, and the M.S. degree in Computer Science and Information Engineering from National Chiao Tung University, Taiwan, 1992. He received the Ph.D. degree in Electrical Engineering and Computer Science from National Chiao Tung University, Taiwan, 1998. His research interests include computer architecture and parallel processing.

Huey-Ting Chua received the B.S. and M.S. degrees in 1997 and 1999, respectively, both in Computer Science and Information Engineering from National Chiao Tung University, Taiwan. Her major research interest is parallel compilers.
