Parallel Computing

Volume 20, Issue 5, May 1994, Pages 729-751

A multiprocessor architecture combining fine-grained and coarse-grained parallelism strategies

https://doi.org/10.1016/0167-8191(94)90003-5

Abstract

A wide variety of computer architectures have been proposed that attempt to exploit parallelism at different granularities. For example, pipelined processors and multiple-instruction-issue processors exploit the fine-grained parallelism available at the machine-instruction level, while shared-memory multiprocessors exploit the coarse-grained parallelism available at the loop level. Using a register-transfer-level simulation methodology, this paper examines the performance of a multiprocessor architecture that combines both coarse-grained and fine-grained parallelism strategies to minimize the execution time of a single application program. The simulations indicate that the best system performance is obtained with a mix of fine-grained and coarse-grained parallelism in which any number of processors can be used, but each processor should be pipelined to a degree of 2 to 4, or should be capable of issuing 2 to 4 instructions per cycle. These results suggest that current high-performance microprocessors, which typically can execute 2 to 4 instructions simultaneously, may be excellent components from which to construct a multiprocessor system.
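
The two granularities the abstract combines can be illustrated with a minimal sketch (not taken from the paper; the thread count, array size, and loop body are illustrative assumptions): the outer loop's iterations are divided among processors, which is coarse-grained, loop-level parallelism, while the independent statements inside each iteration are work that a pipelined or 2-to-4-issue processor can overlap in a single cycle, which is fine-grained, instruction-level parallelism.

/* Illustrative sketch only: coarse-grained parallelism across threads,
 * fine-grained parallelism within each loop body.
 * Compile with: cc sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N        1024
#define NTHREADS 4            /* hypothetical number of processors */

static double a[N], b[N], c[N], d[N];

static void *worker(void *arg) {
    long id = (long)arg;
    /* Coarse grain: each thread (processor) takes a contiguous
     * slice of the loop iterations. */
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    for (long i = lo; i < hi; i++) {
        /* Fine grain: these two statements are independent, so a
         * pipelined or multiple-issue processor can execute them
         * in overlapping cycles. */
        c[i] = a[i] * b[i];
        d[i] = a[i] + b[i];
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("c[5]=%g d[5]=%g\n", c[5], d[5]);
    return 0;
}

Under the paper's conclusion, a system like this would gain little from making each processor issue many more than 4 instructions per cycle, since the remaining fine-grained parallelism within each iteration is limited; adding processors (NTHREADS here) scales the coarse-grained dimension instead.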
