Parallel Computing

Volume 20, Issue 5, May 1994, Pages 729-751

A multiprocessor architecture combining fine-grained and coarse-grained parallelism strategies

https://doi.org/10.1016/0167-8191(94)90003-5

Abstract

A wide variety of computer architectures have been proposed that attempt to exploit parallelism at different granularities. For example, pipelined processors and multiple-instruction-issue processors exploit the fine-grained parallelism available at the machine-instruction level, while shared-memory multiprocessors exploit the coarse-grained parallelism available at the loop level. Using a register-transfer-level simulation methodology, this paper examines the performance of a multiprocessor architecture that combines both coarse-grained and fine-grained parallelism strategies to minimize the execution time of a single application program. The simulations indicate that the best system performance is obtained with a mix of fine-grained and coarse-grained parallelism in which any number of processors can be used, but each processor should be pipelined to a degree of 2 to 4, or should be capable of issuing 2 to 4 instructions per cycle. These results suggest that current high-performance microprocessors, which typically can execute 2 to 4 instructions simultaneously, may be excellent components from which to construct a multiprocessor system.
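
The two granularities the abstract combines can be illustrated with a minimal sketch (not taken from the paper; the thread count, array size, and loop body are illustrative assumptions): the outer loop's iterations are divided among processors, which is coarse-grained, loop-level parallelism, while the independent statements inside each iteration are work that a pipelined or 2-to-4-issue processor can overlap in a single cycle, which is fine-grained, instruction-level parallelism.

/* Illustrative sketch only: coarse-grained parallelism across threads,
 * fine-grained parallelism within each loop body.
 * Compile with: cc sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define N        1024
#define NTHREADS 4            /* hypothetical number of processors */

static double a[N], b[N], c[N], d[N];

static void *worker(void *arg) {
    long id = (long)arg;
    /* Coarse grain: each thread (processor) takes a contiguous
     * slice of the loop iterations. */
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    for (long i = lo; i < hi; i++) {
        /* Fine grain: these two statements are independent, so a
         * pipelined or multiple-issue processor can execute them
         * in overlapping cycles. */
        c[i] = a[i] * b[i];
        d[i] = a[i] + b[i];
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("c[5]=%g d[5]=%g\n", c[5], d[5]);
    return 0;
}

Under the paper's conclusion, a system like this would gain little from making each processor issue many more than 4 instructions per cycle, since the remaining fine-grained parallelism within each iteration is limited; adding processors (NTHREADS here) scales the coarse-grained dimension instead.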
