Parallel Computing

Volume 27, Issue 9, August 2001, Pages 1197-1225

D3-Machine: A decoupled data-driven multithreaded architecture with variable resolution support

https://doi.org/10.1016/S0167-8191(01)00083-7

Abstract

This paper presents the Decoupled Data-Driven machine (D3-machine), a multithreaded architecture with data-driven synchronization. The D3-machine is an efficient and cost-effective design that combines the advantages of data-driven synchronization with those of Instruction Level Parallelism (ILP). Two major design ideas are utilized by the proposed model: the asynchronous execution of synchronization and computation operations, and multithreaded graphs with variable resolution. The guiding principle in the generation of the threads is to fully exploit the ILP capabilities of the target processor. The entire dynamic Data-Flow (DF) graph is mapped by a one-to-one function onto the virtual space of the machine. Thus, the traditional DF graph (synchronization) operations of token matching and token formatting/routing are reduced to memory access operations. This allows us to utilize the dynamic DF principles, which expose maximal parallelism, for thread scheduling at minimal hardware cost. A combination of deterministic and stochastic simulation experiments shows that the D3-machine has the necessary attributes for efficient parallel processing: it can tolerate long latencies, exploit parallelism, and also benefit from locality. Furthermore, by decoupling the synchronization portion of a thread from the computation portion, the D3-machine effectively neutralizes the overhead associated with dynamic DF scheduling.

Introduction

Multithreading has emerged as one of the most promising and exciting approaches for the exploitation of parallelism. It utilizes techniques developed in several independent research directions such as Data-Flow (DF), RISC, compilation for Instruction Level Parallelism (ILP) and dynamic resource management [1]. Multithreading provides hardware support for more than one concurrent program counter and the ability to switch among them with some efficiency [2]. A thread of control is very similar to the notion of a process in multiprogramming. The main difference is that a thread in a multithreaded machine is visible at the architecture level [3].

The Decoupled Data-Driven machine (D3-machine) is a multithreaded architecture that employs data-driven sequencing based on the Decoupled Data-Driven (D3) model [4]. A thread can be scheduled for execution only after all its input data have been produced and have arrived at the level of the memory hierarchy that is closest to the execution unit. Once a thread is scheduled for execution in the D3-machine, it proceeds without interruption. The D3 model of execution decouples (separates) each thread (actor) into two parts: the graph (or synchronization) portion and the computation portion. The computation portion of each actor is a collection of conventional instructions (load/store, add, etc.). The graph portion contains information about the executability of the actor and its consumers. Thus, a D3 graph can be viewed as a partially ordered conventional program with a data-dependency graph superimposed on it. The graph synchronization is handled by the Data-Flow Graph Engine (DFGE), whereas the computation instructions are executed by the Computation Engine (CE). The two engines execute in a decoupled, i.e., asynchronous, mode. This helps to reduce the length of the processor's critical path.

The abstract model of execution of the D3-machine has as its starting point the dynamic DF graphs that are based on the U-Interpreter model [5]. The basic unit of computation, however, is the thread (macro-actor in the DF terminology) and not the instruction. We have developed extensions to the basic model, such as hierarchical matching and variable resolution support, that make it more suitable for multithreaded execution. The entire dynamic DF graph is mapped onto the logical space (virtual memory) of the machine. Therefore, for each instantiation of a thread, we know at execution time where its data will reside, using standard virtual-memory translation techniques. For each thread there is a synchronization point, in the logical space, that keeps track of the number of inputs that have been produced. Therefore, all DF synchronization operations can be implemented as references/updates to the logical space. These references are mapped into the physical space using conventional virtual-space mapping techniques.
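To make this mechanism concrete, the following is a minimal sketch, not the authors' implementation, of how DF synchronization reduces to memory accesses: each thread instantiation owns a slot in the logical space, the DFGE bumps a counter when an input arrives, and the thread is handed to the CE once the counter reaches the thread's arity. All identifiers (sync_slot, dfge_signal, ready_queue) and the fixed-size queue are illustrative assumptions.

```c
/* A minimal sketch, assuming a flat array of synchronization slots;
 * token matching becomes a read-modify-write on the thread's slot. */
#include <stdio.h>

#define N_THREADS 3

struct sync_slot {
    int arity;      /* number of inputs the thread needs */
    int received;   /* inputs that have arrived so far   */
};

static struct sync_slot slots[N_THREADS] = {
    { .arity = 2 }, { .arity = 1 }, { .arity = 2 }
};

static int ready_queue[N_THREADS];
static int ready_count = 0;

/* Called by the DFGE when a producer writes a result for thread `tid`. */
static void dfge_signal(int tid)
{
    struct sync_slot *s = &slots[tid];
    if (++s->received == s->arity)          /* all inputs present  */
        ready_queue[ready_count++] = tid;   /* schedule for the CE */
}

int main(void)
{
    dfge_signal(0);
    dfge_signal(1);   /* thread 1 needs a single input: ready now */
    dfge_signal(0);   /* second input for thread 0: ready now     */

    for (int i = 0; i < ready_count; i++)
        printf("thread %d ready for the CE\n", ready_queue[i]);
    return 0;
}
```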

The computation element of each PE of the D3-machine is a high-performance microprocessor that supports some form of ILP. However, in order to achieve efficient parallel processing it is not sufficient to simply connect a large number of von Neumann processors; the latency and synchronization issues [6] must also be resolved. The D3-machine tolerates long latency and synchronization costs by employing data-driven thread scheduling. Furthermore, it exploits locality by varying the length of the threads to match the machine and program characteristics for best performance.

Several deterministic and stochastic simulation experiments have been conducted in order to assess the D3-machine's ability to tolerate latency and to exploit parallelism and locality. The results, presented in this paper, indicate that the D3-machine can indeed tolerate long latencies. For example, increasing the communication latency from 1 to 5 cycles results in an increase in execution time of no more than 25%; going from 5 to 15 cycles increases the execution time by 37%. Furthermore, it does exploit locality: the simulation results indicate that increasing the thread length (TL) does indeed reduce execution time. Finally, the experiments have shown that the D3-machine for the most part neutralizes the overhead associated with data-driven synchronization: a fivefold increase in the processing time per thread in the DFGE resulted in an increase of the overall execution time of no more than 15%.

In summary, the main features of the D3-machine include its ability to:

  • Tolerate long latencies and synchronization delays by using the dynamic DF principles of execution for thread scheduling and synchronization.

  • Neutralize the overhead associated with dynamic DF scheduling by decoupling the synchronization portion of a thread from the computation portion.

  • Keep the computation portion of the processor architecture close to that of sequential machines. This enables the use of state-of-the-art microprocessors and compilation technology in the CE design.

  • Employ threads that fully exploit the internal architecture of the target processor through the use of variable resolution.

The overall goal of this paper is to demonstrate the potential of the D3-machine through simulation experiments. In Section 2, the variable-resolution graphs are introduced. The basic framework of the D3 model is presented in Section 3. Section 4 presents the decoupled architecture and addresses various implementation issues. Analytical and simulated performance analysis is presented in Section 5. Concluding remarks are presented in Section 6.

Section snippets

Multithreaded graphs with variable resolution

The generation and scheduling of threads is based on the dynamic DF principles of execution [5]. The partitioning of the DF graph into threads is done according to a number of optimizing criteria; among them are the classic ones, i.e., improving locality, reducing communication overhead and increasing the available parallelism. However, increasing the level of granularity decreases the amount of parallelism. Thus, there is a tradeoff involved in determining the level of resolution. The
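As a rough, self-contained illustration of this trade-off (under the simplifying assumption of N independent actors, e.g., iterations of a parallel loop), the sketch below groups actors into threads of length TL: a larger TL yields fewer threads to synchronize and better intra-thread locality, but since each thread runs to completion on one CE, the exploitable inter-thread parallelism drops from N to roughly N/TL.

```c
/* A sketch of the resolution trade-off, assuming N independent actors. */
#include <stdio.h>

#define N_ACTORS 12

int main(void)
{
    int TL = 4;                          /* thread length (resolution)     */
    int n_threads = (N_ACTORS + TL - 1) / TL;

    for (int a = 0; a < N_ACTORS; a++)   /* actor -> thread assignment     */
        printf("actor %2d -> thread %d\n", a, a / TL);

    printf("threads: %d (was %d at TL = 1); "
           "max inter-thread parallelism: %d (was %d)\n",
           n_threads, N_ACTORS, n_threads, N_ACTORS);
    return 0;
}
```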

Decoupled data-driven model

The basic cycle of a generic DF system can be decomposed into four stages: (1) determination of executability by operand matching, (2) instruction fetch, (3) execution of the instruction, and (4) token formatting and routing. One such configuration is depicted in Fig. 3(a). The first and fourth stages make up the graph portion of an operation (or graph overhead), while the second and third stages make up the computation portion of the operation.

The cyclic pipeline of Fig. 3(a) represents the generic
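The following sketch is a sequential approximation (in the D3-machine the two engines run asynchronously) of the split of the four stages: the DFGE performs operand matching (stage 1) and token formatting/routing (stage 4), while the CE performs instruction fetch and execution (stages 2 and 3), with the two engines communicating through ready and done queues. The toy four-thread graph and all identifiers are assumptions for illustration only.

```c
/* A schematic, sequential model of the DFGE/CE split; not the real engines. */
#include <stdio.h>

#define N 4                                    /* threads in this toy graph */

static int arity[N]    = { 0, 1, 1, 2 };       /* inputs each thread needs  */
static int received[N] = { 0, 0, 0, 0 };       /* inputs that have arrived  */
static int consumers[N][2] = { {1, 2}, {3, -1}, {3, -1}, {-1, -1} };

static int ready[N], n_ready = 0;              /* DFGE -> CE queue          */
static int done[N],  n_done  = 0;              /* CE -> DFGE queue          */

/* Stage 1 (DFGE): determination of executability by operand matching,
 * reduced here to a comparison of two counters in memory. */
static void dfge_match(int tid)
{
    if (received[tid] == arity[tid])
        ready[n_ready++] = tid;
}

/* Stages 2 and 3 (CE): fetch and execute the thread's computation portion. */
static void ce_execute(int tid)
{
    printf("CE: executing thread %d\n", tid);
    done[n_done++] = tid;
}

/* Stage 4 (DFGE): token formatting/routing, i.e., updating the consumers'
 * synchronization state and re-running the matching test. */
static void dfge_route(int tid)
{
    for (int i = 0; i < 2 && consumers[tid][i] >= 0; i++) {
        int c = consumers[tid][i];
        received[c]++;
        dfge_match(c);
    }
}

int main(void)
{
    dfge_match(0);                             /* the root thread is enabled */
    while (n_ready > 0 || n_done > 0) {
        while (n_ready > 0) ce_execute(ready[--n_ready]);
        while (n_done  > 0) dfge_route(done[--n_done]);
    }
    return 0;
}
```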

D3-Architecture: Runtime environment and execution model

The D3-machine uses a hierarchical scheduling policy: static scheduling for coarse-grain objects and dynamic (data-driven) scheduling for fine-grain objects. As mentioned earlier, a program is a collection of code blocks called context blocks, representing functions or loop bodies. Each context block comprises several threads. Each thread has a unique identification number. A group of context blocks that must be active simultaneously is called an activation block. Scheduling of activation blocks is done
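As a schematic sketch of the naming and scheduling hierarchy implied above (the field names and the placement function are assumptions, not the paper's encoding), an activation block is placed statically on a PE, while the threads inside its context blocks are left to that PE's data-driven scheduler:

```c
/* A sketch of the activation block / context block / thread hierarchy. */
#include <stdio.h>

struct thread_name {
    unsigned activation;   /* activation block: statically scheduled   */
    unsigned context;      /* context block: a function or loop body   */
    unsigned thread;       /* thread id inside the context block       */
};

/* Statically map an activation block to a processing element; everything
 * finer-grained is handled by the data-driven scheduler of that PE. */
static unsigned place_on_pe(unsigned activation, unsigned n_pes)
{
    return activation % n_pes;       /* simple round-robin placement */
}

int main(void)
{
    struct thread_name t = { .activation = 7, .context = 2, .thread = 5 };
    printf("thread (%u,%u,%u) runs on PE %u\n",
           t.activation, t.context, t.thread, place_on_pe(t.activation, 4));
    return 0;
}
```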

Deterministic simulations

The goal of the experimental performance evaluation phase of the work was to provide a proof-of-concept by demonstrating that the D3-machine: (i) is scalable; (ii) can tolerate large communication and synchronization latencies; (iii) exploits locality; and (iv) provides an efficient platform for implementing data-driven thread scheduling. The D3 simulation facility gives the user several degrees of freedom through the following user-defined simulation parameters:

  • TL: Thread length; Number of

Related work

The D3-machine is a non-blocking multithreading machine that has its origins in dynamic DF sequencing. Most of the basic features of the D3-machine (decoupling of synchronization and computation, use of the virtual space for data-driven synchronization, and presence in the cache as a necessary condition for scheduling) have been proposed in our earlier work [4]. The D3-machine is the evolution and integration of these ideas into a multithreaded machine. Furthermore, in this paper we present

Concluding remarks and future work

The D3-machine described here is a multithreaded machine that supports dynamic DF scheduling. The D3-machine decouples the synchronization portion of a thread from the computation portion.

The decoupled model was conceived for ease of implementation, and it adopts a large portion of the state of the art in processor and compiler technology. Experimental performance evaluation has demonstrated that the D3-machine provides an efficient platform for parallel processing. It uses data-driven

References (21)

  • R.A. Iannucci, Multithreaded Computer Architecture: A Summary of the State of the Art (1994)
  • Panel Discussion, Architectural implementation issues for multithreading, in: R. Iannucci et al. (Eds.), Multithreaded...
  • J. Dennis, G. Gao, Multithreaded architectures: principles, projects and issues, in: R. Iannucci et al. (Eds.),...
  • P. Evripidou, J-L. Gaudiot, The USC decoupled multilevel data-flow execution model, in: Advanced Topics in Data-Flow...
  • Arvind, K.P. Gostelow, The U-Interpreter, IEEE Computer (1982)...
  • Arvind, R.A. Iannucci, Two fundamental issues in multiprocessors: the data-flow solution, Technical Report LCS/TM-241,...
  • C. Brownhill, A. Nicolau, S. Novack, C. Polychronopoulos, The PROMIS compiler prototype, in: Proceedings of the 1997...
  • P. Evripidou et al., Block scheduling of iterative algorithms and graph priority in a simulated data-flow multiprocessor, IEEE Transactions on Parallel and Distributed Systems (1993)
  • D.A. Patterson, Reduced instruction set computers, Communications of the ACM (1985)
  • Arvind, R.S. Nikhil, Executing a program on the MIT tagged-token data flow architecture, in: Parallel Architectures and...