D3-Machine: A decoupled data-driven multithreaded architecture with variable resolution support
Introduction
Multithreading has emerged as one of the most promising and exciting approaches for the exploitation of parallelism. It utilizes techniques developed in several independent research directions such as Data-Flow (DF), RISC, compilation for Instruction-Level Parallelism (ILP) and dynamic resource management [1]. Multithreading provides hardware support for more than one concurrent program counter, and the ability to switch among them with some efficiency [2]. A thread of control is very similar to the notion of a process from multiprogramming. The main difference is that a thread in a multithreaded machine is visible at the architecture level [3].
The Decoupled Data-Driven machine (D3-machine) is a multithreaded architecture that employs data-driven sequencing based on the Decoupled Data-Driven (D3) model [4]. A thread can be scheduled for execution only after all its input data have been produced and have arrived at the level of the memory hierarchy closest to the execution unit. Once a thread is scheduled for execution in the D3-machine, it proceeds without interruption. The D3 model of execution decouples (separates) each thread (actor) into two parts: the graph (or synchronization) portion and the computation portion. The computation portion of each actor is a collection of conventional instructions (load/store, add, etc.). The graph portion contains information about the executability of the actor and its consumers. Thus, a D3 graph can be viewed as a partially ordered conventional program with a data-dependency graph superimposed on it. The graph synchronization is handled by the Data-Flow Graph Engine (DFGE), whereas the computation instructions are executed by the Computation Engine (CE). The two engines execute in a decoupled, i.e., asynchronous, mode. This helps to reduce the length of the processors' critical path.
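The two-engine decoupling can be sketched in a few lines of Python. The names `Thread`, `DFGE`, and `CE` mirror the paper's terminology, but the data structures and interfaces are illustrative assumptions, not the machine's actual implementation:

```python
from collections import deque

class Thread:
    """One actor: a graph portion (count, consumers) plus a computation portion (body)."""
    def __init__(self, tid, n_inputs, consumers, body):
        self.tid = tid
        self.count = n_inputs        # graph portion: inputs still missing
        self.consumers = consumers   # graph portion: threads fed by this one
        self.body = body             # computation portion: plain sequential code

class DFGE:
    """Data-Flow Graph Engine: tracks executability and feeds the ready queue."""
    def __init__(self, threads):
        self.threads = threads
        self.ready = deque(t for t in threads.values() if t.count == 0)

    def token_arrived(self, tid):
        t = self.threads[tid]
        t.count -= 1
        if t.count == 0:             # all inputs present: thread becomes schedulable
            self.ready.append(t)

class CE:
    """Computation Engine: runs each scheduled thread to completion, uninterrupted."""
    def run(self, dfge):
        results = []
        while dfge.ready:
            t = dfge.ready.popleft()
            results.append(t.body())       # execute the computation portion
            for c in t.consumers:          # routing is handled back in the DFGE
                dfge.token_arrived(c)
        return results
```

Because the DFGE alone decides executability, the CE never stalls waiting for operands: a consumer thread is enqueued only after all its producers have run.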
The abstract model of execution of the D3-machine has as its starting point the dynamic DF graphs that are based on the U-Interpreter model [5]. The basic unit of computation, however, is the thread (macro-actor in the DF terminology) and not the instruction. We have developed extensions to the basic model, such as hierarchical matching and variable resolution support, that make it more suitable for multithreaded execution. The entire dynamic DF graph is mapped onto the logical space (virtual memory) of the machine. Therefore, for each instantiation of a thread, we know at execution time where its data will reside by using the standard virtual memory translation techniques. For each thread there is a synchronization point, in the logical space, that keeps track of the number of inputs that have been produced. Therefore, all the DF synchronization operations can be implemented by references/updates to the logical space. These references are mapped into the physical space using conventional virtual space mapping techniques.
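A minimal sketch of this mechanism, assuming a flat dictionary standing in for the logical space and an invented address layout (`base` and `stride` are arbitrary illustrative constants, not the machine's actual mapping):

```python
LOGICAL_SPACE = {}   # stands in for virtual memory; keyed by address

def sync_addr(context_instance, thread_id, base=0x1000, stride=0x100):
    """The synchronization point's address is a pure function of the instantiation."""
    return base + context_instance * stride + thread_id

def init_sync(context_instance, thread_id, n_inputs):
    """Set the synchronization word to the number of inputs the thread expects."""
    LOGICAL_SPACE[sync_addr(context_instance, thread_id)] = n_inputs

def produce_input(context_instance, thread_id):
    """A producer delivers a token: an ordinary decrement at a computed address."""
    addr = sync_addr(context_instance, thread_id)
    LOGICAL_SPACE[addr] -= 1
    return LOGICAL_SPACE[addr] == 0   # True => the thread is now executable
```

Because the address is known at execution time from the instantiation alone, no associative matching hardware is needed: a conventional read-modify-write at a translated virtual address implements token matching.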
The computation element of each PE of the D3-machine is a high-performance microprocessor supporting some form of ILP. However, in order to achieve efficient parallel processing it is not sufficient to just connect a large number of von Neumann processors; the latency and synchronization issues [6] must also be resolved. The D3-machine tolerates long latency and synchronization costs by employing data-driven thread scheduling. Furthermore, it exploits locality by varying the length of the threads to match the machine and program characteristics for best performance.
Several deterministic and stochastic simulation experiments have been conducted in order to assess the D3-machine's ability to tolerate latency, and to exploit parallelism and locality. The results, presented in this paper, indicate that the D3-machine can indeed tolerate long latencies. For example, increasing the communication latency from 1 to 5 cycles results in an increase in execution time of no more than 25%; going from 5 to 15 cycles increases the execution time by 37%. Furthermore, it does exploit locality: the simulation results have indicated that increasing the thread length (TL) does indeed reduce execution time. Finally, the experiments have shown that the D3-machine for the most part neutralizes the overhead associated with data-driven synchronization: a fivefold increase in the processing time per thread in the DFGE resulted in an increase of the overall execution time of no more than 15%.
In summary, the main features of the D3-machine include its ability to:
- Tolerate long latencies and synchronization delays by using the dynamic DF principles of execution for thread scheduling and synchronization.
- Neutralize the overhead associated with dynamic DF scheduling by decoupling the synchronization portion of a thread from the computation portion.
- Keep the computation portion of the processor architecture close to that of sequential machines. This enables the use of state-of-the-art microprocessors and compilation technology in the CE design.
- Employ threads that fully exploit the internal architecture through the use of variable resolution threads.
Section snippets
Multithreaded graphs with variable resolution
The generation and scheduling of threads are based on the dynamic DF principles of execution [5]. The partitioning of the DF graph into threads is done according to a number of optimizing criteria; among them are the classic ones, i.e., improving locality, reducing communication overhead and increasing the available parallelism. However, increasing the level of granularity decreases the amount of parallelism. Thus, there is a tradeoff involved in determining the level of resolution. The
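As a toy illustration of this tradeoff (a greedy grouping assumed purely for illustration, not the paper's partitioning algorithm), a linear dependence chain of actors can be cut into threads of at most `tl` actors:

```python
def partition_chain(actors, tl):
    """Group a linear chain of actors into threads of at most tl actors each."""
    return [actors[i:i + tl] for i in range(0, len(actors), tl)]
```

Raising `tl` yields fewer, longer threads, improving locality and amortizing synchronization overhead over more instructions, but it serializes the actors inside each thread and so reduces the parallelism exposed to the machine.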
Decoupled data-driven model
The basic cycle of a generic DF system can be decomposed into four stages: (1) determination of executability by operand matching, (2) instruction fetch, (3) execution of the instruction, and (4) token formatting and routing. One such configuration is depicted in Fig. 3(a). The first and fourth stages make up the graph portion of an operation (or graph overhead), while stages two and three make up the computation portion of the operation.
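The four stages can be written as one pass over an arriving token. This is an illustrative sketch only (the `Token` and `Actor` structures are assumptions); stages (1) and (4) are the graph portion, stages (2) and (3) the computation portion:

```python
from dataclasses import dataclass

@dataclass
class Token:
    dest: int        # destination actor
    addr: object     # operand slot in memory
    value: object

@dataclass
class Actor:
    count: int           # operands still missing
    instruction: tuple   # (opcode, operand addresses)
    consumers: list

def df_cycle(token, graph, memory):
    """One pass through the four-stage generic DF cycle (illustrative sketch)."""
    # (1) determination of executability by operand matching
    memory[token.addr] = token.value
    actor = graph[token.dest]
    actor.count -= 1
    if actor.count > 0:
        return []                                  # not yet executable
    # (2) instruction fetch
    opcode, operand_addrs = actor.instruction
    # (3) execution of the instruction
    result = opcode(*(memory[a] for a in operand_addrs))
    # (4) token formatting and routing to the consumer actors
    return [Token(dest=c, addr=("out", c), value=result) for c in actor.consumers]
```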
The cyclic pipeline of Fig. 3(a) represents the generic
D3-Architecture: Runtime environment and execution model
The D3-machine uses a hierarchical scheduling policy: static scheduling for coarse-grain objects and dynamic (data-driven) scheduling for fine-grain objects. As mentioned earlier, a program is a collection of code blocks called context blocks, representing functions or loop bodies. Each context block comprises several threads. Each thread has a unique identification number. A group of context blocks that must be active simultaneously is called an activation block. Scheduling of activation blocks is done
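The scheduling hierarchy can be summarized with the following assumed data structures; round-robin placement stands in for the static scheduler, whose actual policy is not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class ContextBlock:            # a function or loop body
    cb_id: int
    thread_ids: list           # unique ids of the threads it contains

@dataclass
class ActivationBlock:         # context blocks that must be active simultaneously
    ab_id: int
    context_blocks: list
    pe: int = None             # processing element chosen by the static scheduler

def static_schedule(activation_blocks, n_pes):
    """Coarse grain: place activation blocks on PEs (round-robin assumed here)."""
    for i, ab in enumerate(activation_blocks):
        ab.pe = i % n_pes
    return activation_blocks
```

Within each placed context block, the individual threads are then scheduled dynamically by the DFGE in data-driven fashion, as described above.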
Deterministic simulations
The goal of the experimental performance evaluation phase of the work was to provide proof-of-concept by demonstrating that the D3-machine: (i) is scalable; (ii) can tolerate large communication and synchronization latencies; (iii) exploits locality; and (iv) provides an efficient platform for implementing data-driven thread scheduling. The D3-simulation facility gives the user several degrees of freedom through the following user-defined simulation parameters:
- TL: Thread length; Number of
Related work
The D3-machine is a non-blocking multithreading machine that has its origins in dynamic DF sequencing. Most of the basic features of the D3-machine (decoupling of synchronization and computation, use of the virtual space for data-driven synchronization, and presence in the cache as a necessary condition for scheduling) have been proposed in our earlier work [4]. The D3-machine is the evolution and integration of these ideas into a multithreaded machine. Furthermore, in this paper we present
Concluding remarks and future work
The D3-machine described here is a multithreaded machine that supports dynamic DF scheduling. The D3-machine decouples the synchronization portion of a thread from the computational portion.
The decoupled model was conceived for ease of implementation and it has adopted a large portion of the state-of-the-art in processor and compiler technology. Experimental performance evaluation has demonstrated that the D3-machine provides an efficient platform for parallel processing. It uses data-driven
References (21)
- Multithreaded Computer Architecture: A Summary of the State of the Art (1994)
- Panel Discussion, Architectural implementation issues for multithreading, in: R. Iannucci et al. (Eds.), Multithreaded...
- J. Dennis, G. Gao, Multithreaded architectures: principles, projects and issues, in: R. Iannucci et al. (Eds.), ...
- P. Evripidou, J-L. Gaudiot, The USC decoupled multilevel data-flow execution model, in: Advanced Topics in Data-Flow...
- Arvind, K.P. Gostelow, The U-Interpreter, IEEE Computer (1982)
- Arvind, R.A. Iannucci, Two fundamental issues in multiprocessors: the data-flow solution, Technical Report LCS/TM-241,...
- C. Brownhill, A. Nicolau, S. Novack, C. Polychronopoulos, The PROMIS compiler prototype, in: Proceedings of the 1997...
- et al., Block scheduling of iterative algorithms and graph priority in a simulated data-flow multiprocessor, IEEE Transactions on Parallel and Distributed Systems (1993)
- Reduced instruction set computers, Communications of the ACM (1985)
- Arvind, R.S. Nikhil, Executing a program on the MIT tagged-token data flow architecture, in: Parallel Architectures and...
Cited by (9)
- Architectural support for data-driven execution, ACM Transactions on Architecture and Code Optimization (2014)
- Data-triggered threads: Eliminating redundant computation, Proceedings - International Symposium on High-Performance Computer Architecture (2011)
- Chip multiprocessor based on data-driven multithreading model, International Journal of High Performance Systems Architecture (2007)
- Data-driven multithreading using conventional microprocessors, IEEE Transactions on Parallel and Distributed Systems (2006)
- A case for chip multiprocessors based on the data-driven multithreading model, International Journal of Parallel Programming (2006)
- Hardware budget and runtime system for data-driven multithreaded chip multiprocessor, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2006)