Special Issue Article
Performance Studies of Id on the Monsoon Dataflow System
Abstract
In this paper, we examine the performance of Id, an implicitly parallel language, on Monsoon, an experimental dataflow machine. One of the precepts of our work is that the Id run-time system and compiled Id programs should run on any number of Monsoon processors without change. Our experiments running Id programs on Monsoon show that speedups of more than 7 are easily achieved on 8 processors for most of the applications that we studied. We explain the sources of overhead that limit the speedup of each of our benchmark programs. We also compare the performance of Id on a single Monsoon processor with C/Fortran on a DECstation 5000 (MIPS R3000 processor) to establish a baseline for the efficiency of Id execution on Monsoon. We find that the execution of Id programs on one Monsoon processor takes up to three times as many cycles as the corresponding C or Fortran programs executing on a MIPS R3000 processor. We identify the sources of inefficiency on Monsoon and suggest improvements where possible. In many cases, however, improving single-processor performance will reduce parallel-processor performance.
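As a brief worked illustration of the headline figure, the reported speedup can be restated as a parallel efficiency. The symbols $T_1$, $T_p$, $S$ and $E$ are standard notation assumed here for illustration; they do not appear in the abstract.

```latex
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
S(8) > 7 \;\Longrightarrow\; E(8) > \frac{7}{8} = 0.875
```

Under these definitions, a speedup of more than 7 on 8 processors corresponds to a parallel efficiency above 87.5%.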
Cited by (30)
On the performance of pure and impure parallel functional programs
1999, Parallel Computing
This paper reports on the memory performance of parallel scientific algorithms, written in both pure and impure functional styles. The Id programming language is used, since it allows both pure and impure parallel functional programs to be expressed. The non-strict storage model of Id is introduced. The study focuses on two algorithms: the Dongarra-Sorensen Eigensolver and the NAS FT three-dimensional heat equation solver, based on FFTs.
This study verifies the claim that functional languages allow a composition of programs from modules, exploiting the inter- and intra-module parallelism without the need for rewriting these modules. But it also shows that memory use of pure functional programs can be excessive, and that impure functional programs can be as memory-efficient as imperative implementations.
Design of cache memories for dataflow architecture
1998, Journal of Systems Architecture
The recent advance in dataflow processing, which combines the dataflow paradigm with the control-flow paradigm, has raised many challenging new issues. This hybrid organization has made it possible to study and adapt familiar control-flow concepts such as cache memories within the framework of the dataflow architecture.
The concept of cache memory has proven its effectiveness in the von Neumann architecture due to the spatial and temporal localities that govern conventional program execution. The dataflow paradigm does not inherently support locality, since the execution sequence is enforced only by the availability of operands. However, dataflow programs can be reordered based on various criteria to enhance the locality of instruction references. This can be achieved by: (i) careful partitioning of a dataflow program into vertical layers of data-dependent instructions; and (ii) proper distribution and allocation of the recurrence portions of the dataflow program. Enhancing the locality of data references in the dataflow architecture is a more challenging problem. This paper studies the design of instruction, data (operand), and I-Structure cache memories using the Explicit Token Store (ETS) model of a dataflow system. The performance results obtained using various benchmark programs are presented and analyzed.
A Comparison of Implicitly Parallel Multithreaded and Data-Parallel Implementations of an Ocean Model
1998, Journal of Parallel and Distributed Computing
Two parallel implementations of a state-of-the-art ocean model are described and analyzed: one is written in the implicitly parallel language Id for the Monsoon multithreaded dataflow architecture, and the other in data-parallel CM Fortran for the CM-5. The multithreaded programming model is inherently more expressive than the data-parallel model but is not especially adapted to regular data structures common to many scientific codes. One goal of this study is to understand what performance penalties, if any, multithreaded execution incurs when implementing a program that is well suited for data-parallel execution. To avoid technology and machine configuration issues, the two implementations are compared in terms of overhead cycles per required floating-point operation. When flows in complex geometries typical of ocean basins are simulated, the data-parallel model only remains efficient if redundant computations are performed over land. The generality of the Id programming model, however, allows one to easily and transparently implement a parallel code that computes only in the ocean. When ocean basins with complex and irregular geometry are simulated, the normalized performance on Monsoon is comparable with that of the CM-5. For more regular geometries that map well to the computational domain, the data-parallel approach proves to be a better match. We conclude by examining the extent to which clusters of mainstream symmetric multiprocessor (SMP) systems offer a scientific computing environment that can capitalize on and combine the strengths of the two paradigms.
A visual dataflow programming environment for a real time parallel vision machine
1995, Journal of Visual Languages and Computing
Programming parallel architectures dedicated to real-time image processing (IP) is often a difficult and error-prone task. This mainly results from the fact that IP algorithms typically involve several distinct processing levels and data representations, and that various execution models as well as complex hardware are needed for handling these processing layers under real-time constraints.
Our goal is to permit an intuitive but still efficient handling of such an architecture by providing a continuous and readable path from the functional specification of an algorithm to its corresponding hardware implementation. For this, we developed a data-flow programming model which can act simultaneously as a functional representation of algorithms and as a structural description of their corresponding implementations on a target computer built up of 3-D interconnected data-driven processing elements (DDPs).
Algorithms are decomposed into functional primitives viewed as top-level nodes of a data-flow graph (DFG). Each node is given a known physical implementation on the target architecture, either as a single DDP or as an encapsulated subgraph of DDPs, making the well-known mapping problem a topological one.
The target computer was built at ETCA and embeds 1024 custom data-driven processors and 12 transputers in a 3-D interconnected network. Concurrently with the machine, a complete programming environment has been developed. Relying upon a functional compiler, a large library of IP primitives and automatic place-and-route facilities, it also includes various X-Window based tools aiming at visual and efficient access to all intermediary program representations.
In terms of visual languages, we try to share the burden between all the layers of this programming environment. Rather than including some display facilities in an existing software environment, we have taken advantage of the intuitiveness of functional representations, even textual ones, and of the hardware efficiency that provides immediate results, ultimately supporting hierarchical constructs.
Exploiting data structure locality in the dataflow model
1995, Journal of Parallel and Distributed Computing
Although the dataflow model has been shown to allow the exploitation of parallelism at all levels, research of the past decade has revealed several fundamental problems. Synchronization at the instruction level, token matching, coloring, and re-labeling operations have a negative impact on performance by significantly increasing the number of non-compute "overhead" cycles. Recently, many novel hybrid von Neumann/data-driven machines have been proposed to alleviate some of these problems. The major objective has been to reduce or eliminate unnecessary synchronization costs through simplified operand matching schemes and increased task granularity. Moreover, the results from recent studies quantifying locality suggest that sufficient spatial and temporal locality is present in dataflow execution to merit its exploitation. In this paper we present a data structure for exploiting locality in a data-driven environment: the vector cell. A vector cell consists of a number of fixed-length chunks of data elements. Each chunk is tagged with a presence bit, providing intra-chunk strictness and inter-chunk non-strictness to data structure access. We describe the semantics of the model, processor architecture and instruction set as well as a Sisal-to-dataflow vectorizing compiler back-end. The vector cell model is evaluated by comparing its performance to those of both a classical fine-grain dataflow processor employing I-structures and a conventional pipelined vector processor. Results indicate that the model is surprisingly resilient to long memory and communication latencies and is able to dynamically exploit the underlying parallelism across multiple processing elements at run time.
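The chunk-and-presence-bit idea described in this abstract can be sketched in a few lines of Python. This is a toy model only, not the paper's design: the class name `VectorCell`, its methods, the write-once rule, and the deferred-reader queue (borrowed from the usual description of I-structure semantics) are all assumptions made for illustration.

```python
class VectorCell:
    """Toy model: fixed-length chunks, each guarded by one presence bit."""

    def __init__(self, num_chunks, chunk_len):
        self.chunk_len = chunk_len
        self.chunks = [None] * num_chunks                # chunk payloads
        self.present = [False] * num_chunks              # one presence bit per chunk
        self.deferred = [[] for _ in range(num_chunks)]  # readers waiting per chunk

    def write_chunk(self, c, values):
        """Fill chunk c as a whole (intra-chunk strictness), then wake readers."""
        if self.present[c]:
            raise ValueError(f"chunk {c} already written")  # assumed write-once
        if len(values) != self.chunk_len:
            raise ValueError("a chunk must be written whole")
        self.chunks[c] = list(values)
        self.present[c] = True
        waiting, self.deferred[c] = self.deferred[c], []
        for offset, consumer in waiting:
            consumer(self.chunks[c][offset])

    def read(self, i, consumer):
        """Deliver element i to consumer now, or once its chunk has arrived."""
        c, offset = divmod(i, self.chunk_len)
        if self.present[c]:
            consumer(self.chunks[c][offset])
        else:
            # Inter-chunk non-strictness: other chunks stay usable meanwhile.
            self.deferred[c].append((offset, consumer))


def demo():
    cell = VectorCell(num_chunks=2, chunk_len=4)
    seen = []
    cell.write_chunk(0, [10, 11, 12, 13])
    cell.read(2, seen.append)   # chunk 0 is present: delivered immediately
    cell.read(5, seen.append)   # chunk 1 is absent: the read is deferred
    before = list(seen)
    cell.write_chunk(1, [20, 21, 22, 23])
    return before, seen
```

In `demo`, the read of element 2 completes at once, while the read of element 5 is satisfied only after chunk 1 is written, so synchronization happens once per chunk rather than once per element.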
The Q100 Database Processing Unit
2015, IEEE Micro