## Couillard: Parallel Programming via Coarse-Grained Data-Flow Compilation

Leandro A. J. Marzulo, Tiago A. O. Alves, Felipe M. G. França, Universidade Federal do Rio de Janeiro, Programa de Engenharia de Sistemas e Computação, COPPE, Rio de Janeiro, RJ, Brasil, {tiagoaoa, lmarzulo, felipe}@cos.ufrj.br

Vítor Santos Costa, Universidade do Porto, Departamento de Ciência de Computadores, Porto, Portugal, vsc@dcc.fc.up.pt

## Abstract

Data-flow is a natural approach to parallelism. However, describing dependencies and control between fine-grained data-flow tasks can be complex and can introduce unwanted overheads. TALM (TALM is an Architecture and Language for Multi-threading) introduces a user-defined coarse-grained parallel data-flow model, where programmers identify code blocks, called super-instructions, to be run in parallel and connect them in a data-flow graph. TALM has been implemented as a hybrid Von Neumann/data-flow execution system: the Trebuchet. We have observed that TALM's usefulness largely depends on how programmers specify and connect super-instructions. Thus, we present Couillard, a full compiler that creates, based on an annotated C-program, a data-flow graph and C-code corresponding to each super-instruction. We show that our toolchain allows one to benefit from data-flow execution and explore sophisticated parallel programming techniques with little effort. To evaluate our system we have executed a set of real applications on a large multi-core machine. Comparison with popular parallel programming methods shows competitive speedups, while providing an easier parallel programming approach.

## 1. Introduction

Data-flow programming provides a natural approach to parallelism, where instructions execute as soon as their input operands are available [12, 21, 18, 23]. In dynamic data-flow, independent instructions from multiple iterations of a loop may even run simultaneously, as parts of the loop may run faster than others and reach the next iterations earlier. It is therefore complex to describe control in data-flow, since an instruction must only proceed to execution when operands from the same iteration match. However, this difficulty is compensated by the amount of parallelism exploited this way.
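The matching rule described above can be illustrated with a minimal Python sketch. It is not TALM or Trebuchet code: the class name `MatchingStore`, the port numbering, and the use of an integer iteration tag are assumptions made only for this illustration. An instruction fires for a given tag only once every input port holds an operand carrying that same tag.

```python
from collections import defaultdict


class MatchingStore:
    """Holds operands until all input ports have a value for the same iteration tag."""

    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.waiting = defaultdict(dict)  # tag -> {port: value}

    def receive(self, tag, port, value):
        """Store one operand; return the operand tuple once the tag is complete."""
        slots = self.waiting[tag]
        slots[port] = value
        if len(slots) == self.n_inputs:
            del self.waiting[tag]
            return tuple(slots[p] for p in range(self.n_inputs))
        return None


store = MatchingStore(n_inputs=2)
fired = []
# Operands arrive out of order: iteration 1 is partly ready before iteration 0.
arrivals = [(1, 0, 10), (0, 0, 1), (0, 1, 2), (1, 1, 20)]
for tag, port, value in arrivals:
    operands = store.receive(tag, port, value)
    if operands is not None:
        fired.append((tag, sum(operands)))  # the "instruction" here is an addition
```

Operands from different iterations are never combined, even though the first operand of iteration 1 arrives before anything from iteration 0.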

TALM (TALM is an Architecture and Language for Multi-threading) [19, 2, 3] is an execution model designed to exploit the advantages of data-flow in multithreaded programming. A program in TALM comprises code blocks called *super-instructions* and simple instructions, connected in a graph according to their data dependencies (i.e. a data-flow graph). To parallelize a program using the TALM model, the programmer marks the portions of code that are to become super-instructions and describes their dependencies. With this approach, parallelism comes naturally from data-flow execution.

The major advantage of TALM is that it provides a coarse-grained parallel model that can take advantage of data-flow. It is also a very flexible model, as the main data-flow instructions are available, thus allowing full compilation of control in a data-flow fashion. This gives the programmer the latitude to choose from coarser to more fine-grained execution strategies. This approach contrasts with previous work in data-flow programming [12, 21, 23], which often aimed at hiding data-flow execution from the programmer.

A first implementation of TALM, the Trebuchet system, has been developed as a hybrid Von Neumann/data-flow execution system for thread-based architectures on shared-memory platforms. Trebuchet emulates a data-flow machine that supports both simple instructions and super-instructions. Super-instructions are compiled as separate functions that are called by the runtime environment, while regular instructions are interpreted upon execution. Although Trebuchet needs to emulate data-flow instructions, experience showed that most of the running time is spent within super-instructions. Initial results show the parallel engine to be competitive with state-of-the-art parallel applications using OpenMP, both in terms of base performance and in terms of speedups [19, 2, 3]. On the other hand, parallelism for simple SPMD (Single-Program Multiple Data) applications can be explored quite well with tools such as OpenMP. The main benefits of TALM become apparent when experimenting with applications that require more complex techniques, such as software pipelining or speculative execution.

The usefulness of TALM clearly depends on how the programmer can specify and connect super-instructions together, including the complex task of describing control using data-flow instructions. We therefore introduce *Couillard*, a C-compiler designed to compile TALM annotated C-programs into a data-flow graph, including the description of program control using dynamic data-flow. *Couillard* is designed to insulate the programmer from the details of data-flow programming. By requiring the programmer to just annotate the code with the super-instruction definitions and their dependencies, *Couillard* greatly simplifies the task of parallelizing applications with TALM.

This work makes two contributions:

- We define the TALM language as an extension of ANSI C and present a full implementation of the *Couillard* compiler, which generates data-flow graphs and super-instruction code for TALM.
- We evaluate the performance of *Couillard* on two state-of-the-art PARSEC [6] benchmarks. We demonstrate that *Trebuchet* and *Couillard* allow one to explore complex parallel programming techniques, such as non-linear software pipelines and hiding I/O latency. Comparison with popular parallel programming models, such as Pthreads [8], OpenMP [9] and Intel Threading Building Blocks [22], shows that our approach is not just competitive with state-of-the-art technology, but can in fact achieve better speedups by allowing one to easily exploit a sophisticated design space for parallel programs.

The paper is organized as follows. In Sect. 2 we briefly review the TALM architecture and its implementation, the *Trebuchet*. In Sect. 3 we describe the TALM language and the Couillard implementation. In Sect. 4 we present performance results on the two PARSEC benchmarks. In Sect. 5 we discuss related work. Last, we present our conclusions and discuss future work.

## 2. TALM and Trebuchet

TALM [19, 2, 3] allows application developers to take advantage of the possibilities of the data-flow model on current Von Neumann architectures, in order to explore thread-level parallelism (TLP) in a more flexible way. The TALM ISA represents an application as a data-flow graph that can be run in parallel.

A main contribution of TALM is that it enables programmers to introduce user-defined instructions, the so-called *super-instructions*. TALM assumes a contract with the programmer, whereby she or he guarantees that execution of the super-instruction can start once all inputs are available, and that output arguments are made available as soon as possible, but not sooner. Otherwise, TALM has no information on the semantics of individual super-instructions, and indeed imposes no restrictions. Thus, a programmer can use shared memory in super-instructions without having to inform TALM. Although this requires extra care from the programmer, the advantage is that TALM allows easy porting of imperative programs and easy program refinement.

TALM has been implemented for multi-cores as a hybrid Von Neumann/data-flow execution system: the *Trebuchet*. *Trebuchet* is in fact a data-flow virtual machine that has a set of data-flow processing elements (PEs) connected in a virtual network. Each PE is associated with a thread at the host (Von Neumann) machine. When a program is executed on *Trebuchet*, instructions are loaded into the individual PEs and fired according to the Data-flow model. Independent instructions will run in parallel if they are mapped to different PEs and there are available cores at the host machine to run those PEs' threads simultaneously.

*Trebuchet* is a POSIX-threads-based implementation of TALM. It loads super-instructions as a dynamically linked library. At run-time, execution of super-instructions is fired by the virtual machine, according to the data-flow model, but their interpretation comes down to a procedure call resulting in the direct execution of the related block.
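As a rough illustration of this firing scheme, the sketch below runs a small graph in which each node is an ordinary function that is called once all of its producers have delivered a result. It is a sequential toy model, not the Trebuchet implementation: the function `run_graph`, the node names, and the dictionary-based graph encoding are assumptions for this example, and a real PE-based runtime would execute the ready nodes of each round in parallel.

```python
def run_graph(nodes, edges, inputs):
    """Fire each node once all of its producers have a result.

    nodes:  name -> callable standing in for a super-instruction
    edges:  name -> list of producer names, in operand order
    inputs: name -> value for operands supplied from outside the graph
    """
    results = dict(inputs)
    pending = [n for n in nodes if n not in results]
    order = []
    while pending:
        ready = [n for n in pending if all(p in results for p in edges[n])]
        if not ready:
            raise ValueError("cycle or missing operand")
        for n in ready:
            # Firing a node comes down to a plain function call.
            results[n] = nodes[n](*(results[p] for p in edges[n]))
            order.append(n)
            pending.remove(n)
    return results, order


nodes = {
    "init": lambda: [1, 2, 3, 4],
    "left": lambda xs: sum(xs[:2]),
    "right": lambda xs: sum(xs[2:]),
    "join": lambda a, b: a + b,
}
edges = {"init": [], "left": ["init"], "right": ["init"], "join": ["left", "right"]}
results, order = run_graph(nodes, edges, {})
```

Here `left` and `right` become ready in the same round, which is the point at which independent instructions mapped to different PEs could run simultaneously.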



Figure 1. Work-flow to follow when writing parallel applications with *Trebuchet*.

*Trebuchet* may either rely solely on static scheduling of instructions among PEs or also use work-stealing as a tool against load imbalance. The work-stealing algorithm employed by *Trebuchet* is based on the ABP algorithm [4], the main difference being that the algorithm developed for *Trebuchet* provides a FIFO double-ended queue (deque) instead of a LIFO one, as is the case for the ABP algorithm. The FIFO order is chosen so that older instructions have execution priority, which is desirable for the applications we currently target.
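The effect of the FIFO order can be sketched as follows. This is only an illustration under stated assumptions: the class `FifoStealDeque` is hypothetical, it uses a lock for simplicity whereas the ABP algorithm is non-blocking, and the choice that a thief takes from the newest end is an assumption of this sketch, since the text above only specifies that the owner executes older instructions first.

```python
import threading
from collections import deque


class FifoStealDeque:
    """Toy ready-queue for one PE: the owner consumes the oldest entry first."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()  # simplification; ABP itself is lock-free

    def push(self, instr):
        """Owner side: enqueue a newly ready instruction."""
        with self._lock:
            self._items.append(instr)

    def pop_oldest(self):
        """Owner side: FIFO order, so older instructions have priority."""
        with self._lock:
            return self._items.popleft() if self._items else None

    def steal(self):
        """Thief side (assumed here): take from the opposite, newest end."""
        with self._lock:
            return self._items.pop() if self._items else None


q = FifoStealDeque()
for instr in ("i0", "i1", "i2"):
    q.push(instr)
first_for_owner = q.pop_oldest()
taken_by_thief = q.steal()
```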

Figure 1 shows the work-flow to be followed in order to parallelize a sequential program and execute it on *Trebuchet*. Initially, the blocks that will form super-instructions are defined. Then, super-instruction code extraction is performed to transform all blocks into functions that collect input operands from *Trebuchet*, process them and return output operands. Profiling tools may be used to help determine which portions of code are interesting candidates for parallelization.

In the next step, the transformed blocks are compiled into a dynamic library, which will be available to the abstract machine interpreter. Then, a data-flow graph connecting all blocks is defined and the data-flow assembly code is generated. The code may have both super-instructions and simple (fine-grained) instructions. TALM provides all the standard data and control instructions that one would expect in a dynamic data-flow machine.

Last, a data-flow binary is generated from the assembly, processor placement is defined, and the binary code is loaded and executed. As said above, execution of simple instructions requires full interpretation, whereas super-instructions are directly executed on the host machine.

In [2, 3] TALM was used to parallelize a set of 7 applications: a matrix determinant calculation, a matrix multiplication application, a ray tracing application, Equake from SpecOMP 2001, IS from NPB3.0-OMP, and also LU and Mandelbrot from the OpenMP Source code Repository [11]. The achieved speedups for 8 threads, in relation to the sequential versions were, respectively 2.52, 4.16, 4.39, 3.61, 3.00, 2.19 and 7.16. On the other hand, OpenMP versions of those benchmarks have provided speedups of 1.96, 4.15, 4.39, 3.40, 3.11, 2.19 and 7.13. These results are very promising, and show that *Trebuchet* can be very competitive with OpenMP for regular applications.
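To make these figures easier to compare, the short Python sketch below derives the parallel efficiency (speedup divided by the 8 threads used) and the Trebuchet-to-OpenMP speedup ratio for each application. The speedup values are the ones quoted above; the short application labels are chosen here only for readability.

```python
apps = ["determinant", "matmul", "raytrace", "Equake", "IS", "LU", "Mandelbrot"]
trebuchet = [2.52, 4.16, 4.39, 3.61, 3.00, 2.19, 7.16]
openmp = [1.96, 4.15, 4.39, 3.40, 3.11, 2.19, 7.13]
threads = 8

# Parallel efficiency of the Trebuchet versions: speedup / number of threads.
efficiency = {a: s / threads for a, s in zip(apps, trebuchet)}

# Ratio > 1 means Trebuchet achieved the higher speedup for that application.
ratio = {a: t / o for a, t, o in zip(apps, trebuchet, openmp)}

for a in apps:
    print(f"{a:12s} efficiency={efficiency[a]:.2f} trebuchet/openmp={ratio[a]:.2f}")
```

IS is the only application in this set where the OpenMP version obtained the higher speedup.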

Trebuchet provides a natural platform for experimenting with advanced parallel programming techniques. In [19] a thread-level speculation model based on optimistic transactions with ordered commits was created for TALM and implemented in *Trebuchet*. Execution of speculative instructions is done within transactions, each one formed by one speculative instruction and its related **Commit** instruction. Transactions will have access only to local copies of the used resources. Once they finish running, if no conflicts are found, local changes will be persisted to global state by **Commit** instructions, associated with each speculative instruction. In case conflicts are found in a speculative instruction I, local changes will be discarded and I will have to be re-executed.
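The transaction scheme can be sketched in a few lines of Python. This is a simplified model and not the mechanism of [19]: the function `run_speculative`, the representation of a task as a function returning its read set and its writes, and conflict detection by comparing the values read against the committed state are all assumptions of this illustration. Speculative runs are performed sequentially here, whereas in *Trebuchet* they would run in parallel.

```python
def run_speculative(tasks, state):
    """Run every task on a private copy of the state, then commit in order.

    Each task is a function f(local_copy) -> (read_keys, writes).
    Returns the indices of the tasks that had to be re-executed.
    """
    speculative = []
    for task in tasks:
        snapshot = dict(state)  # local copy of the resources
        reads, writes = task(snapshot)
        speculative.append((snapshot, reads, writes))

    reexecuted = []
    for i, task in enumerate(tasks):  # ordered commits
        snapshot, reads, writes = speculative[i]
        if any(state[k] != snapshot[k] for k in reads):
            # Conflict: discard local changes and re-execute on current state.
            reads, writes = task(dict(state))
            reexecuted.append(i)
        state.update(writes)  # commit: persist local changes to global state
    return reexecuted


def deposit(account, amount):
    """Hypothetical bank operation used as a speculative task."""
    def task(local):
        return {account}, {account: local[account] + amount}
    return task


state = {"a": 100, "b": 50}
reexecuted = run_speculative([deposit("a", 10), deposit("b", 5), deposit("a", 1)], state)
```

The third deposit read account `a` before the first one committed, so it is the only task that is re-executed.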

Using speculative execution frees the programmer to consider only explicit dependencies while guaranteeing correct execution of coarse-grained tasks. Moreover, the speculation mechanism does not demand centralized control, which is a key feature for upcoming many-core systems, where scalability has become an important concern. To evaluate the speculation system, an artificial bank-server simulator was implemented to simulate scenarios with varying computation load, transaction size, speculation depth, and contention. Results of executing this application with up to 24 threads on a 24-core machine suggest that there is a wide range of situations where speculation can be very effective and indeed achieve speedups close to the ideal case.

## 3. Compilation

The data-flow model exposes thread-level parallelism by taking advantage of how data is exchanged between processing elements. In this vein, programming in TALM is about identifying parallel tasks and how data is consumed and produced between them. The initial *Trebuchet* implementation provided an execution environment for multi-cores, plus an assembler and loader. It was up to the programmer to code super-instructions in the library and to write TALM assembly code linking the different instructions together and specifying control through data-flow instructions, which is not always a trivial task.

In this work we propose *Couillard*, a C-compiler for data-flow style execution. With *Couillard*, the programmer annotates the blocks of code that are going to become super-instructions, and further annotates the program variables that correspond to their inputs and outputs. *Couillard* then produces the C-code corresponding to each super-instruction, which is next compiled as a shared object for the target architecture and loaded by *Trebuchet*. *Couillard* also generates TALM assembly code to connect all super-instructions according to the user's specification. This assembly code represents the actual data-flow graph of the program. Control constructs such as loops and if-then-else statements that are not within a super-instruction are also compiled to TALM assembly code. This assembly code is then used by *Trebuchet* to guide execution, following the data-flow rules.

The *Couillard* front-end uses PLY (Python Lex-Yacc) [5] and a grammar that is a subset of ANSI-C extended with super-instruction constructs. The *Couillard* back-end generates TALM assembly code, the super-instructions' C-code (to be compiled into a dynamically linked library) and a graph representation of the program, using Graphviz notation [1].

## 3.1. Front-end

We assume that super-instructions take most of the running time of an application, as regular instructions are mostly used to describe the data and control relations between super-instructions. Since super-instruction code will be compiled using a regular C-compiler and regular instructions tend to be simple, *Couillard* does not need to support the full ANSI-C grammar. *Couillard* therefore adopts a subset of the ANSI-C grammar extended to support data-flow directives relative to super-instructions and their dependencies. We have also changed the syntax of variable declaration and access, which is necessary to parallelize super-instructions. The compiler front-end produces an AST (Abstract Syntax Tree) that will be processed to generate a data-flow graph representation.

#### 3.1.1 Blocks and Super-Instructions

The annotation pair **#BEGINBLOCK** and **#ENDBLOCK** is used to mark blocks of code that will *not* be compiled to data-flow. Those blocks usually contain include files, auxiliary function definitions, and global variable declarations, to be used by super-instruction code in the dynamic library.

Super-instruction annotation is performed according to the following syntax:

`#BEGINSUPER`

`#ENDSUPER`

Super-instructions declared as single will always have only one instance in the data-flow graph, while instructions declared as parallel may have multiple instances that can run in parallel, depending on the placement and availability of resources at the host machine. In the example of Fig. 3 (described in more detail in Section 3.4), we have single super-instructions at the beginning and end of the computation. In contrast, the inner code corresponds to parallel super-instructions.

#### 3.1.2 Variables

*Couillard* requires the programmer to specify how variables connect the different super-instructions. More precisely, all variables used as inputs or outputs of super-instructions must be previously declared to guarantee that data will be exchanged correctly between instructions (without loss due to wrong type castings). Also, output variables used on parallel super-instructions must be declared as follows:

`treb_parout <type> <identifier>;`

The storage class specifier treb\_parout is used because parallel super-instructions, in general, have multiple instances. Therefore, output variables of parallel super-instructions will also have multiple instances, one for each instance of the parallel super-instruction. When using a treb\_parout variable as input to another super-instruction (or even in external C-code) it is necessary to specify the instance that is being referenced. To do so, *Couillard* provides the following syntax:

```
<identifier>::< NUMBER |
* |
mytid |
(mytid + NUMBER) |
(mytid - NUMBER) |
lasttid>
```

Consider a variable named x. The notation x::0 refers to instance 0 of variable x, while x::\* refers to all instances of this variable (this provides a useful abstraction when a super-instruction can receive input from a number of sources). Also, it is often convenient to refer to the instance for the current (parallel) super-instruction. If x is used as input to another parallel super-instruction, we can select x through the expression x::mytid. To illustrate this situation, in the example of Fig. 3, each instance $k$ of **Proc-2A** ($0 \le k \le 1$, since there are 2 instances of each parallel super-instruction) receives as input c::k, produced by **Proc-1**. Expressions with + and - are also allowed with *mytid*. For example, if a parallel super-instruction X produces operand a and another parallel super-instruction Y specifies a::(mytid - 1) as input, it means that for a task i, Y.i will receive a from X.(i-1). Last, the reserved word *lasttid* refers to the last instance of a parallel super-instruction and can be used to specify inputs to both parallel and single super-instructions.
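The selector forms above can be summarized with a small Python sketch that maps the text after `::` to the producer instances it names. This is not Couillard's implementation: the function `resolve` and its parameters are hypothetical, and returning an empty list when `mytid + NUMBER` or `mytid - NUMBER` falls outside the valid range is an assumption of this sketch.

```python
import re


def resolve(selector, mytid, n_instances):
    """Return the producer instances named by the selector written after '::'."""
    s = selector.replace(" ", "")
    if s == "*":
        return list(range(n_instances))  # all instances
    if s == "lasttid":
        return [n_instances - 1]  # last instance
    if s == "mytid":
        return [mytid]  # same instance number as the consumer
    if s.isdigit():
        return [int(s)]  # fixed instance
    m = re.fullmatch(r"\(mytid([+-])(\d+)\)", s)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    offset = int(m.group(2)) if m.group(1) == "+" else -int(m.group(2))
    target = mytid + offset
    # Assumption: an out-of-range neighbour means there is no producer.
    return [target] if 0 <= target < n_instances else []


# Instance 2 of Y, with 4 instances of the producer X:
from_previous = resolve("(mytid - 1)", mytid=2, n_instances=4)
from_all = resolve("*", mytid=2, n_instances=4)
```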

For the cases where there are dependencies between instances of the same parallel super-instruction, we can specify input variables using the following construct:

```
local.<identifier>::<(mytid + NUMBER) |
    (mytid - NUMBER)>
```

For example, if we state that a parallel super-instruction s produces operand o and receives *local.o*::(mytid - 2), it means that s.i (instance i of s) depends on s.(i-2). Moreover, it means that s.0 and s.1 do not have local dependencies. We can also specify operands that will be sent only to those independent instances of s. These operands are written with the *starter.* prefix.

In the previous example, if we also define *starter.c* as an input of s, only s.0 and s.1 will receive this operand. A practical use of these constructs is to serialize distributed I/O operations in order to hide I/O latency, as explained in Section 3.4.
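The relation between local dependencies and starter operands can be checked with a short Python sketch. The helper names `local_dependencies` and `starter_targets` are hypothetical and the code only models the rule stated above: an instance whose local producer would fall outside the instance range has no local dependency, and exactly those instances receive the starter operand.

```python
def local_dependencies(n_instances, offset):
    """Map each instance i to the instance it waits for under local.<id>::(mytid + offset).

    A value of None means the instance has no local dependency.
    """
    deps = {}
    for i in range(n_instances):
        j = i + offset
        deps[i] = j if 0 <= j < n_instances else None
    return deps


def starter_targets(deps):
    """Instances without a local dependency are the ones that receive starter operands."""
    return [i for i, j in deps.items() if j is None]


# local.o::(mytid - 2) with five instances of s
deps = local_dependencies(5, -2)
starters = starter_targets(deps)
```

With an offset of -1 the same helpers describe a fully serialized chain in which only instance 0 receives the starter operand.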

The rationale for describing parallel code in super-instructions is simple. The developer first divides the code into blocks that can be run in parallel. Initialization and termination blocks will most often be single, whereas most of the parallel work will be in parallel blocks. The programmer next specifies how the blocks communicate. If the communication is purely control-based, the programmer should add an extra variable to specify this connection (a common technique in parallel programming). Note that the programmer still has to prevent data races between blocks unless speculative execution is used (which is not yet supported by the compiler).

## 3.2. Back-end

After generating an Abstract Syntax Tree (AST) of a program, *Couillard* produces its corresponding data-flow graph. From this graph, it generates three output files:

1. A .dot file describing the graph in the Graphviz [1] notation. This file is used to create an image of the graph with the Graphviz toolchain. Although a Graphviz graph is not needed by *Trebuchet*, it may be useful for academic purposes or to give programmers who want to perform manual adjustments to their applications a more intelligible view of the produced graph.
2. A .fl file describing the graph using TALM's ISA. This file is the input to *Trebuchet*'s assembler, which produces the .flb binary file that is loaded into *Trebuchet*'s virtual machine.
3. A .lib.c file describing the super-instructions as functions, in C-code, to be compiled as a dynamically linked library using any regular C-compiler. All input and output variables described with *Couillard* syntax are automatically declared and initialized within the generated function. Notice also that the super-instruction body does not need to be parsed by *Couillard*. It is just treated as the value of a super-instruction node in the AST representation. This allowed us to focus only on the instructions necessary to connect super-instructions in a coarse-grained data-flow graph.
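As an illustration of the first output, the Python sketch below writes a small data-flow graph in Graphviz DOT notation. It does not reproduce the layout of the .dot files that *Couillard* actually emits: the function `to_dot`, the node names, and the use of edge labels for operand names are assumptions for this example.

```python
def to_dot(edges, name="talm"):
    """Render (producer, consumer, operand) triples as a Graphviz digraph."""
    lines = [f"digraph {name} {{"]
    for src, dst, operand in edges:
        lines.append(f'    "{src}" -> "{dst}" [label="{operand}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical graph: one single super-instruction feeding two parallel
# instances, whose outputs are collected by a final single super-instruction.
edges = [
    ("init", "proc.0", "a"),
    ("init", "proc.1", "a"),
    ("proc.0", "end", "b::0"),
    ("proc.1", "end", "b::1"),
]
dot_text = to_dot(edges)
print(dot_text)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` tool, for example with `dot -Tpng graph.dot -o graph.png`.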

## 3.3. Auxiliary Functions and Command Line Arguments

The functions treb\_get\_tid() and treb\_get\_n\_tasks() have been added to the *Trebuchet* virtual machine and can be called inside super-instruction code. The former returns the *thread id* of that super-instruction's instance, while the latter returns the *number of threads*. Those functions can be used to identify the portion of work to be done by each instance.
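A common way to use these two values is a block partition of the input among instances. The Python sketch below only illustrates the arithmetic: inside a real super-instruction, which is written in C, `tid` and `n_tasks` would come from treb\_get\_tid() and treb\_get\_n\_tasks(), while the function `my_range` and the balanced-remainder policy are assumptions of this example.

```python
def my_range(tid, n_tasks, n_items):
    """Half-open [start, end) range of items handled by instance tid.

    tid and n_tasks stand for the values returned by treb_get_tid() and
    treb_get_n_tasks(); leftover items go to the lowest-numbered instances.
    """
    base, extra = divmod(n_items, n_tasks)
    start = tid * base + min(tid, extra)
    end = start + base + (1 if tid < extra else 0)
    return start, end


# 10 items split among 4 instances
ranges = [my_range(tid, 4, 10) for tid in range(4)]
```

Every item is covered exactly once, and the instance sizes differ by at most one item.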



Figure 2. Example of how to hide I/O latency with TALM.

In our system, applications are executed within the *Trebuchet* virtual machine. Therefore, command line argument variables cannot be declared within the application's code. They need to be passed through *Trebuchet*'s command line. *Trebuchet* stores the vector of command line arguments and the number of arguments in the treb\_superargv and treb\_superargc variables, respectively. *Couillard* declares those variables as extern when generating the .lib.c file, so that programmers can access those arguments within the super-instructions' bodies.

## 3.4. Illustrative Examples

Figure 2 provides an example of how the TALM high-level language is used to hide I/O latency in a parallel application. In this example we assume that 300 elements need to be read from a file and processed, and that the result must then be written to an output file. In pane A we can see the different steps to be performed by super-instructions (inner code not shown): (i) initialization of variables and FILE pointers, (ii) reading, (iii) processing, (iv) writing and (v) closing of files. Pane B shows the associated data-flow graph, generated by *Couillard*.



Figure 3. Example of non-linear parallel pipeline with TALM.

One can notice that the reading and writing stages are described as parallel super-instructions but, since they have local inputs, their instances will be executed serially (although spread among different PEs). This construct allows each processing task to start as soon as the corresponding read operation has finished, instead of waiting for the whole read. It also allows the results of each processing task *i* to be written without having to wait for the remaining processing tasks to finish.
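The benefit of this dependency structure can be estimated with a simple timing model, sketched below in Python. It is an idealized illustration and not a measurement: the function names and the stage durations are made up, each instance is assumed to have its own PE, and communication and interpretation costs are ignored. Read i waits for read i-1, processing i waits only for read i, and write i waits for processing i and write i-1.

```python
def pipelined_finish(n, t_read, t_proc, t_write):
    """Completion time when reads and writes are chained through local inputs."""
    read_done, proc_done, write_done = [], [], []
    for i in range(n):
        prev_read = read_done[i - 1] if i > 0 else 0
        read_done.append(prev_read + t_read)  # read i after read i-1
        proc_done.append(read_done[i] + t_proc)  # proc i after read i only
        prev_write = write_done[i - 1] if i > 0 else 0
        write_done.append(max(proc_done[i], prev_write) + t_write)
    return write_done[-1]


def staged_finish(n, t_read, t_proc, t_write):
    """Completion time when one stage reads everything, then all blocks are
    processed in parallel, then one stage writes everything."""
    return n * t_read + t_proc + n * t_write


# Hypothetical durations (arbitrary time units) for 3 blocks of data
overlapped = pipelined_finish(3, t_read=1, t_proc=5, t_write=1)
staged = staged_finish(3, t_read=1, t_proc=5, t_write=1)
```

In this model, processing of the first block overlaps with the remaining reads, and the first writes overlap with the processing of later blocks.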

Figure 3 provides an example of how to use the TALM high-level language to describe a non-linear parallel pipeline. The example is a skeleton code of an application that reads a file containing a bag of tasks to be processed and writes the results to another file. The processing phase can be divided into 3 stages (Proc-1, Proc-2 and Proc-3). The processing task Proc-2 was divided into two different tasks (Proc-2A and Proc-2B) that are executed conditionally. Figure 3 (pane A) shows the TALM annotations, while the corresponding data-flow graph for 2 threads, generated by the *Couillard* compiler, is shown in Fig. 3 (pane B).

## 4. Experiments and Results

Our goal is to obtain good performance in real applications and to evaluate TALM for complex parallel programming. We study how our model performs on two state-of-the-art benchmarks from the PARSEC [6] suite: Blackscholes and Ferret. Each experiment was executed 5 times in order to reduce discrepancies in the execution time. As parallel platform we used a machine with four AMD Six-Core Opteron™ 8425 HE (2100 MHz) chips (24 cores) and 64 GB of DDR-2 667 MHz (16x4GB) RAM, running GNU/Linux (kernel 2.6.31.5-127, 64 bits). The machine was running in multi-user mode, but no other users were on the machine.



Figure 4. Blackscholes results.

We started our study with a regular application: Blackscholes. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE). There is no closed-form expression for the Black-Scholes equation, and as such it must be computed numerically. The application reads a file containing the portfolio. The Black-Scholes partial differential equation for each option in the portfolio can be calculated independently. The application is parallelized with multiple instances of the processing thread, each responsible for a group of options. Results are then written sequentially to an output file. The PARSEC suite already comes with 3 parallel versions of the Blackscholes benchmark: OpenMP, Pthreads and TBB. We have produced a Trebuchet version of Blackscholes, following the same patterns present in the PARSEC versions to exploit parallelism. However, we observed that we can hide I/O latency and increase memory locality if we have multiple instances of the input and output threads. Thus, we have also implemented Blackscholes according to the example shown in Section 3.4, Figure 2.

Figure 4 shows the results obtained for the Blackscholes benchmark. Using the TALM language, it is possible to obtain good performance (comparable to the Pthreads implementation) in a simple fashion. Moreover, the flexibility of the language enables the programmer to achieve even better results by employing more complex parallelization techniques.

The second benchmark we considered is an irregular application called *Ferret*. This application is based on the Ferret toolkit which is used for content-based similarity search. It was developed at Princeton University, and represents emerging next-generation search engines for non-text document data types. Ferret is parallelized using the pipeline model and only a Pthreads version is provided with PARSEC. However, we had access to a TBB version of Ferret [20] which is also used in this experiment.

First, we have observed that the task size in Ferret is quite small and would result in high interpretation overheads in the virtual machine, especially when using a large number of cores, where the communication costs become more apparent. Therefore, we have adapted the application to process blocks of five images per task, instead of one.
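The grain-size adjustment amounts to grouping the work items before they are handed to the pipeline, as in the Python sketch below. The helper `batches` and the file names are hypothetical; only the block size of five comes from the text above.

```python
def batches(items, size=5):
    """Group items into consecutive blocks so that each task handles `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Hypothetical image list: 12 images become 3 tasks instead of 12
images = [f"img{i:03d}.jpg" for i in range(12)]
tasks = batches(images)
```

Fewer, larger tasks mean fewer operands exchanged through the virtual machine per image, at the cost of coarser load balancing.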



Figure 5. Ferret results.

Our parallel version of Ferret uses a pipeline pattern where the I/O stages are single super-instructions and the processing stages are parallel. We relied on our work-stealing mechanism (described in Section 2) to perform dynamic load balancing. Results presented in Fig. 5 show that our implementation with work stealing (*Treb Couillard (WS)* in the graph) obtains close to linear speedups for up to 24 cores, and in fact performs better than the TBB version and very close to the speedups achieved by the Pthreads version. One can also note that work stealing made a significant contribution to the application's performance (speedups for *Treb Couillard (no WS)* are lower).

Moreover, we have also prepared a manually fine-tuned version of Ferret, using over-subscription to rely on the operating system to perform load balancing. We ran *Trebuchet* with 3 times more PEs than the number of cores used and adjusted *Trebuchet*'s scheduling affinity mechanism to use only the cores necessary for each scenario. Results show that it is possible to surpass the performance of the Pthreads version. The minor performance gap between the high-level and the manually tuned TALM implementations could be reduced with improvements to the work-stealing mechanism and the addition of code optimization features to *Couillard*.

## 5. Related Work

Data-flow is a long-standing idea in the parallel computing community, with a vast amount of work on both pure and hybrid architectures [23, 13, 7]. Data-flow techniques are widely used in areas such as internal computer design and stream processing. Swanson's WaveScalar architecture [23] was an important influence on our work, as it is a data-flow architecture that also showed that it is possible to respect sequential semantics in the data-flow model, and therefore to run programs written in imperative languages, such as C and C++. The key idea in WaveScalar is to decouple the execution model from the memory interface, so that memory requests are issued according to the program order. To do so, WaveScalar relies on the compiler to process memory access instructions in order to guarantee the program semantics. However, the WaveScalar approach requires full data-flow hardware, which has not been achieved in practice.

Threading Building Blocks (TBB) [22] is a C++ library designed to provide an abstract layer to help programmers develop multi-threaded code. TBB enables the programmer to specify parallel tasks, which leads to a more high-level programming than implementing directly the code for threads. Another feature of TBB is the use of templates to instantiate mechanisms such as pipelines. The templates, however, have limitations. For instance, only *linear* pipelines can be described using the pipeline template.

Another project that relies on code augmentation for parallelization is DDMCPP [24]. DDMCPP is a preprocessor for the Data Driven Multithreading model [17], which, like TALM, is based on dynamic data-flow.

HMPP [10] is "an Heterogeneous Multi-core Parallel Programming environment that allows the integration of heterogeneous hardware accelerators in a seamless intrusive manner while preserving the legacy code". It provides a runtime environment, a set of compilation directives and a preprocessor, so that the programmer can specify portions of accelerator code, called codelets, that can run on GPGPUs, FPGAs, a remote machine (using MPI) or the CPU itself. Codelets are pure functions, without side-effects. Multiple codelets implemented for different hardware can exist, and the runtime environment will choose which codelet will run, according to hardware availability and the compile directives previously specified. The runtime environment is also responsible for the data transfers to and from the hardware components involved in the computation.

The Galois System [16, 15, 14] is an "object-based optimistic parallelization system for irregular applications". It comprises: (i) syntactic constructs for packing optimistic parallelism as iteration over ordered and unordered sets, (ii) a runtime system to detect unsafe accesses to shared memory and perform the necessary recovery operations and (iii) assertions about methods in class libraries. Instead of tracking memory addresses accessed by optimistic code, Galois tracks high-level semantic violations on abstract data types. For each method that accesses shared memory, the programmer needs to describe which methods can be commuted without conflicts (and under which circumstances). Galois also introduces an alternative to the commutativity checks, since they may be costly [15]. Shared data is partitioned and assigned to the different processing cores, and the system monitors whether partitions are being "touched" by concurrent threads (which would raise a conflict). Regardless of the detection method used, the programmer needs to describe, for each method that accesses shared objects, an inverse method that will be executed in case of rollback. The runtime system is in charge of detecting conflicts, calling inverse methods and commanding re-execution.

## 6. Conclusions and Future Work

We have presented the *Couillard* compiler, which compiles an extension of the C language into TALM code. Initial evaluation on state-of-the-art parallel applications showed TALM code, generated by *Couillard* and running on *Trebuchet* (a TALM implementation for multicores), to be competitive with handcrafted Pthreads and TBB code on up to 24 processors. The evaluation also shows that we can significantly improve performance simply by experimenting with the connectivity and grain of the building blocks, supporting our claim that *Couillard* provides a flexible and scalable framework for parallel computing.

Work on improving *Trebuchet* continues. Flexible scheduling is an important requirement in irregular applications, so we have been working on improving the work-stealing mechanism of the *Trebuchet* runtime environment. Moreover, placement has a strong impact on application performance and scalability. We are therefore studying efficient ways to perform automatic placement on *Trebuchet*.

We are also working on refining *Couillard* and on introducing new features to the support library. Extending *Couillard* to allow the use of templates to describe applications that fit well-known parallel patterns, and to enable the use of *Trebuchet*'s memory speculation mechanisms [19], are subjects of ongoing research. This work is based on our experience with porting actual applications to the framework. Thus, finding applications that are interesting candidates for parallelization with *Couillard* is a constant part of our research goals.

TALM's super-instructions could also be implemented for different hardware, using different languages, as in HMPP [10], as long as there is a way to call them from our virtual machine. Currently, super-instructions are compiled as functions in a dynamically linked library, but an interface to call GPGPU or FPGA accelerators and perform data transfers could also be created in our environment. This is the subject of ongoing work.

## Acknowledgements

To CAPES and the Euro-Brazilian Windows consortium for the financial support given to the authors of this work.

## References

- [1] Graphviz web-site. http://www.graphviz.org.
- [2] T. A. Alves, L. A. Marzulo, F. M. Franca, and V. S. Costa. Trebuchet: exploring TLP with dataflow virtualisation. *International Journal of High Performance Systems Architecture*, 3(2/3):137, 2011.
- [3] T. A. Alves, L. A. J. Marzulo, F. M. G. França, and V. S. Costa. Trebuchet: Explorando TLP com Virtualização DataFlow. In WSCAD-SSC'09, pages 60–67, São Paulo, Oct. 2009. SBC.
- [4] N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. *Theory of Computing Systems*, 34(2):115–144, Jan. 2001.
- [5] D. Beazley. PLY Python Lex-Yacc. http://www.dabeaz.com/ply/.
- [6] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, Jan. 2011.
- [7] D. Burger, S. Keckler, K. McKinley, M. Dahlin, L. John, C. Lin, C. Moore, J. Burrill, R. McDonald, and W. Yoder. Scaling to the end of silicon with EDGE architectures. *Computer*, 37(7):44–55, July 2004.
- [8] D. R. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.
- [9] L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. *IEEE Computational Science and Engineering*, 5(1):46–55, 1998.
- [10] R. Dolbeau, S. Bihan, and F. Bodin. HMPP: a hybrid multi-core parallel programming environment. In *First Workshop on General Purpose Processing on Graphics Processing Units*, 2007.
- [11] A. Dorta, C. Rodriguez, F. de Sande, and A. Gonzalez-Escribano. The OpenMP Source Code Repository. In 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing, pages 244–250, Washington, DC, USA, 2005. IEEE.
- [12] J. R. Gurd, C. C. Kirkham, and I. Watson. The Manchester prototype dataflow computer. *Communications of the ACM*, 28(1):34–52, Jan. 1985.
- [13] K. M. Kavi, R. Giorgi, and J. Arul. Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation. *IEEE Transactions on Computers*, 50(8):834–846, 2001.
- [14] M. Kulkarni, P. Carribault, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. P. Chew. Scheduling strategies for optimistic parallel execution of irregular programs. In *Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures*, SPAA '08, page 217, New York, NY, USA, 2008. ACM Press.
- [15] M. Kulkarni, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. P. Chew. Optimistic parallelism benefits from data partitioning. In *Proceedings of the 13th international conference on Architectural support for programming languages and operating systems*, ASPLOS XIII, pages 233–243, New York, NY, USA, 2008. ACM.
- [16] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In *Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation*, PLDI '07, page 211, New York, NY, USA, 2007. ACM Press.
- [17] C. Kyriacou, P. Evripidou, and P. Trancoso. Data-Driven Multithreading Using Conventional Microprocessors. *IEEE Transactions on Parallel and Distributed Systems*, 17(10):1176–1188, Oct. 2006.
- [18] L. A. Marzulo, F. M. Franca, and V. S. Costa. Transactional WaveCache: Towards Speculative and Out-of-Order DataFlow Execution of Memory Operations. In *2008 20th International Symposium on Computer Architecture and High Performance Computing*, pages 183–190, Oct. 2008.
- [19] L. A. J. Marzulo, T. A. Alves, F. M. G. Franca, and V. S. Costa. TALM: A Hybrid Execution Model with Distributed Speculation Support. In *Computer Architecture and High Performance Computing Workshops, International Symposium on*, pages 31–36, 2010.
- [20] A. Navarro, R. Asenjo, S. Tabik, and C. Caşcaval. Load balancing using work-stealing for pipeline parallelism in emerging applications. In *Proceedings of the 23rd international conference on Supercomputing*, ICS '09, pages 517–518, New York, NY, USA, 2009. ACM.
- [21] R. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. *IEEE Transactions on Computers*, 39(3):300–318, Mar. 1990.
- [22] J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly, 2007.
- [23] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In *Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on*, pages 291–302. IEEE Comput. Soc, 2003.
- [24] P. Trancoso, K. Stavrou, and P. Evripidou. DDMCPP: The Data-Driven Multithreading C Pre-Processor. In The 11th Workshop on Interaction between Compilers and Computer Architectures, page 32. Citeseer, 2007.