
Parallel Computing

Volume 32, Issues 7–8, September 2006, Pages 604-615

Quaff: efficient C++ design for parallel skeletons

https://doi.org/10.1016/j.parco.2006.06.001

Abstract

We present Quaff, a new skeleton-based parallel programming library. Its main originality is its reliance on C++ template meta-programming techniques to achieve high efficiency. In particular, by performing most skeleton instantiation and optimization at compile time, Quaff keeps the overhead traditionally associated with object-oriented implementations of skeleton-based parallel programming libraries very small. This is not achieved at the expense of expressivity, as demonstrated in this paper by several applications, including a full-fledged, realistic real-time vision application.

Introduction

Modern parallel programming on MIMD machines is usually carried out using message-passing libraries. These libraries, such as PVM or MPI, provide a standardized, platform-independent way to build parallel applications. However, manipulating such low-level frameworks is difficult and error-prone. Deadlocks and other common undesired behaviors make parallel software development very slow compared to classic sequential development. Algorithmic skeletons [5], [6] have been proposed as a solution to these problems. Skeletons are recurring parallel patterns that can be implemented once on a given platform. Building parallel software using algorithmic skeletons then boils down to combining skeletons and sequential code fragments. In practice, approaches based upon skeletons can be divided into three main classes:

  • New languages embedding skeleton patterns in their syntax. This approach can offer good performance but requires the programmer to learn a new language, which can be viewed as an obstacle to the adoption of this paradigm.

  • Parallel compilers for an existing language. Such compilers aim at identifying parallel structures in existing sequential code and using specific parallel implementations for those structures [20].

  • High-level libraries for an existing language. Such solutions are more easily accepted by developers as they allow them to reuse existing code and to work with a familiar environment.

We focus here on approaches based upon high-level libraries for an existing language. This is the way taken, for example, by the BSMLlib [17], Lithium [9], [2], eSkel [6] and Muesli [15] projects. The most challenging issue for such a library is to find a good trade-off between readability and efficiency. By readability we mean the ability to express parallelism with a minimum concern for implementation details. By efficiency we mean the ability to produce code whose performance stays on a par with that obtained using a low-level message-passing library such as MPI. These features are clearly in tension with each other. For example, eSkel can produce very efficient code, but at the expense of a rather low-level API (with many MPI-specific idioms visible). By contrast, a library such as Muesli exposes a much more abstract and simple API but incurs a significantly higher overhead at runtime. This overhead is mostly explained by the fact that the library uses an abstract class hierarchy to embed user-defined tasks within the skeleton structure of the application, resulting in repeated virtual function calls at run-time. This is unfortunate because, within skeleton-based parallel programming models, the overall structure of the application, i.e., the combination of parallel skeletons and user-defined sequential functions, is essentially static.

The library described in this paper, named Quaff, aims at reducing the aforementioned tension. For this, it relies on C++ compilation techniques such as template meta-programming to reduce the runtime overhead of classical object-oriented implementations of skeleton-based parallel programming libraries to the strict minimum, while keeping a high level of expressivity and readability. Quaff also promotes the reuse of legacy code and third-party libraries by limiting its impact on existing code.

This paper is organized as follows. The Quaff programming model is introduced in Section 2, with simple examples. Section 3 presents the implementation, with Quaff, of a full-fledged vision application, showing its ability to handle complex, realistic situations while still producing very efficient code. Section 4 details the implementation techniques used for turning Quaff programs into optimized MPI code. Section 5 is a short review of related work, and Section 6 concludes the paper.

Section snippets

A first example

In this section, we describe the encoding, with Quaff, of a very simple application which performs matrix multiplication using a classical domain decomposition approach with a scm (split-compute-merge) skeleton. The Quaff code of the application appears in Listing 1. The corresponding task graph is given in Fig. 1.

Listing 1: Sample Quaff application

// User-defined task registration
typedef task<CSlice, none_t, matrix> slice;
typedef function(FMatMul, matrix, matrix) mul;
typedef task<

Experimental results

Section 2 has demonstrated the ability of Quaff to produce efficient code for simple applications. We now focus on expressivity issues. In this section the implementation of a realistic application with Quaff is presented. By realistic, we mean an application solving a “real” problem, in contrast to the code samples in Section 2, whose goal is only to illustrate and demonstrate programming features. This application, taken from the computer vision domain, performs real-time 3D reconstruction

Quaff implementation

As stated in Section 2, the major issue with class-based libraries is the high overhead induced by virtual function calls. Recent compilers are able to reduce this overhead by performing various aggressive optimizations. However, none of them is able to optimize such code across function or method boundaries. In the classic polymorphic library model, this leads to very efficient code at the function level, but poor performance at the program level.

One solution is to write code that forces the

Related work

As stated in Section 1, Quaff is a library-based approach to skeleton-based parallel programming. The most significant projects related to this approach are BSMLlib, Lithium, eSkel and Muesli.

BSMLlib [17] is a library for integrating Bulk Synchronous Parallel (BSP) programming in a functional language (Objective Caml). It extends the underlying lambda-calculus with parallel operations on parallel data structures. Being based on a formal operational semantics, it can be used to predict execution

Conclusion

In this paper, we have introduced Quaff, a skeleton-based parallel programming library written in C++. Compared to similar works, Quaff drastically reduces the runtime overhead by performing most of the skeleton expansion and optimization at compile time. This is carried out by relying on template-based meta-programming techniques, without compromising expressivity or readability. This has been demonstrated on several realistic, complex vision applications, including a real-time 3D

References (24)

  • M. Aldinucci et al., An advanced environment supporting structured parallel programming in Java, Future Generation Computer Systems (2003)
  • M. Cole, Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming, Parallel Computing (2004)
  • D. Abrahams et al., C++ Template Metaprogramming: Concepts, Tools and Techniques from Boost and Beyond (2004)
  • A. Alexandrescu, Modern C++ Design: Generic Programming and Design Patterns Applied (2001)
  • B. Bacci et al., P3L: a structured high level programming language and its structured support, Concurrency: Practice and Experience (1995)
  • M. Cole
  • M. Cole, A. Benoit, Using eSkel to implement the multiple baseline stereo application, in: ParCo 2005, Malaga, Spain,...
  • F. Dabrowski et al., Functional Bulk Synchronous Programming in C++
  • M. Danelutto, P. Teti, Lithium: a structured parallel programming environment in Java, in: Proceedings of Computational...
  • J. Falcou, J. Sérot, T. Chateau, J.-T. Lapresté, Real time parallel implementation of a particle filter based visual...
  • A. Fusiello et al., A compact algorithm for rectification of stereo pairs, Machine Vision and Applications (2000)
  • M. Hamdan, G. Michaelson, P. King, A scheme for nesting algorithmic skeletons, in: Proceedings of the 10th...