
Parallel Computing

Volume 32, Issues 7–8, September 2006, Pages 604-615

Quaff: efficient C++ design for parallel skeletons

https://doi.org/10.1016/j.parco.2006.06.001

Abstract

We present Quaff, a new skeleton-based parallel programming library. Its main originality is its reliance on C++ template meta-programming techniques to achieve high efficiency. In particular, by performing most skeleton instantiation and optimization at compile time, Quaff keeps the overhead traditionally associated with object-oriented implementations of skeleton-based parallel programming libraries very small. This is not achieved at the expense of expressivity, as demonstrated in this paper by several applications, including a full-fledged, realistic real-time vision application.

Introduction

Modern parallel programming on MIMD machines is usually carried out using message-passing libraries. These libraries, such as PVM or MPI, provide a standardized, platform-independent way to build parallel applications. However, manipulating such low-level frameworks is difficult and error-prone. Deadlocks and other common undesired behaviors make parallel software development very slow compared to classic sequential development. Algorithmic skeletons [5], [6] have been proposed as a solution to these problems. Skeletons are recurring parallel patterns that can be implemented once on a given platform. Building parallel software using algorithmic skeletons then boils down to combining skeletons and sequential code fragments. In practice, approaches based upon skeletons can be divided into three main classes:

  • New languages embedding skeleton patterns in their syntax. This approach can offer good performance but requires the programmer to learn a new language, which can be viewed as an obstacle to the adoption of this paradigm.

  • Parallel compilers for an existing language. Such compilers aim at identifying parallel structures in existing sequential code and using specific parallel implementations for those structures [20].

  • High-level libraries for an existing language. Such solutions are more easily accepted by developers as they allow them to reuse existing code and to work with a familiar environment.

We focus here on approaches based upon high-level libraries for an existing language. This is the way taken, for example, by the BSMLlib [17], Lithium [9], [2], eSkel [6] and Muesli [15] projects. The most challenging issue for such a library is to find a good trade-off between readability and efficiency. By readability we mean the ability to express parallelism with a minimum concern for implementation details. By efficiency we mean the ability to produce code whose performance stays on a par with that obtained using a low-level message-passing library such as MPI. These features are clearly in tension with each other. For example, eSkel can produce very efficient code, but at the expense of a rather low-level API (with many MPI-specific idioms visible). By contrast, a library such as Muesli exposes a much more abstract and simple API but incurs a significantly higher overhead at runtime. This overhead is mostly explained by the fact that the library uses an abstract class hierarchy to embed user-defined tasks within the skeleton structure of the application, resulting in repeated virtual function calls at run-time. This is unfortunate because, within skeleton-based parallel programming models, the overall structure of the application, i.e., the combination of parallel skeletons and user-defined sequential functions, is essentially static.

The library described in this paper, named Quaff, aims at reducing the aforementioned tension. For this, it relies on C++ compilation techniques such as template meta-programming to reduce the runtime overhead of classical object-oriented implementations of skeleton-based parallel programming libraries to the strict minimum, while keeping a high level of expressivity and readability. Quaff also promotes the reuse of legacy code and third-party libraries by limiting its impact on existing code.

This paper is organized as follows. The Quaff programming model is introduced in Section 2, with simple examples. Section 3 presents the implementation, with Quaff, of a full-fledged vision application, showing its ability to handle complex, realistic situations while still producing very efficient code. Section 4 details the implementation techniques used for turning Quaff programs into optimized MPI code. Section 5 is a short review of related work, and Section 6 concludes the paper.

Section snippets

A first example

In this section, we describe the encoding, with Quaff, of a very simple application which performs matrix multiplication using a classical domain decomposition approach with a scm (split-compute-merge) skeleton. The Quaff code of the application appears in Listing 1. The corresponding task graph is given in Fig. 1.

Listing 1: Sample Quaff application

// User-defined task registration
typedef task<CSlice, none_t, matrix> slice;
typedef function(FMatMul, matrix, matrix) mul;
typedef task<

Experimental results

Section 2 has demonstrated the ability of Quaff to produce efficient code for simple applications. We now focus on expressivity issues. In this section the implementation of a realistic application with Quaff is presented. By realistic, we mean an application solving a “real” problem, in contrast to the code samples in Section 2, whose goal is only to illustrate and demonstrate programming features. This application, taken from the computer vision domain, performs real-time 3D reconstruction

Quaff implementation

As stated in Section 2, the major issue with class-based libraries is the high overhead induced by virtual function calls. Recent compilers are able to reduce this overhead by performing various aggressive optimizations. However, none of them is able to optimize such code across function or method boundaries. In the classic polymorphic library model, this leads to very efficient code at the function level, but poor performance at the program level.

One solution is to write code that forces the

Related work

As stated in Section 1, Quaff is a library-based approach to skeleton-based parallel programming. The most significant projects related to this approach are BSMLlib, Lithium, eSkel and Muesli.

BSMLlib [17] is a library for integrating Bulk Synchronous Parallel (BSP) programming in a functional language (Objective Caml). It extends the underlying lambda-calculus with parallel operations on parallel data structures. Being based on a formal operational semantics, it can be used to predict execution

Conclusion

In this paper, we have introduced Quaff, a skeleton-based parallel programming library written in C++. Compared to similar works, Quaff drastically reduces the runtime overhead by performing most of the skeleton expansion and optimization at compile time. This is carried out by relying on template-based meta-programming techniques, without compromising expressivity or readability. This has been demonstrated on several realistic, complex vision applications, including a real-time 3D

References (24)

  • M. Aldinucci et al., An advanced environment supporting structured parallel programming in Java, Future Generation Computer Systems (2003)
  • M. Cole, Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming, Parallel Computing (2004)
  • D. Abrahams et al., C++ Template Metaprogramming: Concepts, Tools and Techniques from Boost and Beyond (2004)
  • A. Alexandrescu, Modern C++ Design: Generic Programming and Design Patterns Applied (2001)
  • B. Bacci et al., P3L: a structured high level programming language and its structured support, Concurrency: Practice and Experience (1995)
  • M. Cole
  • M. Cole, A. Benoit, Using eSkel to implement the multiple baseline stereo application, in: ParCo 2005, Malaga, Spain,...
  • F. Dabrowski et al., Functional Bulk Synchronous Programming in C++
  • M. Danelutto, P. Teti, Lithium: a structured parallel programming environment in Java, in: Proceedings of Computational...
  • J. Falcou, J. Sérot, T. Chateau, J.-T. Lapresté, Real time parallel implementation of a particle filter based visual...
  • A. Fusiello et al., A compact algorithm for rectification of stereo pairs, Machine Vision and Applications (2000)
  • M. Hamdan, G. Michaelson, P. King, A scheme for nesting algorithmic skeletons, in: Proceedings of the 10th...