Elsevier

Parallel Computing

Volume 40, Issue 2, February 2014, Pages 116-135

A Generate-Test-Aggregate parallel programming library for systematic parallel programming

https://doi.org/10.1016/j.parco.2013.11.002

Highlights

  • We implement a language library (in Scala) for systematic parallel programming.

  • This library provides a Generate-Test-Aggregate DSL as the programming interface.

  • The GTA programs are automatically transformed to efficient MapReduce programs.

  • Our library can significantly decrease the difficulties in parallel programming.

  • Evaluation of various GTA examples shows high performance and good scalability.

Abstract

The Generate-Test-Aggregate (GTA for short) algorithm follows a simple and straightforward programming pattern for combinatorial problems: first, generate all candidates; second, test them and filter out the invalid ones; finally, aggregate the valid ones to produce the final result. These three processing steps are specified by three building blocks, namely a generator, a tester, and an aggregator. Despite the simplicity of the algorithm design, naively implementing a GTA algorithm by following the three steps literally (i.e., by brute force) results in an exponential-cost computation, which is impractical for processing large data. The theory of GTA shows that if the definitions of the generator, tester, and aggregator satisfy certain conditions, an efficient (usually near-linear cost) MapReduce program can be derived automatically from the GTA algorithm.

The principle of GTA is attractive, but making it practically useful remains an important and challenging problem because of the complexity of GTA program transformations. In this paper, we report on our study and implementation of a practical GTA library (written in the functional language Scala) that provides a systematic parallel programming approach for big-data analysis with MapReduce. The library provides a simple functional-style programming interface and hides all the internal transformations. With this library, users can write parallel programs in a sequential manner in terms of the GTA algorithm, and the efficiency of the generated MapReduce programs is guaranteed systematically. Parallel programming for many problems therefore need no longer be a tough job. We demonstrate the usefulness of our GTA library on several interesting problems involving large data and show that many applications can be solved easily and efficiently with our library.

Introduction

Google’s MapReduce [1] is a well-known parallel programming model that simplifies the parallel and distributed processing of large-scale data. Despite the simplicity of MapReduce, developing efficient MapReduce programs is still a challenge for certain optimization problems, because users must devise particular divide-and-conquer algorithms that fit the execution model of MapReduce.

As an example, consider the well-known 0–1 Knapsack problem: fill a knapsack with items, each with a certain value $v_i$ and weight $w_i$, such that the total value of the packed items is maximal while adhering to the weight restriction $W$ of the knapsack. This problem can be formulated as:

$$\text{maximize} \quad \sum_{i=1}^{n} v_i x_i \qquad \text{subject to} \quad \sum_{i=1}^{n} w_i x_i \le W, \quad x_i \in \{0, 1\}$$

However, designing an efficient MapReduce algorithm for the Knapsack problem is difficult for many programmers because the above formula does not directly match the MapReduce model. Moreover, designing an algorithm for the Knapsack problem with additional conditions is even more difficult.

The theory of GTA has been proposed [2], [3] to remedy this situation. It synthesizes efficient MapReduce programs (i.e., parallel and scalable programs) for a general class of problems that can be specified naively in terms of generate, test, and aggregate: first generate all possible solution candidates, then keep those candidates that pass a test of certain conditions, and finally select the best solution or summarize the valid solutions with an aggregating computation. For instance, the Knapsack problem could be specified by a GTA program like this: generate all possible selections of items, keep those that satisfy the constraint on total weight, and then select the one that has the maximum sum of values. Note that directly implementing such an algorithm with MapReduce is not practical because, given $n$ items, the naive program generates $O(2^n)$ possible selections. The theory of GTA gives an algorithmic way to synthesize, from such a naive program, a fully parallelized MapReduce program that has $O(n)$ work efficiency.
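To make the contrast concrete, the following Python sketch writes the Knapsack specification twice. It is an illustration only and does not use the paper's Scala library or reproduce its transformation: the function names, the `(weight, value)` item layout, and the assumption of non-negative integer weights are all hypothetical. `knapsack_naive` executes generate, test, and aggregate literally; `knapsack_fused` is a hand-written program of the general shape such a derivation aims for, where each item is mapped to a small table indexed by total weight and tables are merged with an associative operator.

```python
from functools import reduce
from itertools import combinations

NEG_INF = float("-inf")


def knapsack_naive(items, capacity):
    """Generate-Test-Aggregate executed literally (exponential in len(items))."""
    # Generate: every selection (sublist) of the items.
    candidates = (
        selection
        for k in range(len(items) + 1)
        for selection in combinations(items, k)
    )
    # Test: keep selections whose total weight fits in the knapsack.
    valid = (s for s in candidates if sum(w for w, _ in s) <= capacity)
    # Aggregate: maximum total value among the valid selections.
    return max(sum(v for _, v in s) for s in valid)


def single(item, capacity):
    """table[w] = best value of a selection from [item] with total weight exactly w."""
    weight, value = item
    table = [NEG_INF] * (capacity + 1)
    table[0] = 0  # leave the item out
    if weight <= capacity:
        table[weight] = max(table[weight], value)  # take the item
    return table


def combine(left, right):
    """Merge the tables of two adjacent sublists (associative, so it can run in parallel)."""
    capacity = len(left) - 1
    merged = [NEG_INF] * (capacity + 1)
    for wl, vl in enumerate(left):
        if vl == NEG_INF:
            continue
        for wr in range(capacity - wl + 1):
            vr = right[wr]
            if vr != NEG_INF:
                merged[wl + wr] = max(merged[wl + wr], vl + vr)
    return merged


def knapsack_fused(items, capacity):
    identity = [0] + [NEG_INF] * capacity  # table of the empty item list
    tables = map(lambda item: single(item, capacity), items)  # "map" phase
    return max(reduce(combine, tables, identity))  # "reduce" phase


items = [(2, 3), (3, 4), (4, 5), (5, 6)]  # (weight, value) pairs
print(knapsack_naive(items, 5), knapsack_fused(items, 5))  # prints: 7 7
```

In this sketch each `combine` costs $O(W^2)$, so for a fixed capacity $W$ the total work grows linearly with the number of items, whereas the naive version enumerates all $2^n$ selections.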

The previous work [2], [3] described the GTA programming style and the GTA fusion theorems theoretically, but it did not address implementation. Because of the gap between mathematical concepts and practical programming languages, it is non-trivial to implement the GTA theory in a way that yields both powerful optimization and a clean programming interface. Moreover, more work on GTA is needed to identify its real capabilities for practical parallel programming and to provide sufficient guidance for new users.

In this paper we present our implementation of a lightweight GTA library (in Scala [4]), a functional programming platform that allows users to write GTA programs and execute them on local machines or large computer clusters. Our main technical contribution is twofold. First, we design a generic programming interface that allows users to write their programs in a sequential manner following the GTA programming style, without needing to know the theoretical details of GTA or parallel programming. The GTA library takes responsibility for transforming user-specified programs into efficient MapReduce programs and executing them on practical MapReduce engines. Second, we demonstrate the usefulness of our GTA library with many interesting examples and show that many problems can be solved easily and efficiently with our library.

The rest of the paper is organized as follows. After explaining the background in Section 2, we introduce the programming interface of our GTA library in Section 3. Section 4 describes the implementation of the library in detail. More examples and details about GTA programming are introduced in Section 5. Then, we describe the experimental results in Section 6. Related work is discussed in Section 7. Finally, we conclude the paper and highlight future work in Section 8. The source code used for our experiments is available online.

Section snippets

Background

In this section we briefly review the concepts of GTA [2], [3] as well as its background knowledge, list homomorphism [5], [6], [7] and MapReduce [1]. The notation we use to formally describe algorithms is based on the functional programming language Haskell [5]. Function application can be written without parentheses, i.e., f a equals f(a). Functions are curried [5], and function application is left associative; thus, f a b equals (f a) b. Function application has higher precedence than operators,

Programming interface

Our library provides an easy-to-use programming interface for users to write GTA expressions using the GTA components, i.e., generators, testers, and aggregators. Fig. 2 shows a schematic of how the library automatically transforms a user-specified GTA program into an efficient MapReduce program and executes it. The transformation has two phases: in the first phase, a user-specified GTA program is transformed to an instance of MapReduceable, a Scala trait adapting the list homomorphism to the

Implementations

We chose Scala to implement our library not only because it is a functional language with a flexible syntax and a strong type system, but also because of its performance and portability (Scala is JVM-based, so it is compatible with most popular Java systems). We used Spark [10] as the MapReduce engine because it is implemented in Scala and can be seen as an alternative to the Hadoop [9] framework.

More examples for GTA programming

In this section we illustrate GTA programming techniques in more detail. The examples presented below use more complex generators, testers, and aggregators.

Maximum segment sum problem. Let us consider the famous maximum segment sum (MSS for short) problem [13], [14], [15], [16], [17], [18]: given a list of integers, find the maximum sum over all of its segments (contiguous sublists). This is a simplified version of the problem of finding an optimal period in a history of changing values.
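As a rough illustration of the problem statement, the Python sketch below computes MSS in two ways. It is not the paper's GTA program or the library's Scala interface; all names are hypothetical, and segments are assumed to be non-empty. `mss_naive` generates every segment and aggregates with `max`, while `mss_homomorphism` uses a commonly used four-value summary (best segment sum, best prefix sum, best suffix sum, total) whose merge operator is associative, so chunks of the list can be summarized independently and merged afterwards.

```python
from functools import reduce


def mss_naive(xs):
    """Generate all non-empty segments, then aggregate with max (cubic time)."""
    segments = (
        xs[i:j]
        for i in range(len(xs))
        for j in range(i + 1, len(xs) + 1)
    )
    return max(sum(segment) for segment in segments)


def single(x):
    """Summary of a one-element list: (best segment, best prefix, best suffix, total)."""
    return (x, x, x, x)


def combine(left, right):
    """Merge the summaries of two adjacent sublists."""
    m1, p1, t1, s1 = left
    m2, p2, t2, s2 = right
    return (
        max(m1, m2, t1 + p2),  # best segment may straddle the boundary
        max(p1, s1 + p2),      # best prefix may extend into the right part
        max(t2, t1 + s2),      # best suffix may extend into the left part
        s1 + s2,               # total sum
    )


def mss_homomorphism(xs, chunk_size=2):
    """Summarize each chunk independently, then merge the partial summaries."""
    chunks = [xs[i:i + chunk_size] for i in range(0, len(xs), chunk_size)]
    # Each chunk could be handled by a separate worker.
    partials = [reduce(combine, map(single, chunk)) for chunk in chunks]
    return reduce(combine, partials)[0]


xs = [3, -4, 2, -1, 6, -3]
print(mss_naive(xs), mss_homomorphism(xs))  # prints: 7 7
```

Both functions assume a non-empty input list. Because `combine` is associative, the result does not depend on how the list is split into chunks.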

The GTA algorithm for MSS

Performance evaluation

We evaluated our GTA library on sequential and parallel (distributed) models, and the results demonstrated the efficiency and scalability of GTA programs.

Related work

The research on parallelization via derivation of list homomorphisms has gained great interest [6], [23], [24]. The main approaches include the function composition based method [25], [26], [27], the third homomorphism theorem based method [28], [15], and the matrix multiplication based method [29]. It has been shown that homomorphism-based approaches can be used in systematic programming of MapReduce [11].

GTA [2], [3] is a new approach to systematic development of efficient parallel programs

Conclusions

We showed that the theory of Generate-Test-Aggregate for systematic derivation of efficient parallel programs can be implemented on MapReduce in a concise and effective way. Our GTA library provides a convenient GTA programming interface for users to express their problems in the GTA pattern easily, and to run these GTA programs efficiently in multi-threaded, Spark, and Hadoop environments. Our initial experimental results on several interesting examples indicate the GTA library’s usefulness in

Acknowledgement

This work was partially supported by JSPS KAKENHI Grant No. 24700025.

References (31)

  • J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: 6th Symposium on Operating System...
  • K. Emoto et al., Generate, test, and aggregate – a calculation-based framework for systematic parallel programming with MapReduce
  • K. Emoto et al., Filter-embedding semiring fusion for programming with MapReduce, Form. Aspects Comput. (2012)
  • M. Odersky, et al., The Scala language specification, version 2.9, Tech. rep., EPFL Lausanne, Switzerland, 2011....
  • R.S. Bird, Introduction to Functional Programming using Haskell (1998)
  • M. Cole, Parallel programming with list homomorphisms, Parallel Process. Lett. (1995)
  • S. Gorlatch, Systematic extraction and implementation of divide-and-conquer parallelism
  • Z. Hu et al., Formal derivation of efficient parallel programs by construction of list homomorphisms, ACM Trans. Program. Lang. Syst. (1997)
  • Apache software foundation, Hadoop....
  • M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica, Resilient...
  • Y. Liu et al., Towards systematic parallel programming over MapReduce, Proceedings of the 17th International Euro-Par Conference on Parallel Processing: Part II, Euro-Par’11 (2011)
  • M. Cole, List homomorphic parallel algorithms for bracket matching, Department of Computer Science, University of...
  • Z. Hu et al., Formal derivation of parallel program for 2-dimensional maximum segment sum problem
  • S.-C. Mu, Maximum segment sum is back: deriving algorithms for two segment problems with bounded lengths, Proceedings of the 2008 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation, PEPM ’08 (2008)
  • K. Morita et al., Automatic inversion generates divide-and-conquer parallel programs (2007)