A Generate-Test-Aggregate parallel programming library for systematic parallel programming
Introduction
Google’s MapReduce [1] is a well-known parallel programming model that simplifies the parallel and distributed processing of large-scale data. Despite the simplicity of MapReduce, developing efficient MapReduce programs remains a challenge for certain optimization problems, because users must devise divide-and-conquer algorithms that fit the execution model of MapReduce.
As an example, consider the well-known 0–1 Knapsack problem: fill a knapsack with items, each with a certain value $v_i$ and weight $w_i$, such that the total value of the packed items is maximal while respecting the weight restriction $W$ of the knapsack. This problem can be formulated as:

$$\text{maximize } \sum_{i=1}^{n} v_i x_i \quad \text{subject to } \sum_{i=1}^{n} w_i x_i \le W, \quad x_i \in \{0, 1\}.$$

However, designing an efficient MapReduce algorithm for the Knapsack problem is difficult for many programmers because the above formula does not directly match the MapReduce model. Moreover, designing an algorithm for the Knapsack problem with additional conditions is even more difficult.
The theory of Generate-Test-Aggregate (GTA) has been proposed [2], [3] to remedy this situation. It synthesizes efficient MapReduce programs (i.e., parallel and scalable programs) for a general class of problems that can be specified naively in terms of generate, test, and aggregate: first generate all possible solution candidates, then keep those candidates that pass a test of certain conditions, and finally select the best solution or summarize the valid solutions with an aggregating computation. For instance, the Knapsack problem can be specified by a GTA program like this: generate all possible selections of items, keep those that satisfy the total-weight constraint, and then select the one with the maximum sum of values. Note that directly implementing such an algorithm in MapReduce is not practical, because given n items, the naive program generates $2^n$ possible selections. The theory of GTA gives an algorithmic way to synthesize, from such a naive program, a fully parallelized and work-efficient MapReduce program.
The previous work [2], [3] described the GTA programming style and the GTA fusion theorems theoretically, but said nothing about implementation. Because of the gap between mathematical concepts and practical programming languages, it is non-trivial to implement the GTA theory in a way that yields both powerful optimization and a convenient programming interface. Moreover, more work on GTA is needed to identify its real capabilities for practical parallel programming and to provide adequate guidance for new users.
In this paper we present our implementation of a lightweight GTA library (in Scala [4]), a functional programming platform that allows users to write GTA programs and execute them on local machines or large computer clusters. Our main technical contribution is twofold. First, we design a generic programming interface that allows users to write their programs in a sequential manner following the GTA programming style, without needing to know the theoretical details of GTA or parallel programming. The GTA library takes responsibility for transforming user-specified programs into efficient MapReduce programs and executing them on practical MapReduce engines. Second, we demonstrate the usefulness of our GTA library with many interesting examples and show that many problems can be solved easily and efficiently with our library.
The rest of the paper is organized as follows. After explaining the background in Section 2, we introduce the programming interface of our GTA library in Section 3. Section 4 describes the implementation of the library in detail, and Section 5 introduces more examples and details of GTA programming. We then describe the experimental results in Section 6 and discuss related work in Section 7. Finally, we conclude the paper and highlight future work in Section 8. The source code used for our experiments is available online.
Background
In this section we briefly review the concepts of GTA [2], [3] as well as its background knowledge: list homomorphisms [5], [6], [7] and MapReduce [1]. The notation we use to formally describe algorithms is based on the functional programming language Haskell [5]. Function application can be written without parentheses, i.e., $f\ a$ equals $f(a)$. Functions are curried [5], and function application is left associative; thus, $f\ a\ b$ equals $(f\ a)\ b$. Function application has higher precedence than operators,
Programming interface
Our library provides an easy-to-use programming interface for writing GTA expressions from the GTA components, i.e., generators, testers, and aggregators. Fig. 2 shows schematically how the library automatically transforms a user-specified GTA program into an efficient MapReduce program and executes it. The transformation has two phases: in the first phase, a user-specified GTA program is transformed to an instance of a Scala trait that adapts the list homomorphism to the
Implementations
We chose Scala to implement our library not only because it is a functional language with a flexible syntax and a strong type system, but also because of its performance and portability (Scala is JVM-based, so it is compatible with most popular Java systems). We used Spark [10] as the MapReduce engine because it is implemented in Scala and can be seen as an alternative to the Hadoop [9] framework.
More examples for GTA programming
In this section we illustrate GTA programming techniques in more detail. The examples presented below use more complex generators, testers, and aggregators.
Maximum segment sum problem. Let us consider the famous maximum segment sum (MSS for short) problem [13], [14], [15], [16], [17], [18]: given a list of integers, find the maximum sum over all of its segments (contiguous sublists). This is a simplified version of the problem of finding an optimal period in a history of changing values.
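As an illustration of the problem statement, the sketch below (in Python, used here only for exposition) gives a naive generate-and-aggregate specification next to the classic hand-derived list homomorphism for the same problem. The function names are hypothetical, segments are assumed to be non-empty, and the homomorphism shown is the textbook four-value one, which is not necessarily the program the GTA library derives.

```python
from functools import reduce

# --- Naive specification: generate all segments, aggregate by max sum ---

def segments(xs):
    """Generate every non-empty contiguous sublist of xs."""
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield xs[i:j]

def mss_naive(xs):
    return max(sum(seg) for seg in segments(xs))

# --- List homomorphism: summarize a chunk by four numbers ---------------
# (best segment sum, best prefix sum, best suffix sum, total sum)

def single(x):
    return (x, x, x, x)

def combine(a, b):
    """Merge the summaries of two adjacent chunks (associative)."""
    m1, p1, s1, t1 = a
    m2, p2, s2, t2 = b
    return (
        max(m1, m2, s1 + p2),  # best segment may straddle the boundary
        max(p1, t1 + p2),      # best prefix may extend into the right chunk
        max(s2, s1 + t2),      # best suffix may extend into the left chunk
        t1 + t2,
    )

def summarize(xs):
    """Summary of a non-empty chunk."""
    return reduce(combine, map(single, xs))

def mss_homomorphism(xs):
    return summarize(xs)[0]

xs = [3, -4, 2, -1, 6, -3]
print(mss_naive(xs), mss_homomorphism(xs))  # 7 7
```

Since `combine` is associative, the list can be split at any point, each part summarized independently, and the summaries merged, for example `combine(summarize(xs[:2]), summarize(xs[2:]))`. That independence is what allows a linear-time, parallel evaluation in place of enumerating all segments.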
The GTA algorithm for
Performance evaluation
We evaluated our GTA library on sequential and parallel (distributed) models, and the results demonstrated the efficiency and scalability of GTA programs.
Related work
The research on parallelization via derivation of list homomorphisms has gained great interest [6], [23], [24]. The main approaches include the function composition based method [25], [26], [27], the third homomorphism theorem based method [28], [15], and the matrix multiplication based method [29]. It has been shown that homomorphism-based approaches can be used in systematic programming of MapReduce [11].
GTA [2], [3] is a new approach to systematic development of efficient parallel programs
Conclusions
We showed that the theory of Generate-Test-Aggregate for systematic derivation of efficient parallel programs can be implemented on MapReduce in a concise and effective way. Our GTA library provides a convenient GTA programming interface that lets users express their problems in the GTA pattern easily and run these GTA programs efficiently in multi-threaded, Spark, and Hadoop environments. Our initial experimental results on several interesting examples indicate the GTA library’s usefulness in
Acknowledgement
This work was partially supported by JSPS KAKENHI Grant No. 24700025.
References (31)
- J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: 6th Symposium on Operating System...
- et al., Generate, test, and aggregate – a calculation-based framework for systematic parallel programming with MapReduce
- et al., Filter-embedding semiring fusion for programming with MapReduce, Form. Aspects Comput. (2012)
- M. Odersky, et al., The Scala language specification, version 2.9, Tech. rep., EPFL Lausanne, Switzerland, 2011....
- Introduction to Functional Programming using Haskell (1998)
- Parallel programming with list homomorphisms, Parallel Process. Lett. (1995)
- Systematic extraction and implementation of divide-and-conquer parallelism
- et al., Formal derivation of efficient parallel programs by construction of list homomorphisms, ACM Trans. Program. Lang. Syst. (1997)
- Apache software foundation, Hadoop....
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M.J. Franklin, S. Shenker, I. Stoica, Resilient...