On scalable parallel recursive backtracking

https://doi.org/10.1016/j.jpdc.2015.07.006

Highlights

  • A simple framework for parallelizing exact search-tree algorithms.

  • An indexing scheme for simple task transmission and efficient communication.

  • Efficient and effective extraction of heavy tasks for dynamic load balancing.

  • The presented method scales almost linearly to a record number of processing elements.

Abstract

Supercomputers are equipped with an increasingly large number of cores, providing the computational power needed to solve problems that are otherwise intractable. Unfortunately, getting serial algorithms to run in parallel to take advantage of these computational resources remains a challenge for several application domains. Many parallel algorithms can scale to only hundreds of cores. The limiting factors of such algorithms are usually communication overhead and poor load balancing. Solving NP-hard graph problems to optimality using exact algorithms is an example of an area in which there has so far been limited success in obtaining large-scale parallelism. Many of these algorithms use recursive backtracking as their core solution paradigm. In this paper, we propose a lightweight, easy-to-use, scalable approach for transforming almost any recursive backtracking algorithm into a parallel one. Our approach incurs minimal communication overhead and guarantees a load-balancing strategy that is implicit, i.e., does not require any problem-specific knowledge. The key idea behind our approach is the use of efficient traversal operations on an indexed search tree that is oblivious to the problem being solved. We test our approach with parallel implementations of algorithms for the well-known Vertex Cover and Dominating Set problems. On sufficiently hard instances, experimental results show nearly linear speedups for thousands of cores, reducing running times from days to just a few minutes.

Introduction

Parallel computation is becoming increasingly important as single-processor performance levels off and the transistor gains of Moore’s law are instead delivered as additional cores. This paradigm shift means that attaining speedup requires software implementing algorithms that can run in parallel on multiple processors/cores. Today we have a growing list of supercomputers with tremendous processing power. Some of these systems include more than a million computing cores and can achieve up to 30 Petaflop/s. The constant increase in the number of processors/cores per supercomputer motivates the development of parallel algorithms that can efficiently utilize such processing infrastructures. Unfortunately, migrating known serial algorithms to exploit parallelism while maintaining scalability is not straightforward. The overheads introduced by parallelism are often hard to evaluate, and fair load balancing is possible only when accurate estimates of task “hardness” or “weight” can be calculated on the fly. Providing such estimates usually requires problem-specific knowledge, rendering the techniques developed for one problem useless when trying to parallelize an algorithm for another.

As it is unlikely that polynomial-time algorithms can be found for NP-hard problems, the search for fast deterministic algorithms could benefit greatly from the processing capabilities of supercomputers. Researchers working in the area of exact algorithms have developed algorithms yielding lower and lower running times [5], [15], [6], [14], [21]. However, the major focus has been on improving the asymptotic worst-case behavior of algorithms. The practical possibility of exploiting parallel infrastructures has received much less attention.

Most existing exact algorithms for NP-hard graph problems follow the well-known branch-and-reduce paradigm. A branch-and-reduce algorithm searches the complete solution space of a given problem for an optimal solution. Simple enumeration is usually prohibitively expensive due to the exponentially increasing number of potential solutions. To prune parts of the solution space, an algorithm uses reduction rules derived from bounds on the function to be optimized and the value of the current best solution. The reader is referred to Woeginger’s excellent survey paper on exact algorithms for further details [25]. At the implementation level, branch-and-reduce algorithms translate to search-tree-based recursive backtracking algorithms. The search tree size usually grows exponentially with either the size of the input instance n or some integer parameter k when the problem is fixed-parameter tractable [11].

Nevertheless, search trees are good candidates for parallel decomposition. While most divide-and-conquer methods for parallel algorithms aim at partitioning a problem instance among the cores, we partition the search space of the problem instead. Given c cores or processing elements, a brute-force parallel solution would divide a search tree into c subtrees and assign each subtree to a separate core for sequential processing. One might hope to thus reduce the overall running time by a factor of c. However, this intuitive approach suffers from several drawbacks, including the obvious lack of load balancing.
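As a rough sketch of this brute-force decomposition (illustrative C; solve_subtree and the round-robin assignment are our own names, not the paper's method), each of the c cores could claim the top-level subtrees whose position index matches its rank:

    /* Static decomposition: core `rank` of c claims every top-level
     * subtree whose position matches its rank (round-robin).        */

    /* Hypothetical problem-specific serial search over the subtree
     * rooted at the root's child in position pos.                   */
    void solve_subtree(long long pos) { /* ... problem-specific ... */ }

    void brute_force_parallel(int rank, int c, int b)
    {
        /* The root has b children, in positions 0, ..., b-1. */
        for (long long pos = 0; pos < b; pos++)
            if (pos % c == rank)   /* static split: no load balancing */
                solve_subtree(pos);
    }

Because subtree sizes can differ by orders of magnitude, such a static split leaves most cores idle while a few grind through the hardest subtrees, which is exactly the load-balancing drawback noted above.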

Even though our focus is on NP-hard graph problems, we note that recursive backtracking is a widely used technique for solving a broad range of practical problems. This justifies the need for a general strategy to simplify the migration from serial to parallel algorithms. One example of a successful parallel framework for solving different types of problems is MapReduce [8]. The success of the MapReduce model can be attributed to its simplicity, transparency, and scalability, all of which are properties essential for any efficient parallel algorithm. In this paper, we propose a simple, lightweight, scalable approach for transforming almost any recursive backtracking algorithm into a parallel one with minimal communication overhead and a load-balancing strategy that is implicit, i.e., does not require any problem-specific knowledge. The key idea behind our approach is the use of efficient traversal operations on an indexed search tree that is oblivious to the problem being solved. To test our approach, we implement parallel exact algorithms for the well-known Vertex Cover and Dominating Set problems. Experimental results show that for sufficiently hard instances, we obtain nearly linear speedups on at least 32,768 cores.

Section snippets

Preliminaries

Typically, a recursive backtracking algorithm exhaustively explores a search tree T using depth-first search traversal. Each node of T (a search node) maintains some data structures required for completing the search. We denote a search node by N_{d,p}, where d is the depth of N_{d,p} in T and p is the position of N_{d,p} in the left-to-right ordering of all search nodes at depth d. The root of T is thus N_{0,0}. We use T(N_{d,p}) to denote the subtree rooted at node N_{d,p}. We say T has branching factor b if
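Assuming a complete search tree with a fixed branching factor b (a simplification for illustration; the names below are ours), this indexing lets parent and child positions be computed arithmetically, so a node is fully identified by the pair (d, p):

    /* Position arithmetic for an indexed search tree with branching
     * factor b: the i-th child (0 <= i < b) of N_{d,p} lives at depth
     * d+1 and position p*b + i, so the pair (d, p) identifies a node. */
    typedef struct { int depth; long long pos; } node_id;

    node_id child_of(node_id n, int i, int b)
    {
        node_id c = { n.depth + 1, n.pos * b + i };
        return c;
    }

    node_id parent_of(node_id n, int b)
    {
        node_id p = { n.depth - 1, n.pos / b };
        return p;
    }

For example, with b = 2 the right child of the root N_{0,0} is N_{1,1}, and that node's left child is N_{2,2}.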

Communication overhead

The most evident overhead in parallel algorithms is that of communication. Several models have been presented in the literature, including centralized (i.e., the master–worker model, where most communication and task-distribution duties are assigned to a single core) [4], decentralized [12], [3], and hybrids of both [23]. Although each model has its pros and cons, centralization rapidly becomes a bottleneck when the number of computing cores exceeds a certain threshold [4].
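As a hedged sketch of the decentralized alternative (our illustration, not the paper's protocol; the message tags and the (depth, position) task encoding are assumptions), an idle core can ask a randomly chosen peer for work directly, so no single core becomes a communication hotspot:

    /* Decentralized work request: an idle rank asks a random peer
     * for a task instead of a central master. The peer-side logic
     * that services TAG_REQUEST messages is omitted for brevity.   */
    #include <mpi.h>
    #include <stdlib.h>

    #define TAG_REQUEST 1
    #define TAG_TASK    2

    /* Blocks until a peer replies with a task encoded as the pair
     * (depth, position) of an unprocessed search node.             */
    void request_work(int rank, int nprocs, long long task[2])
    {
        int peer = rand() % nprocs;
        if (peer == rank) peer = (peer + 1) % nprocs;
        MPI_Send(&rank, 1, MPI_INT, peer, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(task, 2, MPI_LONG_LONG, peer, TAG_TASK,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }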

Addressing the challenges

In this section, we show how to incrementally transform Serial-RB into a parallel algorithm. First, we discuss indexed search trees and their use in a generic and compact task-encoding scheme. As a byproduct of this encoding, we show how we can efficiently extract heavy (if not heaviest) unprocessed tasks for dynamic load balancing. We provide pseudocode to illustrate the simplicity of transforming serial algorithms to parallel ones. The end result is a parallel algorithm, Parallel-RB, which
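The compactness of the encoding is easy to see: because the path from the root to N_{d,p} is determined by the base-b digits of p, a task can be shipped between cores as just the pair (d, p) and reconstructed locally. A minimal sketch (assuming a fixed branching factor b and a maximum depth of 64; a real solver would apply each branch decision to its own problem state rather than print it):

    /* Replaying a task: recover the branch choices on the path from
     * the root N_{0,0} to N_{d,p} from the base-b digits of p, most
     * significant digit first.                                      */
    #include <stdio.h>

    void replay_path(int d, long long p, int b)
    {
        int choice[64];                    /* assumes d <= 64        */
        for (int i = d - 1; i >= 0; i--) { /* peel digits off p      */
            choice[i] = (int)(p % b);
            p /= b;
        }
        for (int i = 0; i < d; i++)        /* apply choices root-down */
            printf("depth %d: take branch %d\n", i + 1, choice[i]);
    }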

Implementation

We tested our approach with parallel implementations of algorithms for the well-known Vertex Cover and Dominating Set problems.

Vertex Cover
Input: A graph G = (V, E)
Question: Find a set C ⊆ V such that |C| is minimized and the graph induced by V ∖ C is edgeless

Dominating Set
Input: A graph G = (V, E)
Question: Find a set D ⊆ V such that |D| is minimized and every vertex in G is either in D or adjacent to a vertex in D

Both problems have received considerable attention in the areas of exact and fixed parameter
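To make the branching concrete, here is a minimal serial backtracking sketch for Vertex Cover (our illustration, not the paper's implementation, which additionally employs reduction rules): pick any uncovered edge (u, v); every cover must contain u or v, and the size of the best cover found so far prunes the search.

    /* Minimal Vertex Cover backtracking (no reduction rules).
     * adj is an n-by-n adjacency matrix; in_cover marks chosen
     * vertices; the caller sets best = n before calling vc_search(0). */
    int n, best;
    int adj[64][64], in_cover[64];

    static int find_edge(int *u, int *v)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (adj[i][j] && !in_cover[i] && !in_cover[j]) {
                    *u = i; *v = j;
                    return 1;
                }
        return 0;
    }

    void vc_search(int size)
    {
        int u, v;
        if (size >= best) return;          /* bound: cannot improve  */
        if (!find_edge(&u, &v)) {          /* every edge is covered  */
            best = size;
            return;
        }
        in_cover[u] = 1; vc_search(size + 1); in_cover[u] = 0; /* take u */
        in_cover[v] = 1; vc_search(size + 1); in_cover[v] = 0; /* take v */
    }

The two recursive calls are exactly the kind of ordered branching the indexing scheme exploits: at depth d, the “take u” call descends from position p to position 2p and the “take v” call to position 2p + 1.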

Experimental results

Our code, written in the standard C language, utilizes the Message Passing Interface (MPI) [10] and has no other dependencies. Computations were performed on the BGQ supercomputer at the SciNet HPC Consortium. The BGQ production system is a 3rd generation Blue Gene IBM supercomputer built around

Interpreting the results

In almost all cases, the algorithms achieve near-linear speedup on at least 32,768 cores. Not surprisingly, whenever the time required to solve an instance drops to just a few minutes, the overall performance of the algorithms decreases as we add more cores to the computation. More surprising might be the super-linear speedups attained for the 60-graph. This is mainly due to a form of cooperation among the various cores: when one core finds a solution of a certain “improved” size, the value
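This kind of cooperation needs very little machinery. As one hedged sketch (a blocking collective chosen for clarity; an actual implementation would more likely exchange bounds asynchronously alongside its other messages), all ranks can periodically agree on the smallest solution size seen so far:

    /* Sharing an improved incumbent: every rank contributes its local
     * best solution size and adopts the global minimum, which then
     * prunes larger branches of the search on every core.            */
    #include <mpi.h>

    int sync_best(int local_best)
    {
        int global_best;
        MPI_Allreduce(&local_best, &global_best, 1, MPI_INT,
                      MPI_MIN, MPI_COMM_WORLD);
        return global_best;
    }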

Conclusions and future work

Combining indexed search trees and (local) heaviest-task extraction with a decentralized communication model, we have shown how any serial recursive backtracking algorithm, with some ordered branching, can be modified to run in parallel. Some of the key advantages of our approach are:

  • The migration from serial to parallel entails very little additional coding. Implementing each of our parallel algorithms took less than two days.

  • It completely eliminates the need for buffering multiple tasks and

Acknowledgments

The authors would like to thank Chris Loken and the SciNet team for providing access to the BGQ production system and for their support throughout the experiments. Research was supported by the Natural Sciences and Engineering Research Council of Canada.

References (26)

  • J. Chen et al., Vertex cover: further observations and further improvements, J. Algorithms (2001)

  • J. Chen et al., Improved upper bounds for vertex cover, Theoret. Comput. Sci. (2010)

  • V. Kumar et al., Scalable load balancing techniques for parallel computers, J. Parallel Distrib. Comput. (1994)

  • F.N. Abu-Khzam, M.A. Langston, A.E. Mouawad, C.P. Nolan, A hybrid graph representation for recursive backtracking...

  • F.N. Abu-Khzam et al., Scalable parallel algorithms for FPT problems, Algorithmica (2006)

  • F.N. Abu-Khzam, A.E. Mouawad, A decentralized load balancing approach for parallel search-tree optimization, in:...

  • F.N. Abu-Khzam, M.A. Rizk, D.A. Abdallah, N.F. Samatova, The buffered work-pool approach for search-tree based...

  • W.F. Clocksin et al., A method for efficiently executing Horn clause programs using multiple processors, New Gen. Comput. (1988)

  • J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)

  • S. Debroni et al., Maximum independent sets of the 120-cell and other regular polytopes, Ars Mathematica Contemporanea (2013)

  • J.J. Dongarra et al., MPI: a message-passing interface standard, Int. J. Supercomput. Appl. (1994)

  • R.G. Downey et al., Parameterized Complexity (1997)

  • G.D. Fatta et al., Decentralized load balancing for highly irregular search problems, Microprocess. Microsyst. (2007)

Faisal N. Abu-Khzam is a faculty member in the Department of Computer Science and Mathematics at the Lebanese American University. His research interests include High Performance Computing, Graph Algorithms, Parameterized Complexity and Computational Biology.

Khuzaima Daudjee is a faculty member in the David R. Cheriton School of Computer Science at the University of Waterloo. His research interests are in Distributed Systems and Database Systems.

Amer E. Mouawad is a Ph.D. student at the University of Waterloo, Canada, working under the supervision of Prof. Naomi Nishimura. His research interests are in Graph Theory, Parameterized Complexity, Combinatorial Optimization, and High Performance Computing.

Naomi Nishimura has been on the faculty at the David R. Cheriton School of Computer Science at the University of Waterloo since 1991. Her main research interests include Graph Algorithms and Parameterized Complexity.
