On scalable parallel recursive backtracking

https://doi.org/10.1016/j.jpdc.2015.07.006

Highlights

  • A simple framework for parallelizing exact search-tree algorithms.

  • An indexing scheme for simple task transmission and efficient communication.

  • Efficient and effective extraction of heavy tasks for dynamic load balancing.

  • The presented method scales almost linearly to a record number of processing elements.

Abstract

Supercomputers are equipped with an increasingly large number of cores, providing the computational power needed to solve problems that are otherwise intractable. Unfortunately, getting serial algorithms to run in parallel to take advantage of these computational resources remains a challenge for several application domains. Many parallel algorithms can scale to only hundreds of cores. The limiting factors of such algorithms are usually communication overhead and poor load balancing. Solving NP-hard graph problems to optimality using exact algorithms is an example of an area in which there has so far been limited success in obtaining large-scale parallelism. Many of these algorithms use recursive backtracking as their core solution paradigm. In this paper, we propose a lightweight, easy-to-use, scalable approach for transforming almost any recursive backtracking algorithm into a parallel one. Our approach incurs minimal communication overhead and guarantees a load-balancing strategy that is implicit, i.e., does not require any problem-specific knowledge. The key idea behind our approach is the use of efficient traversal operations on an indexed search tree that is oblivious to the problem being solved. We test our approach with parallel implementations of algorithms for the well-known Vertex Cover and Dominating Set problems. On sufficiently hard instances, experimental results show nearly linear speedups for thousands of cores, reducing running times from days to just a few minutes.

Introduction

Parallel computation is becoming increasingly important as single-processor performance levels off and the transistor gains of Moore’s law are instead delivered as additional cores. This paradigm shift means that attaining speedup requires software implementing algorithms that can run in parallel on multiple processors/cores. Today we have a growing list of supercomputers with tremendous processing power. Some of these systems include more than a million computing cores and can achieve up to 30 Petaflop/s. The constant increase in the number of processors/cores per supercomputer motivates the development of parallel algorithms that can efficiently utilize such processing infrastructures. Unfortunately, migrating known serial algorithms to exploit parallelism while maintaining scalability is not straightforward. The overheads introduced by parallelism are often hard to evaluate, and fair load balancing is possible only when accurate estimates of task “hardness” or “weight” can be calculated on the fly. Providing such estimates usually requires problem-specific knowledge, rendering the techniques developed for one problem useless when trying to parallelize an algorithm for another.

As it is unlikely that polynomial-time algorithms can be found for NP-hard problems, the search for fast deterministic algorithms could benefit greatly from the processing capabilities of supercomputers. Researchers working in the area of exact algorithms have developed algorithms yielding lower and lower running times [5], [15], [6], [14], [21]. However, the major focus has been on improving the asymptotic worst-case behavior of algorithms. The practical possibility of exploiting parallel infrastructures has received much less attention.

Most existing exact algorithms for NP-hard graph problems follow the well-known branch-and-reduce paradigm. A branch-and-reduce algorithm searches the complete solution space of a given problem for an optimal solution. Simple enumeration is usually prohibitively expensive due to the exponentially increasing number of potential solutions. To prune parts of the solution space, an algorithm uses reduction rules derived from bounds on the function to be optimized and the value of the current best solution. The reader is referred to Woeginger’s excellent survey paper on exact algorithms for further details [25]. At the implementation level, branch-and-reduce algorithms translate to search-tree-based recursive backtracking algorithms. The search tree size usually grows exponentially with either the size of the input instance n or some integer parameter k when the problem is fixed-parameter tractable [11].

Nevertheless, search trees are good candidates for parallel decomposition. While most divide-and-conquer methods for parallel algorithms aim at partitioning a problem instance among the cores, we partition the search space of the problem instead. Given c cores or processing elements, a brute-force parallel solution would divide a search tree into c subtrees and assign each subtree to a separate core for sequential processing. One might hope to thus reduce the overall running time by a factor of c. However, this intuitive approach suffers from several drawbacks, including the obvious lack of load balancing.
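As a rough sketch of this brute-force decomposition (illustrative C; solve_subtree and the round-robin assignment are our own names, not the paper's method), each of the c cores could claim the top-level subtrees whose position index matches its rank:

    /* Static decomposition: core `rank` of c claims every top-level
     * subtree whose position matches its rank (round-robin).        */

    /* Hypothetical problem-specific serial search over the subtree
     * rooted at the root's child in position pos.                   */
    void solve_subtree(long long pos) { /* ... problem-specific ... */ }

    void brute_force_parallel(int rank, int c, int b)
    {
        /* The root has b children, in positions 0, ..., b-1. */
        for (long long pos = 0; pos < b; pos++)
            if (pos % c == rank)   /* static split: no load balancing */
                solve_subtree(pos);
    }

Because subtree sizes can differ by orders of magnitude, such a static split leaves most cores idle while a few grind through the hardest subtrees, which is exactly the load-balancing drawback noted above.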

Even though our focus is on NP-hard graph problems, we note that recursive backtracking is a widely used technique for solving a broad range of practical problems. This justifies the need for a general strategy to simplify the migration from serial to parallel algorithms. One example of a successful parallel framework for solving different types of problems is MapReduce [8]. The success of the MapReduce model can be attributed to its simplicity, transparency, and scalability, all of which are properties essential for any efficient parallel algorithm. In this paper, we propose a simple, lightweight, scalable approach for transforming almost any recursive backtracking algorithm into a parallel one with minimal communication overhead and a load-balancing strategy that is implicit, i.e., does not require any problem-specific knowledge. The key idea behind our approach is the use of efficient traversal operations on an indexed search tree that is oblivious to the problem being solved. To test our approach, we implement parallel exact algorithms for the well-known Vertex Cover and Dominating Set problems. Experimental results show that for sufficiently hard instances, we obtain nearly linear speedups on at least 32,768 cores.

Section snippets

Preliminaries

Typically, a recursive backtracking algorithm exhaustively explores a search tree T using depth-first search traversal. Each node of T (a search node) maintains some data structures required for completing the search. We denote a search node by N_{d,p}, where d is the depth of N_{d,p} in T and p is the position of N_{d,p} in the left-to-right ordering of all search nodes at depth d. The root of T is thus N_{0,0}. We use T(N_{d,p}) to denote the subtree rooted at node N_{d,p}. We say T has branching factor b if
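Assuming a complete search tree with a fixed branching factor b (a simplification for illustration; the names below are ours), this indexing lets parent and child positions be computed arithmetically, so a node is fully identified by the pair (d, p):

    /* Position arithmetic for an indexed search tree with branching
     * factor b: the i-th child (0 <= i < b) of N_{d,p} lives at depth
     * d+1 and position p*b + i, so the pair (d, p) identifies a node. */
    typedef struct { int depth; long long pos; } node_id;

    node_id child_of(node_id n, int i, int b)
    {
        node_id c = { n.depth + 1, n.pos * b + i };
        return c;
    }

    node_id parent_of(node_id n, int b)
    {
        node_id p = { n.depth - 1, n.pos / b };
        return p;
    }

For example, with b = 2 the right child of the root N_{0,0} is N_{1,1}, and that node's left child is N_{2,2}.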

Communication overhead

The most evident overhead in parallel algorithms is that of communication. Several models have been presented in the literature, including centralized (i.e., the master–worker model, where most communication and task-distribution duties are assigned to a single core) [4], decentralized [12], [3], and hybrids of both [23]. Although each model has its pros and cons, centralization rapidly becomes a bottleneck when the number of computing cores exceeds a certain threshold [4].
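As a hedged sketch of the decentralized alternative (our illustration, not the paper's protocol; the message tags and the (depth, position) task encoding are assumptions), an idle core can ask a randomly chosen peer for work directly, so no single core becomes a communication hotspot:

    /* Decentralized work request: an idle rank asks a random peer
     * for a task instead of a central master. The peer-side logic
     * that services TAG_REQUEST messages is omitted for brevity.   */
    #include <mpi.h>
    #include <stdlib.h>

    #define TAG_REQUEST 1
    #define TAG_TASK    2

    /* Blocks until a peer replies with a task encoded as the pair
     * (depth, position) of an unprocessed search node.             */
    void request_work(int rank, int nprocs, long long task[2])
    {
        int peer = rand() % nprocs;
        if (peer == rank) peer = (peer + 1) % nprocs;
        MPI_Send(&rank, 1, MPI_INT, peer, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(task, 2, MPI_LONG_LONG, peer, TAG_TASK,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }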

Addressing the challenges

In this section, we show how to incrementally transform Serial-RB into a parallel algorithm. First, we discuss indexed search trees and their use in a generic and compact task-encoding scheme. As a byproduct of this encoding, we show how we can efficiently extract heavy (if not heaviest) unprocessed tasks for dynamic load balancing. We provide pseudocode to illustrate the simplicity of transforming serial algorithms to parallel ones. The end result is a parallel algorithm, Parallel-RB, which
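The compactness of the encoding is easy to see: because the path from the root to N_{d,p} is determined by the base-b digits of p, a task can be shipped between cores as just the pair (d, p) and reconstructed locally. A minimal sketch (assuming a fixed branching factor b and a maximum depth of 64; a real solver would apply each branch decision to its own problem state rather than print it):

    /* Replaying a task: recover the branch choices on the path from
     * the root N_{0,0} to N_{d,p} from the base-b digits of p, most
     * significant digit first.                                      */
    #include <stdio.h>

    void replay_path(int d, long long p, int b)
    {
        int choice[64];                    /* assumes d <= 64        */
        for (int i = d - 1; i >= 0; i--) { /* peel digits off p      */
            choice[i] = (int)(p % b);
            p /= b;
        }
        for (int i = 0; i < d; i++)        /* apply choices root-down */
            printf("depth %d: take branch %d\n", i + 1, choice[i]);
    }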

Implementation

We tested our approach with parallel implementations of algorithms for the well-known Vertex Cover and Dominating Set problems.

Vertex Cover
Input: A graph G = (V, E)
Question: Find a set C ⊆ V such that |C| is minimized and the graph induced by V ∖ C is edgeless

Dominating Set
Input: A graph G = (V, E)
Question: Find a set D ⊆ V such that |D| is minimized and every vertex in G is either in D or adjacent to a vertex in D

Both problems have received considerable attention in the areas of exact and fixed parameter
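To make the branching concrete, here is a minimal serial backtracking sketch for Vertex Cover (our illustration, not the paper's implementation, which additionally employs reduction rules): pick any uncovered edge (u, v); every cover must contain u or v, and the size of the best cover found so far prunes the search.

    /* Minimal Vertex Cover backtracking (no reduction rules).
     * adj is an n-by-n adjacency matrix; in_cover marks chosen
     * vertices; the caller sets best = n before calling vc_search(0). */
    int n, best;
    int adj[64][64], in_cover[64];

    static int find_edge(int *u, int *v)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (adj[i][j] && !in_cover[i] && !in_cover[j]) {
                    *u = i; *v = j;
                    return 1;
                }
        return 0;
    }

    void vc_search(int size)
    {
        int u, v;
        if (size >= best) return;          /* bound: cannot improve  */
        if (!find_edge(&u, &v)) {          /* every edge is covered  */
            best = size;
            return;
        }
        in_cover[u] = 1; vc_search(size + 1); in_cover[u] = 0; /* take u */
        in_cover[v] = 1; vc_search(size + 1); in_cover[v] = 0; /* take v */
    }

The two recursive calls are exactly the kind of ordered branching the indexing scheme exploits: at depth d, the “take u” call descends from position p to position 2p and the “take v” call to position 2p + 1.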

Experimental results

Our code, written in the standard C language, utilizes the Message Passing Interface (MPI) [10] and has no other dependencies. Computations were performed on the BGQ supercomputer at the SciNet HPC Consortium. The BGQ production system is a 3rd generation Blue Gene IBM supercomputer built around

Interpreting the results

In almost all cases, the algorithms achieve near-linear speedup on at least 32,768 cores. Not surprisingly, whenever the time required to solve an instance drops to just a few minutes, the overall performance of the algorithms decreases as we add more cores to the computation. More surprising might be the super-linear speedups attained for the 60-graph. This is mainly due to a form of cooperation among the various cores: when one core finds a solution of a certain “improved” size, the value
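This kind of cooperation needs very little machinery. As one hedged sketch (a blocking collective chosen for clarity; an actual implementation would more likely exchange bounds asynchronously alongside its other messages), all ranks can periodically agree on the smallest solution size seen so far:

    /* Sharing an improved incumbent: every rank contributes its local
     * best solution size and adopts the global minimum, which then
     * prunes larger branches of the search on every core.            */
    #include <mpi.h>

    int sync_best(int local_best)
    {
        int global_best;
        MPI_Allreduce(&local_best, &global_best, 1, MPI_INT,
                      MPI_MIN, MPI_COMM_WORLD);
        return global_best;
    }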

Conclusions and future work

Combining indexed search trees and (local) heaviest-task extraction with a decentralized communication model, we have shown how any serial recursive backtracking algorithm, with some ordered branching, can be modified to run in parallel. Some of the key advantages of our approach are:

  • The migration from serial to parallel entails very little additional coding. Implementing each of our parallel algorithms took less than two days.

  • It completely eliminates the need for buffering multiple tasks and

Acknowledgments

The authors would like to thank Chris Loken and the SciNet team for providing access to the BGQ production system and for their support throughout the experiments. Research was supported by the Natural Sciences and Engineering Research Council of Canada.

References (26)

  • J. Chen et al., Vertex cover: further observations and further improvements, J. Algorithms (2001)

  • J. Chen et al., Improved upper bounds for vertex cover, Theoret. Comput. Sci. (2010)

  • V. Kumar et al., Scalable load balancing techniques for parallel computers, J. Parallel Distrib. Comput. (1994)

  • F.N. Abu-Khzam, M.A. Langston, A.E. Mouawad, C.P. Nolan, A hybrid graph representation for recursive backtracking...

  • F.N. Abu-Khzam et al., Scalable parallel algorithms for FPT problems, Algorithmica (2006)

  • F.N. Abu-Khzam, A.E. Mouawad, A decentralized load balancing approach for parallel search-tree optimization, in:...

  • F.N. Abu-Khzam, M.A. Rizk, D.A. Abdallah, N.F. Samatova, The buffered work-pool approach for search-tree based...

  • W.F. Clocksin et al., A method for efficiently executing Horn clause programs using multiple processors, New Gen. Comput. (1988)

  • J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)

  • S. Debroni et al., Maximum independent sets of the 120-cell and other regular polytopes, Ars Mathematica Contemporanea (2013)

  • J.J. Dongarra et al., MPI: a message-passing interface standard, Int. J. Supercomput. Appl. (1994)

  • R.G. Downey et al., Parameterized Complexity (1997)

  • G.D. Fatta et al., Decentralized load balancing for highly irregular search problems, Microprocess. Microsyst. (2007)

Faisal N. Abu-Khzam is a faculty member in the Department of Computer Science and Mathematics at the Lebanese American University. His research interests include High Performance Computing, Graph Algorithms, Parameterized Complexity and Computational Biology.

Khuzaima Daudjee is a faculty member in the David R. Cheriton School of Computer Science at the University of Waterloo. His research interests are in Distributed Systems and Database Systems.

Amer E. Mouawad is a Ph.D. student at the University of Waterloo, Canada, working under the supervision of Prof. Naomi Nishimura. His research interests are in Graph Theory, Parameterized Complexity, Combinatorial Optimization, and High Performance Computing.

Naomi Nishimura has been on the faculty at the David R. Cheriton School of Computer Science at the University of Waterloo since 1991. Her main research interests include Graph Algorithms and Parameterized Complexity.
