On scalable parallel recursive backtracking
Introduction
Parallel computation is becoming increasingly important as single-processor performance gains taper off with the slowing of Moore's law. This paradigm shift means that attaining further speedup requires software implementing algorithms that can run in parallel on multiple processors/cores. Today we have a growing list of supercomputers with tremendous processing power; some of these systems include more than a million computing cores and can achieve up to 30 Petaflop/s. The constant increase in the number of processors/cores per supercomputer motivates the development of parallel algorithms that can efficiently utilize such processing infrastructures. Unfortunately, migrating known serial algorithms to exploit parallelism while maintaining scalability is not straightforward. The overheads introduced by parallelism are often hard to evaluate, and fair load balancing is possible only when accurate estimates of task "hardness" or "weight" can be computed on the fly. Providing such estimates usually requires problem-specific knowledge, rendering techniques developed for one problem useless when parallelizing an algorithm for another.
As it is unlikely that polynomial-time algorithms exist for NP-hard problems, the search for fast deterministic algorithms could benefit greatly from the processing capabilities of supercomputers. Researchers working in the area of exact algorithms have developed algorithms with lower and lower running times [5], [15], [6], [14], [21]. However, the major focus has been on improving the asymptotic worst-case behavior of algorithms; the practical possibility of exploiting parallel infrastructures has received much less attention.
Most existing exact algorithms for NP-hard graph problems follow the well-known branch-and-reduce paradigm. A branch-and-reduce algorithm searches the complete solution space of a given problem for an optimal solution. Simple enumeration is usually prohibitively expensive due to the exponentially increasing number of potential solutions. To prune parts of the solution space, an algorithm uses reduction rules derived from bounds on the function to be optimized and the value of the current best solution. The reader is referred to Woeginger’s excellent survey paper on exact algorithms for further details [25]. At the implementation level, branch-and-reduce algorithms translate to search-tree-based recursive backtracking algorithms. The search tree size usually grows exponentially with either the size of the input instance or some integer parameter when the problem is fixed-parameter tractable [11].
Nevertheless, search trees are good candidates for parallel decomposition. While most divide-and-conquer methods for parallel algorithms aim at partitioning a problem instance among the cores, we partition the search space of the problem instead. Given p cores or processing elements, a brute-force parallel solution would divide a search tree into p subtrees and assign each subtree to a separate core for sequential processing. One might hope to thus reduce the overall running time by a factor of p. However, this intuitive approach suffers from several drawbacks, including the obvious lack of load balancing.
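The imbalance of the naive decomposition is easy to see in a toy sketch. The following C fragment (illustrative only, not the paper's code; `core_for_subtree` and `max_core_load` are hypothetical names) assigns top-level subtree s to core s % p and measures the load of the busiest core:

```c
#include <assert.h>

/* Hypothetical illustration (not the paper's code): with p cores, the naive
   scheme assigns top-level subtree s to core s % p, and each core then
   processes its subtrees sequentially. Whether the load is balanced depends
   entirely on how subtree sizes happen to be distributed. */

int core_for_subtree(int s, int p) {
    return s % p;                      /* round-robin static assignment */
}

/* Given the (unknowable in advance) amount of work in each of k top-level
   subtrees, return the load of the busiest core under this assignment. */
long max_core_load(const long *work, int k, int p) {
    long load[64] = {0};               /* sketch assumes p <= 64 */
    long max = 0;
    for (int s = 0; s < k; s++)
        load[core_for_subtree(s, p)] += work[s];
    for (int c = 0; c < p; c++)
        if (load[c] > max) max = load[c];
    return max;
}
```

With subtree sizes {100, 1, 1, 1} and p = 4, the busiest core receives 100 of the 103 units of work, so the hoped-for factor-p speedup evaporates; subtree sizes cannot be predicted in advance, which is exactly why dynamic load balancing is needed.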
Even though our focus is on NP-hard graph problems, we note that recursive backtracking is a widely-used technique for solving a very long list of practical problems. This justifies the need for a general strategy to simplify the migration from serial to parallel algorithms. One example of a successful parallel framework for solving different types of problems is MapReduce [8]. The success of the MapReduce model can be attributed to its simplicity, transparency, and scalability, all of which are properties essential for any efficient parallel algorithm. In this paper, we propose a simple, lightweight, scalable approach for transforming almost any recursive backtracking algorithm into a parallel one with minimal communication overhead and a load balancing strategy that is implicit, i.e., does not require any problem-specific knowledge. The key idea behind our approach is the use of efficient traversal operations on an indexed search tree that is oblivious to the problem being solved. To test our approach, we implement parallel exact algorithms for the well-known Vertex Cover and Dominating Set problems. Experimental results show that for sufficiently hard instances, we obtain nearly linear speedups on at least 32,768 cores.
Section snippets
Preliminaries
Typically, a recursive backtracking algorithm exhaustively explores a search tree T using depth-first traversal. Each node of T (a search node) maintains some data structures required for completing the search. We denote a search node by (d, i), where d is the depth of the node in T and i is the position of the node in the left-to-right ordering of all search nodes at depth d. The root of T is thus (0, 0). We use T(d, i) to denote the subtree rooted at node (d, i). We say T has branching factor b if…
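The index arithmetic for such a tree is simple enough to sketch. Assuming nodes are addressed by (depth d, left-to-right position i) with the root at (0, 0) and a fixed branching factor b (our reading of the indexing idea, not the paper's exact code), the position of a child or parent follows directly:

```c
#include <assert.h>

/* Index arithmetic for a complete search tree with fixed branching factor b.
   A node is addressed by (depth d, left-to-right position i); root is (0,0). */

long child_pos(long i, int b, int j) { return i * b + j; } /* j-th child, 0 <= j < b */
long parent_pos(long i, int b)       { return i / b; }
int  branch_taken(long i, int b)     { return (int)(i % b); } /* which child of its parent */
```

For example, with b = 2, descending right twice from the root reaches position child_pos(child_pos(0, 2, 1), 2, 1) = 3 at depth 2, whose parent sits at position 1 on depth 1. Because positions are pure arithmetic, no pointer structure needs to be stored or communicated to identify a node.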
Communication overhead
The most evident overhead in parallel algorithms is that of communication. Several models have been presented in the literature, including centralized ones (i.e., the master–worker model, where most communication and task-distribution duties are assigned to a single core) [4], decentralized ones [12], [3], and hybrids of both [23]. Although each model has its pros and cons, centralization rapidly becomes a bottleneck when the number of computing cores exceeds a certain threshold [4].
Addressing the challenges
In this section, we show how to incrementally transform Serial-RB into a parallel algorithm. First, we discuss indexed search trees and their use in a generic and compact task-encoding scheme. As a byproduct of this encoding, we show how to efficiently extract heavy (if not heaviest) unprocessed tasks for dynamic load balancing. We provide pseudocode to illustrate the simplicity of transforming serial algorithms into parallel ones. The end result is a parallel algorithm, Parallel-RB, which…
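The essence of a compact, problem-oblivious task encoding can be sketched as follows: a task is just the sequence of branch choices from the root to an unexplored node, so any worker can rebuild the corresponding search state by replaying those choices, and tasks travel as short integer arrays rather than full data-structure snapshots. The toy search below (over binary decision vectors of fixed depth; `explore` and `DEPTH` are hypothetical names, not the paper's Parallel-RB) illustrates the replay-then-search pattern:

```c
#include <assert.h>

#define DEPTH 4   /* depth of the toy binary search tree */

/* Explore the subtree below the node encoded by path[0..plen), returning
   the number of leaves it contains. While depth < plen we are in the
   replay phase: in a real algorithm the replayed choice path[depth] would
   update the problem state; the toy state here is trivial, so the choice
   is only followed, not interpreted. */
long explore(const int *path, int plen, int depth) {
    if (depth == DEPTH) return 1;                 /* leaf reached */
    if (depth < plen)                             /* replay phase */
        return explore(path, plen, depth + 1);
    long total = 0;                               /* search phase: branch */
    for (int j = 0; j < 2; j++)
        total += explore(path, plen, depth + 1);
    return total;
}
```

A task encoded as {1, 0} pins down the first two branch choices, so exploring it visits the 2^(4−2) = 4 leaves below that node, while the empty encoding covers the whole tree of 16 leaves.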
Implementation
We tested our approach with parallel implementations of algorithms for the well-known Vertex Cover and Dominating Set problems.
Vertex Cover
Input: A graph G = (V, E)
Question: Find a set S ⊆ V such that |S| is minimized and the graph induced by V ∖ S is edgeless

Dominating Set
Input: A graph G = (V, E)
Question: Find a set S ⊆ V such that |S| is minimized and every vertex in V is either in S or is adjacent to a vertex in S
Both problems have received considerable attention in the areas of exact and fixed-parameter…
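For concreteness, a minimal branching algorithm for Vertex Cover can be written in a few lines; this is only an illustration of the branch-and-reduce pattern on an adjacency matrix (the paper's implementation is far more engineered, and `min_vc` is a hypothetical name). The branch rule: pick any remaining uncovered edge (u, v); every cover must contain u or v, so try both.

```c
#include <assert.h>

#define MAXN 8
int n;                    /* number of vertices */
int adj[MAXN][MAXN];      /* adjacency matrix */

/* Return the size of a minimum vertex cover extending the partial cover
   in_cover[], which already contains `taken` vertices. */
int min_vc(int in_cover[MAXN], int taken) {
    for (int u = 0; u < n; u++) {
        if (in_cover[u]) continue;
        for (int v = u + 1; v < n; v++) {
            if (in_cover[v] || !adj[u][v]) continue;
            /* found an uncovered edge (u,v): branch on its endpoints */
            in_cover[u] = 1;
            int a = min_vc(in_cover, taken + 1);
            in_cover[u] = 0;
            in_cover[v] = 1;
            int b = min_vc(in_cover, taken + 1);
            in_cover[v] = 0;
            return a < b ? a : b;
        }
    }
    return taken;         /* no uncovered edge left: this is a cover */
}
```

On a triangle, for instance, both branches bottom out at covers of size 2, the optimum. Each two-way branch here is exactly one level of the search tree that the parallel scheme partitions.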
Experimental results
Our code, written in standard C, uses the Message Passing Interface (MPI) [10] and has no other dependencies. Computations were performed on the BGQ supercomputer at the SciNet HPC Consortium. The BGQ production system is a third-generation IBM Blue Gene supercomputer built around…
Interpreting the results
In almost all cases, the algorithms achieve near-linear speedup on at least 32,768 cores. Not surprisingly, whenever the time required to solve an instance drops to just a few minutes, overall performance decreases as more cores are added to the computation. More surprising may be the super-linear speedups attained for the 60-graph. This is mainly due to a form of cooperation among the various cores: when one core finds a solution of a certain "improved" size, the value…
Conclusions and future work
Combining indexed search trees and (local) heaviest-task extraction with a decentralized communication model, we have shown how any serial recursive backtracking algorithm with some ordered branching can be modified to run in parallel. Some of the key advantages of our approach are:
- The migration from serial to parallel entails very little additional coding; implementing each of our parallel algorithms took less than two days.
- It completely eliminates the need for buffering multiple tasks and…
Acknowledgments
The authors would like to thank Chris Loken and the SciNet team for providing access to the BGQ production system and for their support throughout the experiments. Research was supported by the Natural Science and Engineering Research Council of Canada.
References (26)
- et al., Vertex cover: further observations and further improvements, J. Algorithms (2001)
- et al., Improved upper bounds for vertex cover, Theoret. Comput. Sci. (2010)
- et al., Scalable load balancing techniques for parallel computers, J. Parallel Distrib. Comput. (1994)
- F.N. Abu-Khzam, M.A. Langston, A.E. Mouawad, C.P. Nolan, A hybrid graph representation for recursive backtracking…
- et al., Scalable parallel algorithms for FPT problems, Algorithmica (2006)
- F.N. Abu-Khzam, A.E. Mouawad, A decentralized load balancing approach for parallel search-tree optimization, in:…
- F.N. Abu-Khzam, M.A. Rizk, D.A. Abdallah, N.F. Samatova, The buffered work-pool approach for search-tree based…
- et al., A method for efficiently executing horn clause programs using multiple processors, New Gen. Comput. (1988)
- et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
- et al., Maximum independent sets of the 120-cell and other regular polyhedral, Ars Mathematica Contemporanea (2013)
- MPI: a message-passing interface standard, Int. J. Supercomput. Appl.
- Parameterized Complexity
- Decentralized load balancing for highly irregular search problems, Microprocess. Microsyst.
Faisal N. Abu-Khzam is a faculty member in the Department of Computer Science and Mathematics at the Lebanese American University. His research interests include High Performance Computing, Graph Algorithms, Parameterized Complexity and Computational Biology.
Khuzaima Daudjee is a faculty member in the David R. Cheriton School of Computer Science at the University of Waterloo. His research interests are in Distributed Systems and Database Systems.
Amer E. Mouawad is a Ph.D. student at the University of Waterloo, Canada, working under the supervision of Prof. Naomi Nishimura. His research interests are in Graph Theory, Parameterized Complexity, Combinatorial Optimization, and High Performance Computing.
Naomi Nishimura has been on the faculty at the David R. Cheriton School of Computer Science at the University of Waterloo since 1991. Her main research interests include Graph Algorithms and Parameterized Complexity.