Orchestrating parallel detection of strongly connected components on GPUs☆
Introduction
Strongly connected component (SCC) detection is a fundamental graph analysis problem that arises in many application domains. Tarjan’s algorithm [1] is an efficient sequential method for SCC detection. However, parallelizing Tarjan’s algorithm is challenging because it relies on an inherently sequential DFS (depth-first search) traversal of the graph. To accelerate SCC detection for large-scale graphs, parallel algorithms based on BFS (breadth-first search) traversal have been proposed. The Forward-Backward (FB) algorithm [2] and its enhancement FB-Trim [3] are practical algorithms of this kind that improve performance.
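To make the FB-Trim idea concrete, the following is a minimal sequential sketch in Python. It is not the GPU implementation of [4] nor the method proposed in this paper. The function names (`reachable`, `trim`, `fb_trim`), the adjacency-dictionary input and the arbitrary pivot choice are assumptions made for illustration. Each step picks a pivot, computes its forward and backward reachable sets with BFS, reports their intersection as one SCC, and continues on the three remaining partitions. Trimming first peels off vertices that have no live predecessor or no live successor, since each of them is a trivial SCC.

```python
from collections import deque


def reachable(start, adj, alive):
    """BFS from `start` along `adj`, restricted to the vertices in `alive`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in alive and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def trim(alive, succ, pred):
    """Remove vertices lacking a live predecessor or successor (trivial SCCs).

    Mutates `alive` and returns the removed vertices as singleton sets.
    """
    trivial = []
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            has_in = any(u in alive and u != v for u in pred[v])
            has_out = any(w in alive and w != v for w in succ[v])
            if not (has_in and has_out):
                alive.discard(v)
                trivial.append({v})
                changed = True
    return trivial


def fb_trim(succ):
    """Return the SCCs of a graph given as {vertex: [successors]}.

    Assumes every vertex appears as a key of `succ`.
    """
    pred = {v: [] for v in succ}
    for u, outs in succ.items():
        for v in outs:
            pred[v].append(u)

    sccs = []
    work = [set(succ)]            # partitions that still need processing
    while work:
        alive = work.pop()
        sccs.extend(trim(alive, succ, pred))
        if not alive:
            continue
        pivot = next(iter(alive))  # arbitrary pivot (an assumption)
        fwd = reachable(pivot, succ, alive)
        bwd = reachable(pivot, pred, alive)
        scc = fwd & bwd
        sccs.append(scc)
        # No SCC can span two of these three partitions.
        for part in (fwd - scc, bwd - scc, alive - fwd - bwd):
            if part:
                work.append(part)
    return sccs
```

The three partitions pushed onto `work` are independent of one another, which is the property that parallel implementations exploit.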
Barnat et al. [4] implemented the FB-Trim algorithm on the GPU using the CUDA programming model. Thanks to parallelization, their implementation achieves high performance on some randomly generated graphs. However, it does not fully take graph properties into consideration and therefore does not perform well across different types of graphs, especially real-world graphs [5], [6].
On the other hand, Hong et al. [7] improved the FB-Trim algorithm with an efficient parallel CPU SCC detection method designed specifically for real-world graphs. They used a two-phase method to handle small-world graphs and achieved substantial speedups on multicore CPUs. Hong’s work implies that graph algorithms should be aware of graph properties and adjust to handle different situations. As our evaluation shows, graph properties are also critical for GPU implementations.
Real-world graphs from social networks usually exhibit the small-world property with a power-law degree distribution, and therefore usually contain one giant SCC and many small nontrivial SCCs (i.e. skewed component sizes, the static graph property). In addition, the graph structure changes dynamically during SCC detection: once an SCC is detected, it is removed from the original graph (i.e. the dynamic graph property, which is due to the graph algorithm). Therefore, after the giant SCC is detected and removed, the remaining graph contains a large number of disconnected small subgraphs. Previous GPU implementations (e.g. Barnat’s implementation) cannot handle such cases efficiently, as they become almost serialized when processing the remaining subgraphs.
In this work, we propose an efficient, hybrid SCC detection method on the GPU to overcome the limitations of existing GPU implementations. Our method is designed by taking graph properties into account. First, to deal with the static property, we decompose SCC detection into two phases: processing the giant SCC and processing the remaining small nontrivial SCCs. The two phases use different parallelism approaches. The single giant SCC offers abundant data parallelism, while the large number of small SCCs can benefit from task parallelism. To enable efficient task parallelism in the second phase, we examine optimizations that were previously used in CPU SCC detection and port them to the GPU. Second, to deal with the dynamic property, we devise different BFS traversal strategies and choose the suitable one for each phase. By using the two-phase hybrid method and by customizing the graph traversal strategies, our method achieves high performance for a large variety of synthetic and real-world graphs.
We validate the effectiveness and efficiency of our hybrid method using CUDA on an NVIDIA GPU. Evaluation with diverse synthetic as well as real-world graphs shows that our method significantly outperforms existing GPU implementations. We also compare our method with state-of-the-art sequential and OpenMP implementations on the CPU, and we achieve average speedups of 5.6× and 1.5×, respectively.
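The two-phase structure can be sketched sequentially as follows. This is only an illustration of the decomposition, not the CUDA implementation described in this paper. The pivot heuristic (largest product of in-degree and out-degree), the use of weakly connected components to form independent tasks, and all function names are assumptions. Phase 1 runs a single Forward-Backward step over the whole graph, where a GPU version would spread each BFS level across many threads. Phase 2 splits the remainder into independent pieces and processes them from a task list, which here is an ordinary loop standing in for task parallelism.

```python
from collections import deque


def bfs(start, adj, alive):
    """Vertices reachable from `start` along `adj`, restricted to `alive`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in alive and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def fb_step(pivot, succ, pred, alive):
    """One Forward-Backward step: the SCC that contains `pivot`."""
    return bfs(pivot, succ, alive) & bfs(pivot, pred, alive)


def weak_components(alive, succ, pred):
    """Split `alive` into weakly connected pieces (edge direction ignored)."""
    both = {v: list(succ[v]) + list(pred[v]) for v in alive}
    pieces, todo = [], set(alive)
    while todo:
        piece = bfs(next(iter(todo)), both, todo)
        pieces.append(piece)
        todo -= piece
    return pieces


def two_phase_scc(succ):
    """Return SCCs of {vertex: [successors]}; the phase-1 SCC comes first."""
    pred = {v: [] for v in succ}
    for u, outs in succ.items():
        for v in outs:
            pred[v].append(u)
    alive = set(succ)
    if not alive:
        return []

    # Phase 1: one large traversal from a high-degree pivot, which is
    # likely (not guaranteed) to sit inside the giant SCC.
    pivot = max(alive, key=lambda v: len(succ[v]) * len(pred[v]))
    giant = fb_step(pivot, succ, pred, alive)
    sccs = [giant]
    alive -= giant

    # Phase 2: many small independent tasks. An SCC never spans two
    # weakly connected pieces, so the pieces can be handled separately.
    tasks = weak_components(alive, succ, pred)
    while tasks:
        part = tasks.pop()
        scc = fb_step(next(iter(part)), succ, pred, part)
        sccs.append(scc)
        tasks.extend(weak_components(part - scc, succ, pred))
    return sccs
```

For example, on a graph with a four-vertex cycle feeding two two-vertex cycles, the first returned component is the four-vertex cycle and the rest are found from the task list.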
The main contributions in this work are:
(1) We propose a hybrid SCC detection method that decomposes the SCC detection into two phases and enables different parallelism approaches for different phases to deal with graph irregularities.
(2) We examine the state-of-the-art graph traversal strategies and apply the best-performing strategy to fit the graph properties of each SCC phase.
(3) We port optimization techniques proposed in CPU SCC detection to our GPU implementation to exploit more parallelism.
(4) We demonstrate the effectiveness and efficiency of our hybrid method by implementing and evaluating the proposed method with different types of synthetic and real-world graphs.
The rest of the paper is organized as follows: Section 2 introduces the existing parallel algorithms as well as the state-of-the-art GPU implementations. Section 3 details our proposed design. The experimental evaluation is presented in Section 4. We discuss related work in Section 5, and we conclude the paper in Section 6.
Background and motivation
A strongly connected component in a directed graph is a maximal subgraph in which there exists a directed path from every vertex to every other vertex. SCC detection, which decomposes a given directed graph into a set of disjoint SCCs, is widely used in many graph analytics applications, including web and social network analysis [8], formal verification [9], reinforcement learning [10], mesh refinement [3], computer-aided design [11] and scientific computing [12].
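The definition can be checked directly on a small graph. The sketch below groups vertices by mutual reachability. It is a brute-force reference that takes quadratic time or worse, useful only for illustrating the definition or validating faster methods on tiny inputs. The function names and the adjacency-dictionary format are assumptions made for the example.

```python
def reach(start, succ):
    """All vertices reachable from `start` by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in succ[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sccs_by_definition(succ):
    """Group vertices so that u and v share a group iff each reaches the other.

    Assumes every vertex appears as a key of `succ`.
    """
    reach_set = {v: reach(v, succ) for v in succ}
    components, assigned = [], set()
    for v in succ:
        if v in assigned:
            continue
        component = {w for w in reach_set[v] if v in reach_set[w]}
        components.append(component)
        assigned |= component
    return components


# a -> b -> c -> a forms one SCC; d is reachable from c but cannot return.
example = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
components = sccs_by_definition(example)
```

Here `components` contains the cycle `{"a", "b", "c"}` and the trivial component `{"d"}`, which together partition the vertex set.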
The classic sequential SCC detection
Design and implementation
Despite the irregularity, recent studies [16], [17], [18], [19], [20] demonstrate that GPUs can substantially accelerate graph algorithms with careful design and optimization. In this section, we present our design and implementation of SCC detection that can make good use of the GPU hardware.
Evaluation
In this section, we evaluate our proposed method on various graph datasets (listed in Table 1). We use the R-MAT [27] graph generator GTGraph [28] to generate rmat-er with the parameters (0.25, 0.25, 0.25, 0.25). We choose kron21 from the 10th DIMACS Implementation Challenge (generated by the Kronecker generator). We also pick real-world graphs from the University of Florida Sparse Matrix Collection [29], the SNAP database [30], and the Koblenz Network Collection [31]. These graphs are
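To illustrate what the four R-MAT parameters control, here is a minimal edge generator in Python. It is not GTGraph's code and omits refinements that real generators may apply. The function name `rmat_edges` and its arguments are assumptions for the example. Each edge is placed by recursively choosing one of the four quadrants of the adjacency matrix with probabilities (a, b, c, d). When all four are 0.25, every cell is equally likely, so the result resembles a uniform random graph, which is presumably what the "er" in rmat-er refers to.

```python
import random


def rmat_edges(scale, num_edges, probs=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Generate `num_edges` directed edges over 2**scale vertices.

    `probs` holds the quadrant probabilities (a, b, c, d):
    a = top-left, b = top-right, c = bottom-left, d = bottom-right.
    Duplicate edges and self-loops are not filtered out.
    """
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _ in range(scale):
            quadrant = rng.choices(range(4), weights=probs)[0]
            src = 2 * src + (quadrant >> 1)  # bottom half sets the row bit
            dst = 2 * dst + (quadrant & 1)   # right half sets the column bit
        edges.append((src, dst))
    return edges


# 16 vertices, 40 edges, equal quadrant probabilities.
sample = rmat_edges(scale=4, num_edges=40)
```

Skewing the probabilities toward the first quadrant, as is common for power-law graphs, concentrates edges on low-numbered vertices.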
Related work
Parallel SCC detection is an important graph analysis problem that has been studied intensively. As mentioned, Hong et al. were the first to use the WCC method to handle small-world graphs, and Barnat et al. were the first to implement the FB-Trim algorithm on GPUs. Inspired by Hong’s work, Slota et al. [21] proposed a Multistep strategy to deal with small-world graphs on CPUs. Their approach combines BFS and coloring-based methods and uses them in different algorithm steps. Slota
Conclusion
SCC detection is an important graph algorithm that has been applied in many application domains. Existing GPU implementations cannot efficiently process different types of graphs because the implementations are not aware of the graph properties. In this paper, we demonstrate that it is of great importance to understand the graph properties for accelerating SCC detection. There are two types of properties: (1) the static property, i.e. the small-world and power-law property which leads to skewed
Acknowledgment
We thank the anonymous reviewers for the insightful comments and suggestions. This work is partly supported by the National Natural Science Foundation of China (NSFC) No. 61502514, No. 61402488, and No. 61602501, and the National Key Research and Development Program of China under grant No. 2016YFB0200400.
References (33)
- et al. Graph structure in the web. Comput. Networks (2000)
- NVIDIA, 2015, CUDA C Programming Guide...
- et al. BFS and coloring-based parallel algorithms for strongly connected components and related problems. Proceedings of IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS) (2014)
- Depth-first search and linear graph algorithms. SIAM J. Comput. (1972)
- et al. On identifying strongly connected components in parallel. Proceedings of the 15th IPDPS Workshops, IPDPS ’00 (2000)
- et al. Finding strongly connected components in distributed graphs. J. Parallel Distributed Comput. (JPDC) (2005)
- et al. Computing strongly connected components in parallel on CUDA. Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS), IPDPS ’11 (2011)
- et al. Measurement and analysis of online social networks. Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC ’07 (2007)
- et al. On fast parallel detection of strongly connected components (SCC) in small-world graphs. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), SC ’13 (2013)
- et al. Structure and evolution of online social networks. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), KDD ’06 (2006)
- BDD-based debugging of design using language containment and fair CTL. Proceedings of the 5th International Conference on Computer Aided Verification, CAV ’93
- Automatic discovery of subgoals in reinforcement learning using strongly connected components. Proceedings of the 15th International Conference on Advances in Neuro-information Processing - Volume Part I, ICONIP’08
- Implicit enumeration of strongly connected components and an application to formal verification. Trans. Comp.-Aided Des. Integ. Cir. Sys.
- Computing the block triangular form of a sparse matrix. ACM Trans. Math. Softw. (TOMS)
- Depth-first search is inherently sequential. Inf. Process Lett.
☆ The source code of this work can be found at https://github.com/chenxuhao/gardenia