Elsevier

Parallel Computing

Volume 78, October 2018, Pages 101-114
Orchestrating parallel detection of strongly connected components on GPUs

https://doi.org/10.1016/j.parco.2017.11.001

Highlights

  • Analysis of graph properties: skewed SCC sizes and dynamically changing structure.

  • A hybrid parallelism method is proposed to deal with skewed SCC sizes.

  • Traversal strategies are customized to fit dynamically changing graph structure.

  • Our method outperforms existing GPU and OpenMP implementations.

Abstract

Detecting strongly connected components (SCCs) is a practical graph analytics task widely used in many application domains. To accelerate SCC detection, parallel algorithms have been proposed and implemented on GPUs. However, existing GPU implementations show unstable performance across different graphs, especially real-world graphs, because they do not take graph properties into account. In this paper, we observe that graphs in SCC detection usually exhibit (1) skewed component sizes (the static property) and (2) a dynamically changing graph structure (the dynamic property). To deal with these irregular graph properties, we propose a hybrid method that divides the algorithm into two phases and exploits different levels of parallelism for different-sized components. We also customize the graph traversal strategies for each phase to handle the dynamically changing graph structure. Our method is carefully implemented to take advantage of the GPU hardware. Evaluation with diverse synthetic and real-world graphs shows that our method substantially improves on existing GPU implementations in both performance and applicability. It also achieves average speedups of 5.6× and 1.5× over the sequential and OpenMP implementations on the CPU, respectively.

Introduction

Strongly connected component (SCC) detection is a fundamental graph analysis problem that arises in many application domains. Tarjan’s algorithm [1] is an efficient sequential method for SCC detection. However, parallelizing Tarjan’s algorithm is challenging because it relies on an inherently sequential DFS (depth-first search) traversal of the graph. To accelerate SCC detection for large-scale graphs, parallel algorithms based on BFS (breadth-first search) traversal have been proposed. The Forward-Backward (FB) algorithm [2] and its enhancement FB-Trim [3] are practical algorithms that improve performance.
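To make the FB and FB-Trim ideas concrete, the following is a minimal sequential sketch in Python of the scheme as it is commonly described: trim vertices that cannot lie on a cycle, pick a pivot, intersect its forward and backward reachable sets to obtain one SCC, and repeat on the three leftover partitions. It is an illustration under stated assumptions, not the implementation evaluated in this paper; the function names (reachable, trim, fb_trim) and the arbitrary pivot choice are hypothetical.

```python
from collections import deque


def reachable(adj, start, alive):
    """BFS from `start`, visiting only vertices that are still in `alive`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in alive and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def trim(adj, radj, alive):
    """Trim step: repeatedly peel off vertices that have no in-neighbour or
    no out-neighbour (other than themselves) inside `alive`. Each such vertex
    is a trivial, single-vertex SCC. Mutates `alive`."""
    trivial = []
    changed = True
    while changed:
        changed = False
        for u in list(alive):
            has_out = any(v in alive and v != u for v in adj[u])
            has_in = any(v in alive and v != u for v in radj[u])
            if not (has_out and has_in):
                alive.discard(u)
                trivial.append({u})
                changed = True
    return trivial


def fb_trim(n, edges):
    """Return the SCCs of a directed graph with vertices 0..n-1 as sets."""
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]   # reversed graph for backward reachability
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)

    sccs = []
    work = [set(range(n))]          # each entry is an independent subproblem
    while work:
        alive = work.pop()
        sccs.extend(trim(adj, radj, alive))
        if not alive:
            continue
        pivot = next(iter(alive))   # arbitrary pivot (an assumption of this sketch)
        fw = reachable(adj, pivot, alive)    # forward reachable set
        bw = reachable(radj, pivot, alive)   # backward reachable set
        scc = fw & bw                        # the pivot's SCC
        sccs.append(scc)
        # No SCC spans two of these partitions, so they can be processed independently.
        for part in (fw - scc, bw - scc, alive - fw - bw):
            if part:
                work.append(part)
    return sccs
```

The two reachability searches are the BFS traversals that parallel implementations accelerate, and the three partitions pushed onto the work list are where independent subproblems appear.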

Barnat et al. [4] implemented the FB-Trim algorithm on the GPU using the CUDA programming model. Thanks to parallelization, their implementation achieves high performance for some randomly generated graphs. However, it does not fully take graph properties into consideration and therefore does not work well across different types of graphs, especially real-world graphs [5], [6].

On the other hand, Hong et al. [7] improved the FB-Trim algorithm with an efficient parallel SCC detection method for CPUs, designed specifically for real-world graphs. They used a two-phase method to handle small-world graphs and obtained large speedups on multicore CPUs. Hong’s work implies that graph algorithms should be aware of graph properties and adjust to handle different situations. As our evaluation shows, graph properties are also critical for GPU implementations.

Real-world graphs from social networks usually exhibit the small-world property with a power-law degree distribution, and therefore usually contain one giant SCC and many small nontrivial SCCs (i.e., skewed component sizes, the static graph property). In addition, the graph structure changes dynamically during SCC detection: once an SCC is detected, it is removed from the graph (i.e., the dynamic graph property, which is induced by the algorithm). Therefore, after the giant SCC is detected and removed, the remaining graph contains a large number of disconnected small subgraphs. Previous GPU implementations (e.g., Barnat’s implementation) cannot handle such cases efficiently, as they become almost serialized when processing the remaining subgraphs.
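The effect of removing the giant SCC can be illustrated with a small Python sketch. It assumes the giant SCC is already known and groups the remaining vertices into weakly connected pieces with a union-find structure; the function name and the toy graph are hypothetical and serve only to show how the remainder falls apart into independent subgraphs.

```python
def weakly_connected_pieces(vertices, edges):
    """Group `vertices` into weakly connected pieces, using only the edges
    whose two endpoints both belong to `vertices` (direction is ignored)."""
    parent = {v: v for v in vertices}

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if u in parent and v in parent:
            parent[find(u)] = find(v)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


# Toy graph: a giant SCC {0, 1, 2, 3} with small components hanging off it.
edges = [(0, 1), (1, 2), (2, 3), (3, 0),   # the giant cycle
         (3, 4), (4, 5), (5, 4),           # small SCC {4, 5}
         (6, 0), (6, 7), (7, 6),           # small SCC {6, 7}
         (2, 8)]                           # trivial SCC {8}
giant = {0, 1, 2, 3}
remaining = set(range(9)) - giant          # the graph after the giant SCC is removed
pieces = weakly_connected_pieces(remaining, edges)
```

In this toy graph the small components are connected to each other only through the giant SCC, so once it is removed each piece can be handled on its own.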

In this work, we propose an efficient hybrid SCC detection method on the GPU to overcome the limitations of existing GPU implementations. Our method is designed by taking graph properties into account. First, to deal with the static property, we decompose SCC detection into two phases: processing the giant SCC and processing the remaining small nontrivial SCCs. The two phases utilize different parallelism approaches. The single giant SCC offers abundant data parallelism, while the large number of small SCCs can benefit from task parallelism. To enable efficient task parallelism in the second phase, we examine optimizations that were previously used in CPU SCC detection and port them to the GPU. Second, to deal with the dynamic property, we devise different BFS traversal strategies and choose the suitable one for each phase. By using the two-phase hybrid method and by customizing the graph traversal strategies, our method achieves high performance for a large variety of synthetic and real-world graphs.
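The two-phase structure can be sketched on the CPU as follows. This is only an illustration of the control flow under several assumptions: the pivot heuristic (largest in-degree times out-degree) is a common choice but is not confirmed by the text above, the level-synchronous BFS stands in for a data-parallel GPU kernel, and a Python thread pool stands in for GPU task parallelism (it shows the task structure, not a real speedup). All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def frontier_bfs(adj, source, alive):
    """Level-synchronous BFS restricted to `alive`. Each level's frontier is
    expanded as a whole; on a GPU this loop body would be one parallel kernel."""
    visited = {source}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v in alive and v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return visited


def fb_sccs(adj, radj, part):
    """Plain Forward-Backward on one vertex subset, i.e. one independent task."""
    sccs, work = [], [set(part)]
    while work:
        alive = work.pop()
        pivot = next(iter(alive))
        fw = frontier_bfs(adj, pivot, alive)
        bw = frontier_bfs(radj, pivot, alive)
        scc = fw & bw
        sccs.append(scc)
        work.extend(p for p in (fw - scc, bw - scc, alive - fw - bw) if p)
    return sccs


def hybrid_scc(n, edges, workers=4):
    """Return (giant, small): the SCC found in phase 1 and the list of SCCs
    found in phase 2. Assumes n >= 1."""
    adj = [[] for _ in range(n)]
    radj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    alive = set(range(n))

    # Phase 1: one large traversal pair from a high-degree pivot (assumed
    # heuristic), which is likely to land inside the giant SCC.
    pivot = max(alive, key=lambda u: len(adj[u]) * len(radj[u]))
    fw = frontier_bfs(adj, pivot, alive)
    bw = frontier_bfs(radj, pivot, alive)
    giant = fw & bw

    # Phase 2: the leftover partitions share no SCC, so each is its own task.
    parts = [p for p in (fw - giant, bw - giant, alive - fw - bw) if p]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: fb_sccs(adj, radj, p), parts))
    small = [scc for res in results for scc in res]
    return giant, small
```

Phase 1 spends its time inside two wide traversals, while phase 2 consists of many short, independent searches, which is why the two phases favour different kinds of parallelism.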

We validate the effectiveness and efficiency of our hybrid method using CUDA on the NVIDIA GPU. Evaluation with diverse synthetic as well as real-world graphs shows that our method significantly outperforms existing GPU implementations. We also compare our method with the state-of-the-art sequential and OpenMP implementations on the CPU, and we achieve average speedups of 5.6× and 1.5×, respectively.

The main contributions in this work are:

(1) We propose a hybrid SCC detection method that decomposes the SCC detection into two phases and enables different parallelism approaches for different phases to deal with graph irregularities.

(2) We examine the state-of-the-art graph traversal strategies and apply the best-performing strategy to fit the graph properties of each SCC phase.

(3) We port optimization techniques proposed in CPU SCC detection to our GPU implementation to exploit more parallelism.

(4) We demonstrate the effectiveness and efficiency of our hybrid method by implementing and evaluating the proposed method with different types of synthetic and real-world graphs.

The rest of the paper is organized as follows: Section 2 introduces the existing parallel algorithms as well as the state-of-the-art GPU implementations. Section 3 details our proposed design. The experimental evaluation is presented in Section 4. We discuss related work in Section 5, and we conclude the paper in Section 6.

Section snippets

Background and motivation

A strongly connected component of a directed graph is a maximal subgraph in which there is a directed path from every vertex to every other vertex. SCC detection, which decomposes a given directed graph into a set of disjoint SCCs, is widely used in many graph analytics applications, including web and social network analysis [8], formal verification [9], reinforcement learning [10], mesh refinement [3], computer-aided design [11] and scientific computing [12].
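As a reference point for the decomposition described above, here is a compact Python version of Tarjan's sequential algorithm [1]. It is a textbook-style sketch rather than code from this work; it uses recursion for brevity, so very deep graphs would need an iterative variant or a larger recursion limit.

```python
def tarjan_scc(n, edges):
    """Return the SCCs of a directed graph with vertices 0..n-1 as sets."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    index = [None] * n        # DFS discovery order, None means unvisited
    low = [0] * n             # smallest index reachable via the DFS subtree
    on_stack = [False] * n
    stack, sccs = [], []
    counter = 0

    def visit(u):
        nonlocal counter
        index[u] = low[u] = counter
        counter += 1
        stack.append(u)
        on_stack[u] = True
        for v in adj[u]:
            if index[v] is None:          # tree edge: recurse first
                visit(v)
                low[u] = min(low[u], low[v])
            elif on_stack[v]:             # edge back into the current DFS stack
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:            # u is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack[w] = False
                comp.add(w)
                if w == u:
                    break
            sccs.append(comp)

    for u in range(n):
        if index[u] is None:
            visit(u)
    return sccs
```

The single depth-first traversal with a shared stack is what makes this algorithm efficient sequentially and, at the same time, hard to parallelize.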

The classic sequential SCC detection

Design and implementation

Despite the irregularity, recent studies [16], [17], [18], [19], [20] demonstrate that GPUs can substantially accelerate graph algorithms with careful design and optimization. In this section, we present our design and implementation of SCC detection that can make good use of the GPU hardware.

Evaluation

In this section, we evaluate our proposed method with various graph datasets (listed in Table 1). We use the R-MAT [27] graph generator GTGraph [28] to generate rmat-er with the parameters (0.25, 0.25, 0.25, 0.25). We choose kron21 from the 10th DIMACS Implementation Challenge (generated by the Kronecker generator). We also pick real-world graphs from the University of Florida Sparse Matrix Collection [29], the SNAP database [30], and the Koblenz Network Collection [31]. These graphs are
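For readers unfamiliar with R-MAT, the following Python sketch shows how the four parameters are typically used: each edge is placed by recursively choosing one quadrant of the adjacency matrix with probabilities (a, b, c, d). This is a generic illustration, not the GTGraph code; the function name, the scale argument, and the seed are assumptions of the sketch. With all four probabilities equal to 0.25, both endpoints of every edge are uniformly distributed, which gives an Erdős–Rényi-like random graph and presumably explains the name rmat-er.

```python
import random


def rmat_edges(scale, num_edges, probs=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Generate `num_edges` directed edges over 2**scale vertices.

    `probs` = (a, b, c, d) are the probabilities of descending into the
    top-left, top-right, bottom-left and bottom-right quadrant at each level.
    Duplicate edges and self-loops are not filtered in this sketch."""
    a, b, c, _d = probs
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        src = dst = 0
        for _level in range(scale):
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass                  # top-left: both new bits are 0
            elif r < a + b:
                dst |= 1              # top-right: column bit set
            elif r < a + b + c:
                src |= 1              # bottom-left: row bit set
            else:
                src |= 1              # bottom-right: both bits set
                dst |= 1
        edges.append((src, dst))
    return edges
```

Skewing the probabilities toward one quadrant, instead of using four equal values, is what produces power-law-like degree distributions in R-MAT graphs.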

Related work

Parallel SCC detection is an important graph analysis problem that has been studied intensively. As mentioned, Hong et al. were the first to use the WCC method to handle small-world graphs, and Barnat et al. were the first to implement the FB-Trim algorithm on GPUs. Inspired by Hong’s work, Slota et al. [21] proposed a Multistep strategy to deal with small-world graphs on CPUs. Their approach combines BFS and coloring-based methods and uses them in different algorithm steps. Slota

Conclusion

SCC detection is an important graph algorithm that has been applied in many application domains. Existing GPU implementations cannot efficiently process different types of graphs because the implementations are not aware of the graph properties. In this paper, we demonstrate that it is of great importance to understand the graph properties for accelerating SCC detection. There are two types of properties: (1) the static property, i.e. the small-world and power-law property which leads to skewed

Acknowledgment

We thank the anonymous reviewers for the insightful comments and suggestions. This work is partly supported by the National Natural Science Foundation of China (NSFC) No. 61502514, No. 61402488, and No. 61602501, and the National Key Research and Development Program of China under grant No. 2016YFB0200400.

References (33)

  • A. Broder et al.

    Graph structure in the web

    Comput. Networks

    (2000)
  • NVIDIA, 2015, CUDA C Programming Guide...
  • G.M. Slota et al.

    BFS and coloring-based parallel algorithms for strongly connected components and related problems

    Proceedings of IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS)

    (2014)
  • R. Tarjan

    Depth-first search and linear graph algorithms

    SIAM J. Comput.

    (1972)
  • L. Fleischer et al.

    On identifying strongly connected components in parallel

    Proceedings of the 15th IPDPS Workshops, IPDPS ’00

    (2000)
  • W. McLendon et al.

    Finding strongly connected components in distributed graphs

    J. Parallel Distributed Comput. (JPDC)

    (2005)
  • J. Barnat et al.

    Computing strongly connected components in parallel on CUDA

    Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS), IPDPS ’11

    (2011)
  • A. Mislove et al.

    Measurement and analysis of online social networks

    Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC ’07

    (2007)
  • S. Hong et al.

    On fast parallel detection of strongly connected components (SCC) in small-world graphs

    Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), SC ’13

    (2013)
  • R. Kumar et al.

    Structure and evolution of online social networks

    Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), KDD ’06

    (2006)
  • R. Hojati et al.

    BDD-based debugging of design using language containment and fair CTL

    Proceedings of the 5th International Conference on Computer Aided Verification, CAV ’93

    (1993)
  • S.J. Kazemitabar et al.

    Automatic discovery of subgoals in reinforcement learning using strongly connected components

    Proceedings of the 15th International Conference on Advances in Neuro-information Processing - Volume Part I, ICONIP’08

    (2009)
  • A. Xie et al.

    Implicit enumeration of strongly connected components and an application to formal verification

    Trans. Comp.-Aided Des. Integ. Cir. Sys.

    (2006)
  • A. Pothen et al.

    Computing the block triangular form of a sparse matrix

    ACM Trans. Math. Softw. (TOMS)

    (1990)
  • J.H. Reif

    Depth-first search is inherently sequential

    Inf. Process Lett.

    (1985)
  • M. Stuhl, Computing Strongly Connected Components With CUDA. Master Thesis, Masaryk University,...

The source code of this work can be found at https://github.com/chenxuhao/gardenia