Skip to main content
Log in

I/O efficient: computing SCCs in massive graphs

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

A strongly connected component (\(\mathsf {SCC}\)) is a maximal subgraph of a directed graph \(G\) in which every pair of nodes is reachable from each other in the \(\mathsf {SCC}\). With such a property, a general directed graph can be represented by a directed acyclic graph (DAG) by contracting every \(\mathsf {SCC}\) of \(G\) to a node in DAG. In many real applications that need graph pattern matching, topological sorting, or reachability query processing, the best way to deal with a general directed graph is to deal with its DAG representation. Therefore, finding all \(\mathsf {SCC}\)s in a directed graph \(G\) is a critical operation. The existing in-memory algorithms based on depth first search (DFS) can find all \(\mathsf {SCC}\)s in linear time with respect to the size of a graph. However, when a graph cannot reside entirely in the main memory, the existing external or semi-external algorithms to find all \(\mathsf {SCC}\)s have limitation to achieve high I/O efficiency. In this paper, we study new I/O-efficient semi-external algorithms to find all \(\mathsf {SCC}\)s for a massive directed graph \(G\) that cannot reside in main memory entirely. To overcome the deficiency of the existing DFS-based semi-external algorithm that heavily relies on a total order, we explore a weak order based on which we investigate new algorithms. We propose a new two-phase algorithm, namely, tree construction and tree search. In the tree construction phase, a spanning tree of \(G\) can be constructed in bounded number of sequential scans of \(G\). In the tree search phase, it needs to sequentially scan the graph once to find all \(\mathsf {SCC}\)s. In addition, we propose a new single-phase algorithm, which combines the tree construction and tree search phases into a single phase, with three new optimization techniques. They are early acceptance, early rejection, and batch processing. By the single-phase algorithm with the new optimization techniques, we can significantly reduce the number of I/Os and the CPU cost. We prove the correctness of the algorithms. We conduct extensive experimental studies using 4 real datasets including a massive real dataset and several synthetic datasets to confirm the I/O efficiency of our approaches.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24
Fig. 25
Fig. 26
Fig. 27

Similar content being viewed by others

Notes

  1. http://newsroom.fb.com/.

  2. http://barcelona.research.yahoo.net/webspam/datasets/uk2007/.

  3. The up-edge will be used instead forward-cross-edge in the condition (line 8, Algorithm 1).

  4. There exist in-memory approaches to find all \(\mathsf {SCC}\)s which only need DFS the graph once (e.g., Tarjan algorithm [22]). However, they need additional memory in addition to the memory for holding the graph, which cannot be efficiently used for the limited memory in our problem.

  5. http://snap.stanford.edu/data.

  6. www.uniprot.org.

  7. http://citeseerx.ist.psu.edu.

  8. http://barcelona.research.yahoo.net/webspam/datasets/uk2007/links/.

References

  1. Abello, J., Buchsbaum, A.L., Westbrook, J.: A functional approach to external graph algorithms. Algorithmica 32(3), 437–458 (2002)

  2. Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)

  3. Aho, A.V., Hopcroft, J.E., Ullman, J.D.: Data Structures and Algorithms. Addison-Wesley, Reading (1983)

  4. Ajwani, D., Dementiev, R., Meyer, U.: A computational study of external-memory bfs algorithms. In: Proceedings of SODA’06 (2006)

  5. Ajwani, D., Meyer, U.: Algorithmics of Large and Complex Networks, Chapter 1: Design and Engineering of External Memory Traversal Algorithms for General Graphs. Springer, Berlin (2009)

  6. Ajwani, D., Meyer, U., Osipov, V.: Improved external memory bfs implementation. In: Proceedings of ALENEX’07 (2007)

  7. Buchsbaum, A.L., Goldwasser, M.H., Venkatasubramanian, S., Westbrook, J.: On external memory graph traversal. In: Proceedings of SODA’00 (2000)

  8. Chiang, Y.-J., Goodrich, M.T., Grove, E.F., Tamassia, R., Vengroff, D.E., Vitter, J.S.: External-memory graph algorithms. In: Proceedings of SODA’95 (1995)

  9. Cormen, T.H., Stein, C., Rivest, R.L., Leiserson, C.E.: Introduction to Algorithms. McGraw-Hill, NY (2001)

  10. Cosgaya-Lozano, A., Zeh, N.: A heuristic strong connectivity algorithm for large graphs. In: Proceedings of SEA’09 (2009)

  11. Dementiev, R., Sanders, P., Schultes, D., Sibeyn, J.F.: Engineering an external memory minimum spanning tree algorithm. In: IFIP TCS (2004)

  12. Fan, W., Li, J., Ma, S., Wang, H., Wu, Y.: Graph homomorphism revisited for graph matching. PVLDB 3(1), 1161–1172 (2010)

  13. Hellings, J., Fletcher, G.H., Haverkort, H.: Efficient external-memory bisimulation on dags. In: Proceedings of SIGMOD’12 (2012)

  14. Kumar, V., Schwabe, E.J.: Improved algorithms and data structures for solving graph problems in external memory. In: Proceedings of SPDP’96 (1996)

  15. Kyrola, A., Blelloch, G., Guestrin, C.: Graphchi: large-scale graph computation on just a pc. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, pp. 31–46, Berkeley, CA, USA. USENIX Association (2012)

  16. Mehlhorn, K., Meyer, U.: External-memory breadth-first search with sublinear i/o. In: Proceedings of ESA’02 (2002)

  17. Meyer, U., Osipov, V.: Design and implementation of a practical i/o-efficient shortest paths algorithm. In: Proceedings of ALENEX’09 (2009)

  18. Meyer, U., Zeh, N.: I/O-efficient undirected shortest paths. In: Proceedings of ESA’03 (2003)

  19. Meyer, U., Zeh, N.: I/O-efficient undirected shortest paths with unbounded edge lengths. In: Proceedings of ESA’06 (2006)

  20. Sibeyn, J.F.: External connected components. In: Proceedings of SWAT’04 (2004)

  21. Sibeyn, J.F., Abello, J., Meyer, U.: Heuristics for semi-external depth first search on directed graphs. In: Proceedings of SPAA’02 (2002)

  22. Tarjan, R.E.: Depth-first search and linear graph algorithms. SIAM J. Comput. 1(2), 146–160 (1972)

    Article  MATH  MathSciNet  Google Scholar 

  23. Vitter, J.S.: External memory algorithms and data structures. ACM Comput. Surv. 33(2), 209–271 (2001)

  24. Yildirim, H., Chaoji, V., Zaki, M.J.: Grail: scalable reachability index for large graphs. PVLDB, 3(1), 276–284 (2010)

  25. Zhang, Z., Yu, J.X., Qin, L., Chang, L., Lin, X.: I/o efficient: computing sccs in massive graphs. In: Proceedings of SIGMOD’13 (2013)

Download references

Acknowledgments

The work was partially supported by grant of the Research Grants Council of the Hong Kong SAR, China No. 418512. The third author was supported by ARC DE140100999, and the fifth author was partially supported by NSFC61232006, NSFC61021004, ARCDP110102937 and ARCDP120104168.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jeffrey Xu Yu.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, Z., Yu, J.X., Qin, L. et al. I/O efficient: computing SCCs in massive graphs. The VLDB Journal 24, 245–270 (2015). https://doi.org/10.1007/s00778-014-0372-z

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-014-0372-z

Keywords

Navigation