GraphZeppelin: How to Find Connected Components (Even When Graphs Are Dense, Dynamic, and Massive)

Published: 16 May 2024

Abstract

Finding the connected components of a graph is a fundamental problem with uses throughout computer science and engineering. The task becomes more difficult when graphs are very large or when they are dynamic, meaning that the edge set changes over time under a stream of edge insertions and deletions. A natural approach to computing connected components on a large, dynamic graph stream is to buy enough RAM to store the entire graph. However, the requirement that the graph fit in RAM is an inherent limitation of this approach and is prohibitive for very large graphs. Thus, there is an unmet need for systems that can process dense dynamic graphs, especially when those graphs are larger than available RAM.
We present GraphZeppelin, a new high-performance streaming graph-processing system for computing the connected components of a graph. GraphZeppelin uses new linear sketching data structures (CubeSketch) to solve the streaming connected components problem and, as a result, requires space asymptotically smaller than that required for a lossless representation of the graph. GraphZeppelin is optimized for massive dense graphs: it can process millions of edge updates (both insertions and deletions) per second, even when the underlying graph is far too large to fit in available RAM. As a result, GraphZeppelin vastly increases the scale of graphs that can be processed.
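To make the role of linearity concrete, the following is a minimal Python sketch of the idea behind sketch-based connectivity, not of GraphZeppelin or CubeSketch, whose internals the abstract does not describe. Each vertex keeps an edge-incidence vector over GF(2); an insertion and a deletion are the same bit flip, and XORing the vectors of a vertex set cancels every edge inside the set, leaving only edges that leave it. As a loud simplification, the vectors here are stored losslessly as Python sets, so this example saves no space; a real linear sketch would compress each vector while still allowing one surviving edge to be sampled. The names `ToggleVectors`, `update`, and `connected_components` are hypothetical.

```python
def edge_key(u, v):
    """Canonical id for the undirected edge {u, v}."""
    return (u, v) if u < v else (v, u)


class ToggleVectors:
    """Per-vertex incidence vectors over GF(2), stored losslessly as sets.

    Assumption: a real linear sketch would replace each set with a small
    compressed summary; the sets stand in for it here.
    """

    def __init__(self, num_vertices):
        self.vec = [set() for _ in range(num_vertices)]

    def update(self, u, v):
        # Insertion and deletion are the same operation: flip the edge's bit
        # in the vectors of both endpoints.
        e = edge_key(u, v)
        self.vec[u] ^= {e}
        self.vec[v] ^= {e}


def connected_components(tv):
    """Merge supernodes along cut edges until no cut edge remains."""
    n = len(tv.vec)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # cut[r] is the XOR of the vectors of all vertices in supernode r,
    # which is exactly the set of edges with one endpoint inside r.
    cut = [set(s) for s in tv.vec]
    progress = True
    while progress:
        progress = False
        for r in range(n):
            if find(r) != r or not cut[r]:
                continue
            a, b = next(iter(cut[r]))  # any edge leaving supernode r
            ra, rb = find(a), find(b)  # distinct, since internal edges cancel
            parent[ra] = rb
            cut[rb] ^= cut[ra]  # edges inside the merged supernode cancel
            cut[ra] = set()
            progress = True

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


tv = ToggleVectors(6)
# Edge (2, 3) is inserted and then deleted by a second toggle.
for u, v in [(0, 1), (1, 2), (3, 4), (2, 3), (2, 3), (4, 5)]:
    tv.update(u, v)
print(connected_components(tv))  # [[0, 1, 2], [3, 4, 5]]
```

Because every step uses only XOR on the per-vertex vectors, the same procedure carries over to any summary that is linear over GF(2), which is what lets a sketch-based system absorb deletions without storing the graph itself.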


Published In

ACM Transactions on Database Systems, Volume 49, Issue 3
September 2024
154 pages
EISSN: 1557-4644
DOI: 10.1145/3613640

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 May 2024
Online AM: 20 February 2024
Accepted: 24 January 2024
Revised: 11 October 2023
Received: 16 March 2023
Published in TODS Volume 49, Issue 3

Author Tags

  1. Linear sketching
  2. streaming algorithms
  3. external memory

Qualifiers

  • Research-article

Funding Sources

  • NSF (National Science Foundation)
