Graph summarization with quality guarantees

Riondato, Matteo; García-Soriano, David; Bonchi, Francesco

doi:10.1007/s10618-016-0468-8


Published: 06 June 2016

Volume 31, pages 314–349, (2017)

Data Mining and Knowledge Discovery

Matteo Riondato (ORCID: orcid.org/0000-0003-2523-4420)¹, David García-Soriano² & Francesco Bonchi³


Abstract

We study the problem of graph summarization. Given a large graph we aim at producing a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation. In this work we study a very natural type of summary: the original set of vertices is partitioned into a small number of supernodes connected by superedges to form a complete weighted graph. The superedge weights are the edge densities between vertices in the corresponding supernodes. To quantify the dissimilarity between the original graph and a summary, we adopt the reconstruction error and the cut-norm error. By exposing a connection between graph summarization and geometric clustering problems (i.e., k-means and k-median), we develop the first polynomial-time approximation algorithms to compute the best possible summary of a certain size under both measures. We discuss how to use our summaries to store a (lossy or lossless) compressed graph representation and to approximately answer a large class of queries about the original graph, including adjacency, degree, eigenvector centrality, and triangle and subgraph counting. Using the summary to answer queries is very efficient as the running time to compute the answer depends on the number of supernodes in the summary, rather than the number of nodes in the original graph.
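The summary described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it takes a partition as given and only computes the superedge densities, one possible reconstruction error, and a degree estimate. The function names are hypothetical, and two details are assumptions rather than statements from the abstract: the density of a pair of supernodes is taken as the mean of the corresponding block of the adjacency matrix (diagonal entries included), and the reconstruction error is taken as the entrywise ℓ1 difference between the adjacency matrix and the block-constant matrix of densities.

```python
from itertools import product


def summarize(adj, parts):
    """Return the k x k matrix of superedge densities.

    adj   : n x n 0/1 adjacency matrix (list of lists), assumed symmetric
    parts : list of k disjoint lists of vertex indices covering 0..n-1
    Assumption: density = mean of the adjacency block, diagonal included.
    """
    k = len(parts)
    dens = [[0.0] * k for _ in range(k)]
    for a, b in product(range(k), repeat=2):
        total = sum(adj[u][v] for u in parts[a] for v in parts[b])
        dens[a][b] = total / (len(parts[a]) * len(parts[b]))
    return dens


def l1_reconstruction_error(adj, parts, dens):
    """Sum over all (u, v) of |adj[u][v] - density of the block of (u, v)|.

    Assumption: entrywise l1 error; other norms are possible.
    """
    block = {u: a for a, part in enumerate(parts) for u in part}
    n = len(adj)
    return sum(abs(adj[u][v] - dens[block[u]][block[v]])
               for u in range(n) for v in range(n))


def approx_degree(supernode, sizes, dens):
    """Estimate the degree of any vertex in `supernode` from the summary.

    Uses only the k supernode sizes and one row of densities, so the cost
    depends on the number of supernodes, not on the size of the graph.
    """
    return sum(dens[supernode][b] * sizes[b] for b in range(len(sizes)))
```

For the path graph 0-1-2-3 with supernodes {0, 1} and {2, 3}, this sketch gives density 0.5 inside each supernode and 0.25 across them, an ℓ1 error of 7, and an estimated degree of 1.5 for the vertices of the first supernode (their true degrees are 1 and 2).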


Notes

We discuss the case of directed graphs in Sect. 3.5.
A skew-symmetric matrix (also known as antisymmetric or antimetric matrix) is a square matrix A whose transpose is also its negative: $-A = A^\intercal $.
If $v_1, \ldots , v_n \in {\mathbb R}^d$, then $\left\| {v_i - v_j} \right\| _2^2 = \left\| {v_i} \right\| _2^2 + \left\| {v_j} \right\| _2^2 - 2 \langle v_i, v_j \rangle $. Since the quantities $\left\| {v_i} \right\| _2^2$ can be easily precomputed, the problem reduces to computing all inner products $\langle v_i, v_j \rangle $. These form the entries of $A A^\intercal $, where A is the $n\times d$ matrix with rows $v_1, \ldots , v_n$.
For $\ell _2$, we can also use the Johnson-Lindenstrauss transform (Johnson and Lindenstrauss 1984).
We denote as $\left( {\begin{array}{c}X\\ k\end{array}}\right) $ the set of k-subsets of X, i.e., the subsets of X of size k.
Further space-saving can be achieved by storing only densities above a certain threshold using adjacency lists; the superedges removed increase the reconstruction error.
Minor modifications are needed if self-loops are allowed.
http://snap.stanford.edu/data/.
http://irefindex.org.
For speed reasons, we modified the algorithm by Arya et al. (2004) to try only a limited number of local improvements and did not run it to completion. It could otherwise achieve even better approximations.
The implementation is available from https://github.com/rionda/graphsumm.
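The note above on squared Euclidean distances reduces all pairwise distances to the entries of $A A^\intercal$. The following is a small sketch of that identity only; the function names are hypothetical, and the product is computed with plain nested loops, whereas a practical implementation would presumably rely on an optimized matrix multiplication routine.

```python
def gram(rows):
    """Return A A^T for the matrix A whose rows are the given vectors."""
    return [[sum(x * y for x, y in zip(r, s)) for s in rows] for r in rows]


def pairwise_sq_dists(rows):
    """Squared Euclidean distances from squared norms and inner products.

    Uses ||v_i - v_j||^2 = ||v_i||^2 + ||v_j||^2 - 2 <v_i, v_j>.
    """
    g = gram(rows)
    n = len(rows)
    # g[i][i] is the squared norm of row i, so no separate pass is needed.
    return [[g[i][i] + g[j][j] - 2 * g[i][j] for j in range(n)]
            for i in range(n)]
```

For the vectors (0, 0), (3, 4), and (1, 1), the squared distances obtained this way are 25, 2, and 13 for the three pairs, matching a direct computation.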

References

Aggarwal A, Deshpande A, Kannan R (2009) Adaptive sampling for k-means clustering. Approximation, randomization, and combinatorial optimization. Algorithms and techniques, APPROX-RANDOM. Springer, Berlin, pp 15–28
Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248
Alon N, Duke RA, Lefmann H, Rödl V, Yuster R (1994) The algorithmic aspects of the regularity lemma. J Algorithms 16(1):80–109
Alon N, Naor A (2006) Approximating the cut-norm via Grothendieck’s inequality. SIAM J Comput 35(4):787–803
Arthur D, Vassilvitskii S (2007) $k$-means++: the advantages of careful seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, SIAM, SODA ’07, pp 1027–1035
Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for $k$-median and facility location problems. SIAM J Comput 33(3):544–562
Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable $k$-means++. Proc VLDB Endow 5(7):622–633
Boldi P, Santini M, Vigna S (2009) Permuting web and social graphs. Internet Math 6(3):257–283
Boldi P, Rosa M, Santini M, Vigna S (2011) Layered label propagation: a multiresolution coordinate-free ordering for compressing social networks. In: Proceedings of the 20th international conference on World Wide Web, ACM, WWW ’11, pp 587–596
Boldi P, Vigna S (2004) The WebGraph framework I: compression techniques. In: Proceedings of the 13th international conference on World Wide Web, ACM, WWW ’04, pp 595–602
Bonchi F, García-Soriano D, Kutzkov K (2013) Local correlation clustering. arXiv preprint arXiv:1312.5105v1
Campan A, Truta TM (2009) Data and structural k-anonymity in social networks. Privacy, security, and trust in KDD. Springer, Berlin, pp 33–54
Conlon D, Fox J (2012) Bounds for graph regularity and removal lemmas. Geom Funct Anal 22(5):1191–1256
Cormode G, Srivastava D, Yu T, Zhang Q (2010) Anonymizing bipartite graph data using safe groupings. VLDB J 19(1):115–139
Dasgupta S (2008) The hardness of $k$-means clustering. Tech. Rep. 09-16. University of California, San Diego
Dellamonica DJ, Kalyanasundaram S, Martin DM, Rödl V, Shapira A (2012) A deterministic algorithm for the Frieze-Kannan regularity lemma. SIAM J Discret Math 26(1):15–29
Dellamonica DJ, Kalyanasundaram S, Martin DM, Rödl V, Shapira A (2015) An optimal algorithm for finding Frieze-Kannan regular partitions. Comb Prob Comput 24(02):407–437
Fan W, Li J, Wang X, Wu Y (2012) Query preserving graph compression. In: Proceedings of the 2012 ACM SIGMOD international conference on management of data, ACM, SIGMOD ’12, pp 157–168
Frieze A, Kannan R (1999) Quick approximation to matrices and applications. Combinatorica 19(2):175–220
Gowers WT (1997) Lower bounds of tower type for Szemerédi’s uniformity lemma. Geom Funct Anal 7(2):322–337
Hay M, Miklau G, Jensen D, Towsley D, Li C (2010) Resisting structural re-identification in anonymized social networks. VLDB J 19(6):797–823
Hernández C, Navarro G (2011) Compression of web and social graphs supporting neighbor and community queries. In: Proceedings of the 6th ACM workshop on social network mining and analysis, ACM, SNAKDD ’11
Indyk P (2006) Stable distributions, pseudorandom generators, embeddings, and data stream computation. J ACM 53(3):307–323
Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and $k$-median problems using the primal-dual schema and Lagrangian relaxation. J ACM 48(2):274–296
Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. Contemp Math 26:189–206
LeFevre K, Terzi E (2010) GraSS: graph structure summarization. In: Proceedings of the 2010 SIAM international conference on data mining, SIAM, SDM ’10, pp 454–465
Liu Z, Yu JX, Cheng H (2012) Approximate homogeneous graph summarization. Inf Media Technol 7(1):32–43
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
Lovász L (2012) Large networks and graph limits. American Mathematical Society, Providence
Maserrat H, Pei J (2010) Neighbor query friendly compression of social networks. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, KDD ’10, pp 533–542
Megiddo N, Supowit KJ (1984) On the complexity of some common geometric location problems. SIAM J Comput 13(1):182–196
Mettu RR, Plaxton CG (2003) The online median problem. SIAM J Comput 32(3):816–832
Navlakha S, Rastogi R, Shrivastava N (2008) Graph summarization with bounded error. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, SIGMOD ’08, pp 419–432
Riondato M, García-Soriano D, Bonchi F (2014) Graph summarization with quality guarantees. In: 2014 IEEE international conference on data mining, IEEE, ICDM ’14, pp 947–952
Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27–64
Szemerédi E (1976) Regular partitions of graphs. In: Problèmes Combinatoires et Théorie des Graphes, Colloq. Internat. CNRS, Univ. Orsay., pp 399–401
Tassa T, Cohen DJ (2013) Anonymization of centralized and distributed social networks by sequential clustering. IEEE Trans Knowl Data Eng 25(2):311–324
Tian Y, Hankins RA, Patel JM (2008) Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, ACM, SIGMOD ’08, pp 567–580
Toivonen H, Zhou F, Hartikainen A, Hinkka A (2011) Compression of weighted graphs. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, KDD ’11, pp 965–973
Tsourakakis CE (2008) Fast counting of triangles in large real networks without counting: algorithms and laws. In: 2008 IEEE international conference on data mining, IEEE, ICDM ’08, pp 608–617
Vassilevska Williams V (2011) Breaking the Coppersmith–Winograd barrier, unpublished manuscript
Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244
Williams D (1991) Probability with Martingales. Cambridge University Press, Cambridge
Zheleva E, Getoor L (2008) Preserving the privacy of sensitive relationships in graph data. In: Privacy, security, and trust in KDD, Springer, pp 153–171


Acknowledgments

The authors are thankful to the anonymous reviewers of the journal and of IEEE ICDM’14 for their insightful comments that contributed to improving the quality of this article. Matteo Riondato performed part of the work while affiliated to Brown University. He was supported in part by a summer internship at Yahoo Labs Barcelona and by NSF Grant IIS-1247581 and NIH Grant R01-CA180776.

Author information

Authors and Affiliations

Two Sigma Investments LP, New York, NY, USA
Matteo Riondato
Eurecat, Barcelona, Spain
David García-Soriano
ISI Foundation, Turin, Italy
Francesco Bonchi


Corresponding author

Correspondence to Matteo Riondato.

Additional information

Responsible editor: G. Karypis.

A preliminary version of this work appeared in the proceedings of IEEE ICDM’14 (Riondato et al. 2014).


About this article

Cite this article

Riondato, M., García-Soriano, D. & Bonchi, F. Graph summarization with quality guarantees. Data Min Knowl Disc 31, 314–349 (2017). https://doi.org/10.1007/s10618-016-0468-8


Received: 28 July 2015
Accepted: 24 May 2016
Published: 06 June 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s10618-016-0468-8
