Abstract
Graph summarization is valuable approach to analyze various real life phenomenon, like communities, influential nodes, and information flow in a big graph. To summarize a graph, nodes having similar neighbors are merged into super nodes and their corresponding edges are compressed into super edges. Existing methods find similar nodes either by nodes ordering or perform pairwise similarity computations. Compression-by-node ordering approaches are scalable but provide lesser compression due to exhaustive similarity computations of their counterparts. In this paper, we propose a novel set-based summarization approach that directly summarizes naturally occurring sets of similar nodes in a graph. Our approach is scalable since we avoid explicit similarity computations with non-similar nodes and merge sets of nodes in each iteration. Similarly, we provide good compression ratio as each set consists of highly similar nodes. To locate sets of similar nodes, we find candidate sets of similar nodes by using locality sensitive hashing. However, member nodes of every candidate set have varying similarities with each other. Therefore, we propose a heuristic based on similarity among degrees of candidate nodes, and a parameter-free pruning technique to effectively identify subset of highly similar nodes from candidate nodes. Through experiments on real world graphs, our approach requires lesser execution time than pairwise graph summarization, with margin of an order of magnitude in graphs containing nodes with highly diverse neighborhood, and produces summary at similar accuracy. Similarly, we observe comparable scalability against the compression-by-node ordering method, while providing better compression ratio.








Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Notes
http://snap.stanford.edu/data/index.html. Last accessed on 10/28/2014.
References
Boldi P, Vigna S (2004) The webgraph framework i: compression techniques. In: Proceedings of the 13th international conference on World Wide Web, ACM, pp 595–602
Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of Sequences 1997. IEEE, pp 21–29
Buttler D (2004) A short survey of document structure similarity algorithms. In: International conference on internet computing, pp 3–9
Chen C, Yan X, Zhu F, Han J, Yu PS (2008) Graph olap: towards online analytical processing on graphs. In: Eighth IEEE international conference on data mining, 2008. ICDM’08, IEEE, pp 103–112
Chierichetti F, Kumar R, Lattanzi S, Mitzenmacher M, Panconesi A, Raghavan P (2009) On compressing social networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 219–228
Chum O, Philbin J, Isard M, Zisserman A (2007) Scalable near identical image and shot detection. In: Proceedings of the 6th ACM international conference on Image and video retrieval, ACM, pp 549–556
Ghazizadeh S, Chawathe SS (2002) Seus: structure extraction using summaries. In: Discovery science. Springer, Berlin, pp 71–85
Gorisse D, Cord M, Precioso F (2012) Locality-sensitive hashing for chi2 distance. IEEE Trans Pattern Anal Mach Intell 34(2):402–409
Hernández C, Navarro G (2013) Compressed representations for web and social graphs. In: Knowledge and information systems, pp 1–35
Hjaltason GR, Samet H (2003) Index-driven similarity search in metric spaces (survey article). ACM Trans Database Syst (TODS) 28(4):517–580
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, ACM, pp 604–613
Kang U, Faloutsos C (2011) Beyond’caveman communities’: Hubs and spokes for graph compression and mining. In: 2011 IEEE 11th international conference on data mining (ICDM), IEEE, pp 300–309
Ketkar NS, Holder LB, Cook DJ (2005) Subdue: compression-based frequent pattern discovery in graph data. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, pp 71–76
Koutra D, Kang U, Vreeken J, Faloutsos C (2014) VOG: summarizing and understanding large graphs. In: Proceedings of the 2014 SIAM international conference on data mining, Philadelphia, Pennsylvania, USA, April 24–26, 2014, pp 91–99. doi:10.1137/1.9781611973440.11
LeFevre K, Terzi E (2010) Grass: graph structure summarization. Proceedings of the SIAM international conference on data mining, SDM 2010, April 29–May 1, 2010. Columbus, Ohio, pp 454–465
Liu S, Chen L, Ni LM, Fan J (2011) Cim: categorical influence maximization. In: Proceedings of the 5th international conference on ubiquitous information management and communication, ACM, p 124
Liu S, Wang S, Zhu F, Zhang J, Krishnan R (2014) Hydra: large-scale social identity linkage via heterogeneous behavior modeling. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of data, ACM, pp 51–62
Macropol K, Singh A (2010) Scalable discovery of best clusters on large graphs. Proc VLDB Endowment 3(1–2):693–702
Mu Y, Yan S (2010) Non-metric locality-sensitive hashing. In: AAAI
Nanopoulos A, Manolopoulos Y (2002) Efficient similarity search for market basket data. VLDB J 11(2):138–152
Navlakha S, Rastogi R, Shrivastava N (2008) Graph summarization with bounded error. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, ACM, pp 419–432
Qu Q, Zhu F, Yan X, Han J, Philip SY, Li H (2011) Efficient topological olap on information networks. In: Database systems for advanced applications. Springer, Berlin, pp 389–403
Qu Q, Liu S, Jensen CS, Zhu F, Faloutsos C (2014) Interestingness-driven diffusion process summarization in dynamic networks. In: Machine learning and knowledge discovery in databases. Springer, Berlin, pp 597–613
Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press, Cambridge
Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471
Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, ACM, pp 785–796
Tian Y, Hankins RA, Patel JM (2008) Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, pp 567–580
Toivonen H, Zhou F, Hartikainen A, Hinkka A (2011) Compression of weighted graphs. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 965–973
Tomar VS, Rose RC (2013) Efficient manifold learning for speech recognition using locality sensitive hashing. In: 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, pp 6995–6999
Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. arXiv:1408.2927
Xiong Y, Zhu Y, Yu P (2014) Top-k similarity join in heterogeneous information networks. In: IEEE Trans Knowledge Data Eng, PP(99), 1. doi:10.1109/TKDE.2014.2373385
Yan X, Han J (2002) gspan: Graph-based substructure pattern mining. In: 2002 IEEE international conference on data mining, 2002. ICDM 2003. Proceedings, IEEE, pp 721–724
Yan X, Zhu F, Yu PS, Han J (2006) Feature-based similarity search in graph structures. ACM Trans Database Syst (TODS) 31(4):1418–1453
Yin M, Wu B, Zeng Z (2012) Hmgraph olap: a novel framework for multi-dimensional heterogeneous network analysis. In: Proceedings of the fifteenth international workshop on Data warehousing and OLAP, ACM, pp 137–144
Zhao P, Li X, Xin D, Han J (2011) Graph cube: on warehousing and olap multidimensional networks. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, pp 853–864
Zhu F, Zhang Z, Qu Q (2013) A direct mining approach to efficient constrained graph pattern discovery. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, pp 821–832
Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2014R1A2A1A05043734).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Khan, K.U., Nawaz, W. & Lee, YK. Set-based approximate approach for lossless graph summarization. Computing 97, 1185–1207 (2015). https://doi.org/10.1007/s00607-015-0454-9
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00607-015-0454-9