Skip to main content
Log in

Set-based approximate approach for lossless graph summarization

  • Published:
Computing Aims and scope Submit manuscript

Abstract

Graph summarization is valuable approach to analyze various real life phenomenon, like communities, influential nodes, and information flow in a big graph. To summarize a graph, nodes having similar neighbors are merged into super nodes and their corresponding edges are compressed into super edges. Existing methods find similar nodes either by nodes ordering or perform pairwise similarity computations. Compression-by-node ordering approaches are scalable but provide lesser compression due to exhaustive similarity computations of their counterparts. In this paper, we propose a novel set-based summarization approach that directly summarizes naturally occurring sets of similar nodes in a graph. Our approach is scalable since we avoid explicit similarity computations with non-similar nodes and merge sets of nodes in each iteration. Similarly, we provide good compression ratio as each set consists of highly similar nodes. To locate sets of similar nodes, we find candidate sets of similar nodes by using locality sensitive hashing. However, member nodes of every candidate set have varying similarities with each other. Therefore, we propose a heuristic based on similarity among degrees of candidate nodes, and a parameter-free pruning technique to effectively identify subset of highly similar nodes from candidate nodes. Through experiments on real world graphs, our approach requires lesser execution time than pairwise graph summarization, with margin of an order of magnitude in graphs containing nodes with highly diverse neighborhood, and produces summary at similar accuracy. Similarly, we observe comparable scalability against the compression-by-node ordering method, while providing better compression ratio.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

Notes

  1. http://snap.stanford.edu/data/index.html. Last accessed on 10/28/2014.

References

  1. Boldi P, Vigna S (2004) The webgraph framework i: compression techniques. In: Proceedings of the 13th international conference on World Wide Web, ACM, pp 595–602

  2. Broder AZ (1997) On the resemblance and containment of documents. In: Proceedings of Compression and Complexity of Sequences 1997. IEEE, pp 21–29

  3. Buttler D (2004) A short survey of document structure similarity algorithms. In: International conference on internet computing, pp 3–9

  4. Chen C, Yan X, Zhu F, Han J, Yu PS (2008) Graph olap: towards online analytical processing on graphs. In: Eighth IEEE international conference on data mining, 2008. ICDM’08, IEEE, pp 103–112

  5. Chierichetti F, Kumar R, Lattanzi S, Mitzenmacher M, Panconesi A, Raghavan P (2009) On compressing social networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 219–228

  6. Chum O, Philbin J, Isard M, Zisserman A (2007) Scalable near identical image and shot detection. In: Proceedings of the 6th ACM international conference on Image and video retrieval, ACM, pp 549–556

  7. Ghazizadeh S, Chawathe SS (2002) Seus: structure extraction using summaries. In: Discovery science. Springer, Berlin, pp 71–85

  8. Gorisse D, Cord M, Precioso F (2012) Locality-sensitive hashing for chi2 distance. IEEE Trans Pattern Anal Mach Intell 34(2):402–409

    Article  Google Scholar 

  9. Hernández C, Navarro G (2013) Compressed representations for web and social graphs. In: Knowledge and information systems, pp 1–35

  10. Hjaltason GR, Samet H (2003) Index-driven similarity search in metric spaces (survey article). ACM Trans Database Syst (TODS) 28(4):517–580

    Article  Google Scholar 

  11. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, ACM, pp 604–613

  12. Kang U, Faloutsos C (2011) Beyond’caveman communities’: Hubs and spokes for graph compression and mining. In: 2011 IEEE 11th international conference on data mining (ICDM), IEEE, pp 300–309

  13. Ketkar NS, Holder LB, Cook DJ (2005) Subdue: compression-based frequent pattern discovery in graph data. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, pp 71–76

  14. Koutra D, Kang U, Vreeken J, Faloutsos C (2014) VOG: summarizing and understanding large graphs. In: Proceedings of the 2014 SIAM international conference on data mining, Philadelphia, Pennsylvania, USA, April 24–26, 2014, pp 91–99. doi:10.1137/1.9781611973440.11

  15. LeFevre K, Terzi E (2010) Grass: graph structure summarization. Proceedings of the SIAM international conference on data mining, SDM 2010, April 29–May 1, 2010. Columbus, Ohio, pp 454–465

    Google Scholar 

  16. Liu S, Chen L, Ni LM, Fan J (2011) Cim: categorical influence maximization. In: Proceedings of the 5th international conference on ubiquitous information management and communication, ACM, p 124

  17. Liu S, Wang S, Zhu F, Zhang J, Krishnan R (2014) Hydra: large-scale social identity linkage via heterogeneous behavior modeling. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of data, ACM, pp 51–62

  18. Macropol K, Singh A (2010) Scalable discovery of best clusters on large graphs. Proc VLDB Endowment 3(1–2):693–702

    Article  Google Scholar 

  19. Mu Y, Yan S (2010) Non-metric locality-sensitive hashing. In: AAAI

  20. Nanopoulos A, Manolopoulos Y (2002) Efficient similarity search for market basket data. VLDB J 11(2):138–152

    Article  Google Scholar 

  21. Navlakha S, Rastogi R, Shrivastava N (2008) Graph summarization with bounded error. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, ACM, pp 419–432

  22. Qu Q, Zhu F, Yan X, Han J, Philip SY, Li H (2011) Efficient topological olap on information networks. In: Database systems for advanced applications. Springer, Berlin, pp 389–403

  23. Qu Q, Liu S, Jensen CS, Zhu F, Faloutsos C (2014) Interestingness-driven diffusion process summarization in dynamic networks. In: Machine learning and knowledge discovery in databases. Springer, Berlin, pp 597–613

  24. Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press, Cambridge

  25. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471

    Article  MATH  Google Scholar 

  26. Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, ACM, pp 785–796

  27. Tian Y, Hankins RA, Patel JM (2008) Efficient aggregation for graph summarization. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, pp 567–580

  28. Toivonen H, Zhou F, Hartikainen A, Hinkka A (2011) Compression of weighted graphs. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 965–973

  29. Tomar VS, Rose RC (2013) Efficient manifold learning for speech recognition using locality sensitive hashing. In: 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, pp 6995–6999

  30. Wang J, Shen HT, Song J, Ji J (2014) Hashing for similarity search: a survey. arXiv:1408.2927

  31. Xiong Y, Zhu Y, Yu P (2014) Top-k similarity join in heterogeneous information networks. In: IEEE Trans Knowledge Data Eng, PP(99), 1. doi:10.1109/TKDE.2014.2373385

  32. Yan X, Han J (2002) gspan: Graph-based substructure pattern mining. In: 2002 IEEE international conference on data mining, 2002. ICDM 2003. Proceedings, IEEE, pp 721–724

  33. Yan X, Zhu F, Yu PS, Han J (2006) Feature-based similarity search in graph structures. ACM Trans Database Syst (TODS) 31(4):1418–1453

    Article  Google Scholar 

  34. Yin M, Wu B, Zeng Z (2012) Hmgraph olap: a novel framework for multi-dimensional heterogeneous network analysis. In: Proceedings of the fifteenth international workshop on Data warehousing and OLAP, ACM, pp 137–144

  35. Zhao P, Li X, Xin D, Han J (2011) Graph cube: on warehousing and olap multidimensional networks. In: Proceedings of the 2011 ACM SIGMOD international conference on management of data, ACM, pp 853–864

  36. Zhu F, Zhang Z, Qu Q (2013) A direct mining approach to efficient constrained graph pattern discovery. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, pp 821–832

Download references

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2014R1A2A1A05043734).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Young-Koo Lee.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Khan, K.U., Nawaz, W. & Lee, YK. Set-based approximate approach for lossless graph summarization. Computing 97, 1185–1207 (2015). https://doi.org/10.1007/s00607-015-0454-9

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00607-015-0454-9

Keywords

Mathematics Subject Classification

Navigation