Abstract
Graphs that record the attributes of each vertex and the interactions among vertices are a rich source of real-world knowledge. However, this knowledge is hard to extract because present-day graphs either do not fit in main memory or cannot be processed efficiently. It is therefore preferable to create a meaningful summary graph that is compact yet preserves the intrinsic properties of the underlying graph. In this paper, we propose a summarization approach for a big graph in which each node carries multiple attributes. The main intuition behind our approach comes from a real-life observation: “friends of friends have many common friends and also have similar likes and preferences”. We use this phenomenon to identify sets of nodes that have a common neighborhood and similar attributes, and we summarize each such set. Existing aggregation-based summarization methods use a pairwise heuristic to find similar pairs of nodes for compression. Although pairwise similarity computations can check both neighborhood and attribute similarity, they are impractical for summarizing a big graph. We therefore propose a set-based approach for efficient summarization. To identify each set, we adopt Locality Sensitive Hashing (LSH) to restrict similarity computations to candidate similar nodes only. Since existing LSH techniques consider only neighborhood similarity in a graph, we propose a Unified LSH approach that considers attribute and neighborhood similarities simultaneously. Further, using the Minimum Description Length (MDL) principle, we present a new technique that performs lossless summarization of each set by creating a super node or by adding a new virtual node to the summary graph.
We evaluate our proposed approach against state-of-the-art methods on synthetic and publicly available real-world graphs and observe better results in terms of execution time, compression ratio, and the number of corrections needed to reconstruct the original graph.
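The candidate-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's Unified LSH: it assumes a plain MinHash with banding, applied to the union of a node's neighbor IDs and its attribute key-value pairs, so that nodes sharing both neighborhood and attributes tend to fall into the same bucket. The function names `unified_features`, `minhash_signature`, and `candidate_sets`, the `bands` and `rows` parameters, and the toy graph are all hypothetical.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def unified_features(node, adjacency, attributes):
    """Union of neighbor tokens and attribute tokens for one node (assumed encoding)."""
    neighbors = {f"n:{v}" for v in adjacency[node]}
    attrs = {f"a:{k}={v}" for k, v in attributes.get(node, {}).items()}
    return neighbors | attrs


def minhash_signature(features, num_hashes):
    """MinHash signature: the minimum hash value of the feature set per seed."""
    return tuple(min(_hash(seed, f) for f in features) for seed in range(num_hashes))


def candidate_sets(adjacency, attributes, bands=8, rows=2):
    """Group nodes whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for node in adjacency:
        features = unified_features(node, adjacency, attributes)
        if not features:
            continue  # isolated node without attributes: nothing to hash
        signature = minhash_signature(features, bands * rows)
        for b in range(bands):
            key = (b, signature[b * rows:(b + 1) * rows])
            buckets[key].add(node)
    # Only buckets with two or more nodes are candidates for summarization.
    return {frozenset(members) for members in buckets.values() if len(members) > 1}


# Toy graph: "a" and "b" share both neighbors and attributes.
adjacency = {
    "a": {"c", "d"}, "b": {"c", "d"},
    "c": {"a", "b"}, "d": {"a", "b"},
    "e": {"f"}, "f": {"e"},
}
attributes = {
    "a": {"city": "Seoul", "likes": "jazz"},
    "b": {"city": "Seoul", "likes": "jazz"},
    "c": {"city": "Busan"}, "d": {"city": "Daegu"},
    "e": {"city": "Lahore"}, "f": {"city": "Paris"},
}
groups = candidate_sets(adjacency, attributes)
```

Expensive similarity checks would then run only inside each group rather than over all node pairs. Using more rows per band makes bucketing stricter, while more bands makes it more permissive.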
Notes
Total number of minutes spent on Facebook each month: 640 Million. http://www.statisticbrain.com/facebook-statistics/. Last accessed on 03/07/2016
Acknowledgment
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2015R1A2A2A01008209).
Cite this article
Khan, K.U., Nawaz, W. & Lee, YK. Set-based unified approach for summarization of a multi-attributed graph. World Wide Web 20, 543–570 (2017). https://doi.org/10.1007/s11280-016-0388-y
DOI: https://doi.org/10.1007/s11280-016-0388-y