Abstract
What do the k-core structures of real-world graphs look like? What are the common patterns and the anomalies? How can we exploit them for applications? A k-core is the maximal subgraph in which all vertices have degree at least k. This concept has been applied to such diverse areas as hierarchical structure analysis, graph visualization, and graph clustering. Here, we explore pervasive patterns that are related to k-cores and emerge in graphs from diverse domains. Our discoveries are: (1) Mirror Pattern: coreness (i.e., the maximum k such that the vertex belongs to the k-core) is strongly correlated with degree. (2) Core-Triangle Pattern: degeneracy (i.e., the maximum k such that the k-core exists) obeys a 3-to-1 power law with respect to the count of triangles. (3) Structured Core Pattern: degeneracy–cores are not cliques but have non-trivial structures such as core–periphery and communities. Our algorithmic contributions show the usefulness of these patterns: (1) Core-A, which measures the deviation from Mirror Pattern, successfully spots anomalies in real-world graphs. (2) Core-D, a single-pass streaming algorithm based on Core-Triangle Pattern, accurately estimates degeneracy up to 12\(\times\) faster than its competitor. (3) Core-S, inspired by Structured Core Pattern, identifies influential spreaders up to 17\(\times\) faster than its competitors with comparable accuracy.
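To make the definitions of k-core, coreness, and degeneracy concrete, the following is a minimal Python sketch that repeatedly peels a minimum-degree vertex. It is an illustration only and not the implementation used in this paper: the function names `coreness` and `degeneracy` and the adjacency-set input format are assumptions, and the loop is a simple quadratic-time version rather than the bucket-based O(m) algorithm of Batagelj and Zaversnik (2003).

```python
def coreness(adj):
    """Return {vertex: coreness} for an undirected simple graph.

    `adj` maps each vertex to the set of its neighbors (hypothetical input format).
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel a vertex of minimum remaining degree (linear scan, for clarity).
        v = min(remaining, key=degree.__getitem__)
        # The running maximum of peeled degrees is the coreness of v.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def degeneracy(adj):
    """Degeneracy: the largest k for which the k-core is non-empty."""
    return max(coreness(adj).values(), default=0)


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(coreness(example))
print(degeneracy(example))
```

In this example the triangle vertices have coreness 2 and the pendant vertex has coreness 1, so the degeneracy is 2.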
Notes
This paper is an extended version of [49].
Spearman’s rank correlation coefficient \(\rho \) [52] is the standard (Pearson) correlation coefficient r of the ranks. Here, \(\rho \) is equivalent to r between the ranks of vertices in terms of degree and their ranks in terms of coreness. Using \(\rho \) is known to be more robust to outlying values than simply using r. We ignored isolated vertices when computing \(\rho \).
The fractional rank of an item is one plus the number of items greater than it plus half the number of other items equal to it.
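The two notes above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiments: the helper names `fractional_ranks`, `pearson`, and `spearman_rho` are assumptions, and ties are interpreted as "half the number of other items equal to it", which gives tied items the average of the ranks they span.

```python
from math import sqrt


def fractional_ranks(values):
    """Descending fractional ranks: 1 + (#items greater) + 0.5 * (#other equal items)."""
    sorted_desc = sorted(values, reverse=True)
    first_pos = {}  # value -> number of items strictly greater than it
    counts = {}     # value -> number of items equal to it (including itself)
    for pos, v in enumerate(sorted_desc):
        first_pos.setdefault(v, pos)
        counts[v] = counts.get(v, 0) + 1
    return [1 + first_pos[v] + 0.5 * (counts[v] - 1) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient r (undefined if either input is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r computed on the fractional ranks."""
    return pearson(fractional_ranks(xs), fractional_ranks(ys))


print(fractional_ranks([10, 20, 20, 5]))
```

For the Mirror Pattern, `xs` would hold the degrees and `ys` the corenesses of the non-isolated vertices, in the same vertex order.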
The strength of core–periphery structure is the correlation between the adjacency matrix of the measured graph and that of a graph with perfect core–periphery structure. See [10] for details.
The strength of community structure is the fraction of the edges within communities minus the fraction expected in a randomly connected graph. See [39] for details.
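As a rough illustration of the community-strength measure, the sketch below computes it under the assumption that it is the standard modularity, in which the random baseline preserves vertex degrees: each community contributes its fraction of internal edges minus the squared fraction of edge endpoints it holds. The function name `modularity` and the input format are hypothetical; see the cited reference for the exact definition used.

```python
def modularity(adj, community):
    """Fraction of intra-community edges minus its expectation in a random graph.

    `adj` maps each vertex to the set of its neighbors (undirected, no self-loops);
    `community` maps each vertex to a community label. Both formats are assumptions.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # twice the number of edges
    if two_m == 0:
        return 0.0
    intra = {}   # label -> number of edge endpoints whose edge stays inside the community
    degree = {}  # label -> total degree of the community's vertices
    for v, nbrs in adj.items():
        c = community[v]
        degree[c] = degree.get(c, 0) + len(nbrs)
        intra[c] = intra.get(c, 0) + sum(1 for u in nbrs if community[u] == c)
    return sum(intra[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


# Two triangles joined by the single edge (2, 3).
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(graph, labels))
```

Placing all vertices in a single community gives a value of zero, so positive values indicate more intra-community edges than the random baseline would produce.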
Isolated vertices are ignored when we compute Spearman’s rank correlation coefficient \(\rho \).
We used a machine with 2.67 GHz Intel Xeon E7-8837 CPUs and 1TB RAM.
References
Abello J, Resende MG, Sudarsky S (2002) Massive quasi-clique detection. In: Latin American symposium on theoretical informatics, Springer, pp 598–612
Akoglu L, McGlohon M, Faloutsos C (2010) Oddball: spotting anomalies in weighted graphs. In: Pacific–Asia conference on knowledge discovery and data mining, Springer, pp 410–421
Akoglu L, Tong H, Koutra D (2015) Graph based anomaly detection and description: a survey. Data Min Knowl Discov 29(3):626–688
Albert R, Jeong H, Barabási AL (1999) Internet: diameter of the world-wide web. Nature 401(6749):130–131
Alvarez-Hamelin JI, Dall’Asta L, Barrat A, Vespignani A (2006) Large scale networks fingerprinting and visualization using the \(k\)-core decomposition. Adv Neural Inf Process Syst 18:41
Alvarez-Hamelin JI, Dall’Asta L, Barrat A, Vespignani A (2008) \(K\)-core decomposition of Internet graphs: hierarchies, self-similarity and measurement biases. Netw Heterog Media 3:371
Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform 4(1):2
Batagelj V, Zaversnik M (2003) An O(m) algorithm for cores decomposition of networks. arXiv:cs/0310049
Beutel A, Xu W, Guruswami V, Palow C, Faloutsos C (2013) Copycatch: stopping group attacks by spotting lockstep behavior in social networks. In: Proceedings of the 22nd international conference on world wide web, ACM, pp 119–130
Borgatti SP, Everett MG (2000) Models of core/periphery structures. Soc Netw 21(4):375–395
Bron C, Kerbosch J (1973) Algorithm 457: finding all cliques of an undirected graph. Commun ACM 16(9):575–577
Brouwer AE, Haemers WH (2001) Spectra of graphs. Springer, Berlin
Charikar M (2000) Greedy approximation algorithms for finding dense components in a graph. In: International Workshop on approximation algorithms for combinatorial optimization, Springer, pp 84–95
Cheng J, Ke Y, Chu S, Özsu MT (2011) Efficient core decomposition in massive networks. In: 2011 IEEE 27th international conference on data engineering, IEEE, pp 51–62
Cohen J (2008) Trusses: cohesive subgraphs for social network analysis. In: National security agency technical report, p 16
Davis J, Goadrich M (2006) The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on machine learning, ACM, pp 233–240
De Stefani L, Epasto A, Riondato M, Upfal E (2016) TRIÈST: counting local and global triangles in fully-dynamic streams with fixed memory size. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 825–834
Erdős P (1963) On the structure of linear graphs. Israel J Math 1(3):156–160
Farach-Colton M, Tsai MT (2014) Computing the degeneracy of large graphs. In: Latin American symposium on theoretical informatics, Springer, pp 250–260
Freuder EC (1982) A sufficient condition for backtrack-free search. J ACM (JACM) 29(1):24–32
Gehrke J, Ginsparg P, Kleinberg J (2003) Overview of the 2003 KDD cup. ACM SIGKDD Explor Newslett 5(2):149–151
Giatsidis C, Malliaros F, Thilikos DM, Vazirgiannis M (2014) Corecluster: a degeneracy based graph clustering framework. In: Twenty-sixth annual conference on innovative applications of artificial intelligence, AAAI, pp 29–31
Hall BH, Jaffe AB, Trajtenberg M (2001) The NBER patent citation data file: lessons, insights and methodological tools. doi:10.3386/w8498
Hooi B, Song HA, Beutel A, Shah N, Shin K, Faloutsos C (2016a) Fraudar: bounding graph fraud in the face of camouflage. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 895–904
Hooi B, Song HA, Papalexakis E, Agrawal R, Faloutsos C (2016b) Matrices, compression, learning curves: formulation, and the GROUPNTEACH algorithms. In: Pacific–Asia conference on knowledge discovery and data mining, Springer, pp 376–387
Huang X, Lu W, Lakshmanan LV (2016) Truss decomposition of probabilistic graphs: semantics and algorithms. In: Proceedings of the 2016 ACM SIGMOD international conference on management of data, ACM, pp 77–90
Jiang M, Beutel A, Cui P, Hooi B, Yang S, Faloutsos C (2015) A general suspiciousness metric for dense blocks in multimodal data. In: 2015 IEEE international conference on data mining, IEEE, pp 781–786
Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 137–146
Kitsak M, Gallos LK, Havlin S, Liljeros F, Muchnik L, Stanley HE, Makse HA (2010) Identification of influential spreaders in complex networks. Nat Phys 6(11):888–893
Klimt B, Yang Y (2004) The enron corpus: a new dataset for email classification research. In: European conference on machine learning, Springer, pp 217–226
Kwak H, Lee C, Park H, Moon S (2010) What is twitter, a social network or a news media? In: Proceedings of the 19th international conference on world wide web, ACM, pp 591–600
Leskovec J, Chakrabarti D, Kleinberg J, Faloutsos C (2005) Realistic mathematically tractable graph generation and evolution, using kronecker multiplication. In: European conference on principles of data mining and knowledge discovery, Springer, pp 133–145
Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2009) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1):29–123
Lim Y, Kang U (2015) Mascot: memory-efficient and accurate sampling for counting local triangles in graph streams. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 685–694
Luce RD (1950) Connectivity and generalized cliques in sociometric group structure. Psychometrika 15(2):169–90
Macdonald B, Shakarian P, Howard N, Moores G (2012) Spreaders in the network sir model: an empirical study. arXiv preprint arXiv:1208.4269
Mislove A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on internet measurement, ACM, pp 29–42
Mokken RJ (1979) Cliques, clubs and clans. Qual Quant 13(2):161–173
Newman ME (2006) Modularity and community structure in networks. Proc Nat Acad Sci 103(23):8577–8582
Pandit S, Chau DH, Wang S, Faloutsos C (2007) Netprobe: a fast and scalable system for fraud detection in online auction networks. In: Proceedings of the 16th international conference on world wide web, ACM, pp 201–210
Prakash BA, Sridharan A, Seshadri M, Machiraju S, Faloutsos C (2010) Eigenspokes: surprising patterns and scalable community chipping in large graphs. In: Pacific–Asia conference on knowledge discovery and data mining, Springer, pp 435–448
Rossi MEG, Malliaros FD, Vazirgiannis M (2015) Spread it good, spread it fast: identification of influential nodes in social networks. In: Proceedings of the 24th international conference on world wide web (companion volume), ACM, pp 101–102
Sarıyüce AE, Gedik B, Jacques-Silva G, Wu KL, Çatalyürek ÜV (2013) Streaming algorithms for \(k\)-core decomposition. Proc VLDB Endow 6(6):433–444
Sarıyüce AE, Seshadhri C, Pinar A, Çatalyürek ÜV (2015) Finding the hierarchy of dense subgraphs using nucleus decompositions. In: Proceedings of the 24th international conference on world wide web, ACM, pp 927–937
Schank T (2007) Algorithmic aspects of triangle-based network analysis. Ph.D. thesis, Universität Karlsruhe (TH), Fakultät für Informatik
Seidman SB, Foster BL (1978) A graph theoretic generalization of the clique concept. J Math Sociol 6(1):139–154
Seidman SB (1983) Network structure and minimum degree. Soc Netw 5(3):269–287
Shin K, Hooi B, Faloutsos C (2016a) M-zoom: fast dense-block detection in tensors with quality guarantees. In: Joint European conference on machine learning and knowledge discovery in databases, Springer, pp 264–280
Shin K, Eliassi-Rad T, Faloutsos C (2016b) Corescope: graph mining using \(k\)-core analysis—patterns, anomalies and algorithms. In: 2016 16th IEEE international conference on data mining, IEEE, pp 469–478
Shin K, Hooi B, Jisu K, Faloutsos C (2017a) D-cube: dense-block detection in terabyte-scale tensors. In: Proceedings of the Tenth ACM international conference on web search and data mining, ACM, pp 681–690
Shin K, Hooi B, Jisu K, Faloutsos C (2017b) Densealert: incremental dense-subtensor detection in tensor streams. arXiv preprint arXiv:1706.03374
Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101
Tsourakakis CE (2008) Fast counting of triangles in large real networks without counting: algorithms and laws. In: 2008 eighth IEEE international conference on data mining, IEEE, pp 608–617
Tsourakakis CE, Kang U, Miller GL, Faloutsos C (2009) Doulion: counting triangles in massive graphs with a coin. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 837–846
Van Loan CF (2000) The ubiquitous kronecker product. J Comput Appl Math 123(1):85–100
Wang J, Cheng J (2012) Truss decomposition in massive networks. Proc VLDB Endow 5(9):812–823
Wuchty S, Almaas E (2005) Peeling the yeast protein network. Proteomics 5(2):444–449
Zhang S, Zhou D, Yildirim MY, Alcorn S, He J, Davulcu H, Tong H (2017) HiDDen: hierarchical dense subgraph detection with application to financial fraud detection. In: Proceedings of the 2017 SIAM international conference on data mining, SIAM, pp 570–578
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant Nos. CNS-1314632 and IIS-1408924. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. Kijung Shin was supported by KFAS Scholarship. Tina Eliassi-Rad was supported by NSF CNS-1314603 and by DTRA HDTRA1-10-1-0120. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.
Appendices
Appendix A: Interpreting sparsity patterns
We explain sparsity patterns and how to interpret them. The sparsity pattern of a graph is a plot with the axes representing the rows and columns of the adjacency matrix. For each nonzero entry (i.e., edge in the graph), a point is plotted, thus displaying sparsity patterns in the adjacency matrix.
Figure 19a shows the sparsity pattern of the degeneracy–core of the Caida dataset. The rows in the plot indicate vertices, and they are divided into two ranges, which correspond to the core and the periphery. The vertices in the core are densely connected with each other, as seen in region A in Fig. 19b. The vertices in the periphery are well connected to the vertices in the core (regions B and C) but rarely connected to each other (region D). The vertices in the core are further divided into three communities, each of which corresponds to a range of the columns in Fig. 19a. The vertices in the same community are particularly well connected to each other, as seen in regions A1, A2, and A3 in Fig. 19c, which correspond to the sparsity patterns of the communities.
Appendix B: Core-D with a small number of samples
Figure 20 presents the accuracy of Core-D with different sample sizes in the two largest datasets. Even when the number of samples was smaller than the number of vertices, Core-D, especially with Overall Model, estimated degeneracy accurately and reliably. Thus, Core-D remains effective even when the available memory space is less than n, the number of vertices.
Appendix C: Measuring influence using SIR model simulation
To evaluate influence as a spreader, we simulate spreading processes using the SIR model [29], a widely used epidemic model. Initially, a vertex chosen as the seed is in the infectious state (I-state), while the others are in the susceptible state (S-state). Each vertex in the I-state infects each of its neighbors in the S-state with probability \(\beta \) (the infection rate) and then enters the recovered state (R-state). This is repeated until no vertex is in the I-state. The influence of a seed, the initially infected vertex, is quantified by the number of vertices infected at any time during the process. To reduce random effects, we repeat the whole process 100 times and use the average number of infected vertices as the measure of influence. \(\beta \) is set close to the epidemic threshold \(\lambda _{1}^{-1}\), as in previous work [42].
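The simulation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the function name `sir_influence` and its parameters are hypothetical, infections are applied in synchronous rounds, and the seed itself is counted among the infected vertices.

```python
import random


def sir_influence(adj, seed, beta, runs=100, rng=None):
    """Average number of vertices ever infected when the process starts from `seed`.

    `adj` maps each vertex to the set of its neighbors (assumed input format).
    """
    rng = rng or random.Random()
    total = 0
    for _ in range(runs):
        infectious = {seed}
        recovered = set()
        while infectious:
            newly_infected = set()
            for v in infectious:
                for u in adj[v]:
                    susceptible = (
                        u not in infectious
                        and u not in recovered
                        and u not in newly_infected
                    )
                    # Each infectious vertex tries to infect each susceptible neighbor once.
                    if susceptible and rng.random() < beta:
                        newly_infected.add(u)
            # After their infection attempts, infectious vertices recover.
            recovered |= infectious
            infectious = newly_infected
        total += len(recovered)  # every vertex that was infected at some point
    return total / runs


# Path graph 0 - 1 - 2 with an assumed infection rate of 0.5.
path = {0: {1}, 1: {0, 2}, 2: {1}}
print(sir_influence(path, seed=0, beta=0.5, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the runs reproducible; the default of 100 runs mirrors the number of repetitions stated above.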
Cite this article
Shin, K., Eliassi-Rad, T. & Faloutsos, C. Patterns and anomalies in k-cores of real-world graphs with applications. Knowl Inf Syst 54, 677–710 (2018). https://doi.org/10.1007/s10115-017-1077-6
DOI: https://doi.org/10.1007/s10115-017-1077-6