ABSTRACT
The steady growth of graph data in various applications has resulted in widespread research on finding significant sub-structures in a graph. In this paper, we address the problem of finding statistically significant connected subgraphs in graphs whose nodes are labeled. The labels may be either discrete, where they assume values from a pre-defined set, or continuous, where they assume values from a real domain and can be multi-dimensional. We motivate the problem by citing applications in spatial co-location rule mining and outlier detection. We use the chi-square statistic to quantify statistical significance. Since the number of connected subgraphs in a general graph is exponential, the naive algorithm that enumerates all of them is impractical. We introduce the notion of contracting edges, which merges vertices together to form a super-graph. We show that if the graph is dense enough to start with, the number of super-vertices is quite low, and therefore running the naive algorithm on the super-graph is feasible. If the graph is not dense, we provide an algorithm that reduces the number of super-vertices further, thereby providing a trade-off between accuracy and time. Empirically, the chi-square value obtained by this reduction is always within 96% of the optimal value, while the time spent is only a fraction of that needed to compute the optimal. We also show that our algorithm is scalable and that it significantly enhances the ability to analyze real datasets.
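To make the naive baseline concrete, the following is a minimal sketch for the discrete-label case, not the paper's implementation. It assumes the standard Pearson chi-square statistic, X^2 = sum over labels of (O_i - E_i)^2 / E_i with E_i = n * p_i, and a known null probability p_i for every label. The function names, the toy graph, and the label probabilities are all hypothetical and chosen only for illustration. The edge-contraction step is not shown, because the abstract does not state the rule for choosing which edges to contract.

```python
from collections import Counter
from itertools import combinations


def chi_square(labels, probs):
    """Pearson chi-square of observed label counts against expected counts n * p."""
    n = len(labels)
    observed = Counter(labels)
    return sum((observed.get(l, 0) - n * p) ** 2 / (n * p) for l, p in probs.items())


def is_connected(nodes, adj):
    """Depth-first search restricted to `nodes`; True if they induce a connected subgraph."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def best_connected_subgraph(adj, label, probs):
    """Naive search: score every connected vertex subset (exponential in graph size)."""
    best = (0.0, ())
    vertices = sorted(adj)
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            if is_connected(subset, adj):
                score = chi_square([label[v] for v in subset], probs)
                if score > best[0]:
                    best = (score, subset)
    return best


# Hypothetical toy input: a path 0 - 1 - 2 - 3 with two equally likely labels.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
label = {0: "a", 1: "b", 2: "b", 3: "a"}
probs = {"a": 0.5, "b": 0.5}

print(best_connected_subgraph(adj, label, probs))  # (2.0, (1, 2))
```

On this toy input the vertex pair {0, 3} would score just as high as {1, 2}, but it is rejected because it is not connected. The loop over all subsets is what makes the naive approach infeasible on large graphs, which is the cost that running it on a smaller super-graph is meant to reduce.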
Index Terms
- Mining statistically significant connected subgraphs in vertex labeled graphs