research-article

Mining statistically significant connected subgraphs in vertex labeled graphs

Authors:
Akhil Arora
Indian Institute of Technology, Kanpur, Kanpur, India

Mayank Sachan
Indian Institute of Technology, Kanpur, Kanpur, India

Arnab Bhattacharya
Indian Institute of Technology, Kanpur, Kanpur, India

SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, June 2014, Pages 1003–1014, https://doi.org/10.1145/2588555.2588574

Published: 18 June 2014

ABSTRACT

The steady growth of graph data in various applications has resulted in widespread research in finding significant sub-structures in a graph. In this paper, we address the problem of finding statistically significant connected subgraphs where the nodes of the graph are labeled. The labels may be either discrete, where they assume values from a pre-defined set, or continuous, where they assume values from a real domain and can be multi-dimensional. We motivate the problem citing applications in spatial co-location rule mining and outlier detection. We use the chi-square statistic as a measure for quantifying the statistical significance. Since the number of connected subgraphs in a general graph is exponential, the naive algorithm is impractical. We introduce the notion of contracting edges that merge vertices together to form a super-graph. We show that if the graph is dense enough to start with, the number of super-vertices is quite low, and therefore, running the naive algorithm on the super-graph is feasible. If the graph is not dense, we provide an algorithm to reduce the number of super-vertices further, thereby providing a trade-off between accuracy and time. Empirically, the chi-square value obtained by this reduction is always within 96% of the optimal value, while the time spent is only a fraction of that for the optimal. We also show that our algorithm is scalable and that it significantly enhances the ability to analyze real datasets.
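The problem setting in the abstract can be illustrated with a minimal sketch for discrete labels. The abstract does not state the exact formula, so this sketch assumes the standard Pearson chi-square statistic, with expected counts taken as n times an assumed prior probability for each label. All function names, the toy graph, and the label probabilities are hypothetical and are not taken from the paper. The search shown is the exhaustive (naive) enumeration of connected vertex subsets, not the authors' edge-contraction method.

```python
from collections import Counter
from itertools import combinations


def chi_square(labels, probs):
    """Pearson chi-square of observed label counts against expected counts n * p."""
    n = len(labels)
    observed = Counter(labels)
    return sum((observed.get(lab, 0) - n * p) ** 2 / (n * p)
               for lab, p in probs.items())


def is_connected(vertices, adj):
    """Check that `vertices` induce a connected subgraph (depth-first search)."""
    vertices = set(vertices)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in vertices and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == vertices


def best_connected_subgraph(adj, label, probs):
    """Naive search: score every connected vertex subset and keep the highest."""
    best_score, best_set = -1.0, None
    nodes = sorted(adj)
    for k in range(1, len(nodes) + 1):
        for subset in combinations(nodes, k):
            if is_connected(subset, adj):
                score = chi_square([label[v] for v in subset], probs)
                if score > best_score:
                    best_score, best_set = score, set(subset)
    return best_score, best_set


# Hypothetical toy input: path graph 0-1-2-3-4 where the rare label "b"
# (assumed probability 0.2) occupies the three middle vertices.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
label = {0: "a", 1: "b", 2: "b", 3: "b", 4: "a"}
probs = {"a": 0.8, "b": 0.2}
score, vertices = best_connected_subgraph(adj, label, probs)
print(score, vertices)
```

On this toy input the run of three rare labels is the highest-scoring connected subgraph. Because the loop visits every vertex subset, the cost grows exponentially with the number of vertices, which is the impracticality the abstract refers to and the reason a smaller super-graph makes the search feasible.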

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published in
SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
June 2014
1645 pages
ISBN: 9781450323765
DOI: 10.1145/2588555
General Chairs: Curtis Dyreson (Utah State University, USA), Feifei Li (University of Utah, USA)
Program Chair: M. Tamer Özsu (University of Waterloo, Canada)
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
chi-square
connected subgraphs
graph mining
significant subgraphs
statistical significance
vertex labels
Qualifiers
- research-article
Acceptance Rates
- SIGMOD '14 Paper Acceptance Rate: 107 of 421 submissions, 25%
- Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 11
- Total Downloads: 852
- Downloads (Last 12 months): 11
- Downloads (Last 6 weeks): 0
