
Self-supervised end-to-end graph local clustering

Published in: World Wide Web

Abstract

Graph clustering is a central and fundamental problem in numerous graph mining applications, especially in spatial-temporal systems. Graph local clustering aims to find a set of nodes (a cluster) that contains a given seed node and has high internal density. A series of works have addressed this problem by carefully designing the quality metric and improving the efficiency-effectiveness trade-off; however, they cannot provide a satisfying guarantee on clustering quality. In this paper, we investigate the graph local clustering task and propose an end-to-end framework, LearnedNibble, to address this limitation. In particular, we propose several techniques, including a practical self-supervised training scheme with a differentiable soft-mean-sweep operator, an effective optimization method with a regradient technique, and a scalable inference scheme based on the Approximate Graph Propagation (AGP) paradigm and a search-selective method. To the best of our knowledge, LearnedNibble is the first attempt to take responsibility for cluster quality and to consider both effectiveness and efficiency in an end-to-end, self-supervised paradigm. Extensive experiments on real-world datasets demonstrate the clustering capacity, generalization ability, and approximation compatibility of our LearnedNibble framework.


Data availability

The graph datasets that support the findings of this study are available in SNAP project, https://snap.stanford.edu/data/index.html.

References

  1. Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proc. Nat. Acad. Sci. 99(12), 7821–7826 (2002)


  2. Wasserman, S., Faust, K., et al.: Social network analysis: Methods and applications (1994)

  3. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.-U.: Complex networks: Structure and dynamics. Phys. Rep. 424(4–5), 175–308 (2006). https://doi.org/10.1016/j.physrep.2005.10.009


  4. Lu, Z., Wahlström, J., Nehorai, A.: Community detection in complex networks via clique conductance. Sci. Rep. 8(1), 1–16 (2018)


  5. Wang, M., Wang, C., Yu, J.X., Zhang, J.: Community detection in social networks: an in-depth benchmarking study with a procedure-oriented framework. Proc. VLDB Endow. 8(10), 998–1009 (2015)


  6. Fortunato, S.: Community detection in graphs. Phys. Rep. 486 (3–5), 75–174 (2010)


  7. Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th International Conference on World Wide Web, pp. 631–640 (2010)

  8. Yi, F., Moon, I.: Image segmentation: A survey of graph-cut methods. In: 2012 International Conference on Systems and Informatics (ICSAI2012), pp. 1936–1941. IEEE (2012)

  9. Vicente, S., Kolmogorov, V., Rother, C.: Graph cut based image segmentation with connectivity priors. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)

  10. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vis. 59(2), 167–181 (2004)


  11. Tolliver, D.A., Miller, G.L.: Graph partitioning by spectral rounding: Applications in image segmentation and clustering. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 1, pp. 1053–1060. IEEE (2006)

  12. Liao, C. -S., Lu, K., Baym, M., Singh, R., Berger, B.: IsoRankN: Spectral methods for global alignment of multiple protein networks. Bioinformatics 25(12), 253–258 (2009)


  13. Voevodski, K., Teng, S. -H., Xia, Y.: Finding local communities in protein networks. BMC Bioinform. 10(1), 1–14 (2009)


  14. Zhou, S., Yang, X., Chang, Q.: Spatial clustering analysis of green economy based on knowledge graph. Journal of Intelligent & Fuzzy Systems (Preprint), 1–10 (2021)

  15. Foysal, K.H., Chang, H.J., Bruess, F., Chong, J.W.: Smartfit: Smartphone application for garment fit detection. Electronics 10(1), 97 (2021)


  16. Zhu, D., Shen, G., Chen, J., Zhou, W., Kong, X.: A higher-order motif-based spatiotemporal graph imputation approach for transportation networks. Wirel. Commun. Mob. Comput., 2022 (2022)

  17. Spielman, D.A., Teng, S. -H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, pp. 81–90 (2004)

  18. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using pagerank vectors. In: 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pp. 475–486. IEEE (2006)

  19. Andersen, R., Peres, Y.: Finding sparse cuts locally using evolving sets. In: Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, pp. 235–244 (2009)

  20. Spielman, D.A., Teng, S. -H.: A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM J. Comput. 42(1), 1–26 (2013)


  21. Lovász, L., Simonovits, M.: The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In: Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science, pp. 346–354. IEEE (1990)

  22. Lovász, L., Simonovits, M.: Random walks in a convex body and an improved volume algorithm. Random Struct. Algor. 4(4), 359–412 (1993)


  23. Andersen, R., Chung, F.: Detecting sharp drops in pagerank and a simplified local partitioning algorithm. In: International Conference on Theory and Applications of Models of Computation, pp. 1–12. Springer (2007)

  24. Chung, F.: The heat kernel as the pagerank of a graph. Proc. Natl. Acad. Sci. 104(50), 19735–19740 (2007)


  25. Kloster, K., Gleich, D.F.: Heat kernel based community detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386–1395 (2014)

  26. Li, P., Chien, I., Milenkovic, O.: Optimizing generalized pagerank methods for seed-expansion community detection. Adv. Neural Inf. Process. Syst., 32 (2019)

  27. Wang, H., He, M., Wei, Z., Wang, S., Yuan, Y., Du, X., Wen, J.-R.: Approximate graph propagation. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 1686–1696 (2021)

  28. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the Web. Stanford InfoLab, Technical report (1999)


  29. Chung, F., Simpson, O.: Solving linear systems with boundary conditions using heat kernel pagerank. In: International Workshop on Algorithms and Models for the Web-Graph, pp. 203–219. Springer (2013)

  30. Yang, R., Xiao, X., Wei, Z., Bhowmick, S.S., Zhao, J., Li, R. -H.: Efficient estimation of heat kernel pagerank for local clustering. In: Proceedings of the 2019 International Conference on Management of Data, pp. 1339–1356 (2019)

  31. Flake, G.W., Lawrence, S., Giles, C.L.: Efficient identification of web communities. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150–160 (2000)

  32. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)


  33. Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and identifying communities in networks. Proc. Nat. Acad. Sci. 101(9), 2658–2663 (2004)


  34. Newman, M.E.: Modularity and community structure in networks. Proc. Nat. Acad. Sci. 103(23), 8577–8582 (2006)


  35. Kobourov, S.G., Pupyrev, S., Simonetto, P.: Visualizing graphs as maps with contiguous regions. In: EuroVis (Short Papers) (2014)

  36. Cheeger, J.: A lower bound for the smallest eigenvalue of the Laplacian. Probl. Anal. 625(195-199), 110 (1970)


  37. Cox, I.J., Rao, S.B., Zhong, Y.: “ratio regions”: a technique for image segmentation. In: Proceedings of 13th International Conference on Pattern Recognition, vol. 2, pp. 557–564. IEEE (1996)

  38. Sharon, E., Galun, M., Sharon, D., Basri, R., Brandt, A.: Hierarchy and adaptivity in segmenting visual scenes. Nature 442(7104), 810–813 (2006)


  39. Yang, J., Leskovec, J.: Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. 42(1), 181–213 (2015)


  40. Benson, A.R., Gleich, D.F., Leskovec, J.: Higher-order organization of complex networks. Science 353(6295), 163–166 (2016)


  41. Tsourakakis, C.E., Pachocki, J., Mitzenmacher, M.: Scalable motif-aware graph clustering. In: Proceedings of the 26th International Conference on World Wide Web, pp. 1451–1460 (2017)

  42. Yin, H., Benson, A.R., Leskovec, J., Gleich, D.F.: Local higher-order graph clustering. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 555–564 (2017)

  43. Ma, W., Cai, L., He, T., Chen, L., Cao, Z., Li, R.: Local expansion and optimization for higher-order graph clustering. IEEE Internet Things J. 6(5), 8702–8713 (2019)


  44. Huang, S., Li, Y., Bao, Z., Li, Z.: Towards efficient motif-based graph partitioning: An adaptive sampling approach. In: 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 528–539. IEEE (2021)

  45. Zhou, D., Zhang, S., Yildirim, M.Y., Alcorn, S., Tong, H., Davulcu, H., He, J.: High-order structure exploration on massive graphs: A local graph clustering perspective. ACM Trans. Knowl. Discov. Data (TKDD) 15(2), 1–26 (2021)


  46. Chhabra, A., Faraj, M.F., Schulz, C.: Local motif clustering via (hyper) graph partitioning. arXiv:2205.06176 (2022)

  47. Emmons, S., Kobourov, S., Gallant, M., Börner, K.: Analysis of network clustering algorithms and cluster quality metrics at scale. PLoS ONE 11(7), e0159161 (2016)


  48. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)


  49. Meilă, M.: Comparing clusterings—an information based distance. J. Multivar. Anal. 98(5), 873–895 (2007)


  50. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010)


  51. Avron, H., Horesh, L.: Community detection using time-dependent personalized pagerank. In: International Conference on Machine Learning, pp. 1795–1803. PMLR (2015)

  52. Kloumann, I.M., Ugander, J., Kleinberg, J.: Block models and personalized pagerank. Proc. Natl. Acad. Sci. 114(1), 33–38 (2017)


  53. Li, Y., Liu, J., Lin, G., Hou, Y., Mou, M., Zhang, J.: Gumbel-softmax-based optimization: a simple general framework for optimization problems on graphs. Comput. Soc. Netw. 8(1), 1–16 (2021)


  54. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: First steps. Soc. Netw. 5(2), 109–137 (1983)


  55. Weiss, P.: L’hypothèse du champ moléculaire et la propriété ferromagnétique. J. Phys. Theor. Appl. 6(1), 661–690 (1907)


  56. Klicpera, J., Weißenberger, S., Günnemann, S.: Diffusion improves graph learning. Adv. Neural Inf. Process. Syst., 32 (2019)

  57. Berberidis, D., Nikolakopoulos, A.N., Giannakis, G.B.: Adaptive diffusions for scalable learning over graphs. IEEE Trans. Signal Process. 67(5), 1307–1321 (2018)


  58. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)

  59. Leskovec, J., Sosič, R.: SNAP: A general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. (TIST) 8(1), 1–20 (2016)


  60. Getoor, L.: Link-based classification. In: Advanced Methods for Knowledge Discovery from Complex Data, pp. 189–207. Springer (2005)

  61. Namata, G., London, B., Getoor, L., Huang, B., EDU, U.: Query-driven active surveying for collective classification. In: 10th International Workshop on Mining and Learning with Graphs, vol. 8, p. 1 (2012)


Acknowledgements

The author would like to thank Wang Hanzhi and Zhang Ruoqi for their selfless and solid technical support. This work is partially supported by the Fundamental Research Funds for the Central Universities (No. 2020JS005).

Funding

This work is partially supported by the Fundamental Research Funds for the Central Universities (No.2020JS005).

Author information


Contributions

Zhe Yuan devised the methods and framework, wrote the whole manuscript, and prepared all materials.

Corresponding author

Correspondence to Zhe Yuan.

Ethics declarations

Human and Animal Ethics

Not applicable.

Ethics approval and consent to participate

Not applicable.

Consent for Publication

Not applicable.

Competing interests

The author has no relevant financial or non-financial interests to disclose.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: additional experiments

Data sources

We obtain the DBLP and Amazon datasets from the Stanford Network Analysis Project (SNAP) [59], and the rest from their original works [60, 61]. We present the basic statistics of the datasets used in our experiments in Table 4, and show the conductances of the ground-truth clusters in Figure 2. The conductances of the labeled clusters are rather large, which makes the information-based metrics conflict with the structure-based metrics, as noted below.

Table 4 Statistics of graph datasets
Figure 2

Conductance of the ground-truth clusters
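The conductance referred to here is the usual cut-to-volume ratio. As a hedged sketch (the adjacency-dict format and all names are illustrative, not the paper's implementation), it can be computed for a candidate cluster as:

```python
# Sketch: conductance of a node set S in an undirected graph,
# phi(S) = cut(S, V \ S) / min(vol(S), vol(V \ S)).

def conductance(adj, S):
    S = set(S)
    # Edges with exactly one endpoint in S.
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    # Volume = sum of degrees.
    vol_S = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom > 0 else 1.0

# Toy graph: two triangles joined by a single edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(round(conductance(adj, {0, 1, 2}), 4))  # 0.1429 (1 cut edge / volume 7)
```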

Competitor considerations

Since the effectiveness challenge has been little studied and little prior work targets the conductance metric as we do, there is no specific algorithm that serves as a direct competitor to our LearnedNibble. Moreover, the work presented here does not aim to beat any baseline; rather, it reveals the capacity of the GPR measure family and explores how to realize it while remaining compatible with the mainstream approximate algorithms.

Comparisons

For the GPR instances, we evaluate them by grid-searching their parameters with 2,000 trials each, which is also the training budget for LearnedNibble, and take the best performance as their clustering capacity. Specifically, we search α over [0, 1] with step 0.0005 for PPR, h over [1, 20] with step 0.01 for HKPR, and 𝜃 over [0, 1] with step 0.005 for IPR, varying the power of 𝜃 that determines ϕ over {1, 5, 20, 50, 100}. For MEAN, we directly compute its exact conductance with the standard sweep operation. For GSO, we use a training budget of 200,000 trials since it has far more parameters to train.
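As an illustration of this kind of baseline evaluation, a minimal PPR-plus-sweep grid search might look as follows. This is a hedged sketch under stated assumptions: the power-iteration PPR, the adjacency-dict format, and all names are illustrative, not the paper's code.

```python
import numpy as np

def ppr(adj, seed, alpha, iters=200):
    """Personalized PageRank via power iteration: pi = alpha*e_s + (1-alpha) * A D^{-1} pi."""
    n = len(adj)
    e = np.zeros(n); e[seed] = 1.0
    pi = e.copy()
    for _ in range(iters):
        push = np.zeros(n)
        for u, nbrs in adj.items():
            for v in nbrs:
                push[v] += pi[u] / len(nbrs)
        pi = alpha * e + (1 - alpha) * push
    return pi

def sweep(adj, pi):
    """Standard sweep cut: order nodes by degree-normalized score and
    return the prefix set with the smallest conductance."""
    deg = {u: len(adj[u]) for u in adj}
    order = sorted(adj, key=lambda u: -pi[u] / deg[u])
    vol_total = sum(deg.values())
    S, vol_S, cut = set(), 0, 0
    best, best_phi = set(), 1.0
    for u in order[:-1]:  # skip the full set (empty complement)
        S.add(u)
        vol_S += deg[u]
        # Adding u turns its edges into S internal (-1) and exposes the rest (+1).
        cut += sum(-1 if v in S else 1 for v in adj[u])
        phi = cut / min(vol_S, vol_total - vol_S)
        if phi <= best_phi:
            best, best_phi = set(S), phi
    return best, best_phi

def grid_search(adj, seed, alphas):
    """Keep the alpha whose sweep cut attains the lowest conductance."""
    return min(((a,) + sweep(adj, ppr(adj, seed, a)) for a in alphas),
               key=lambda t: t[2])

# Toy graph: two triangles joined by one edge; seed in the left triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
alpha, cluster, phi = grid_search(adj, 0, [0.05, 0.1, 0.2])
print(cluster, round(phi, 4))  # the left triangle, conductance 1/7
```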

A.1 Training details

We give LearnedNibble full access to the graph adjacency matrix in the training phase but keep the algorithm local in the inference phase, like other computation-based graph local clustering algorithms. The reason the algorithm is not thoroughly local is twofold. 1) First, we should use the whole graph during training, since the topology is an integrated whole and should not be sampled like data points in Euclidean space. 2) Second, we want the framework to generalize well to the whole graph, which is the crucial property we rely on to develop the scalability and practicality of LearnedNibble; making the training local would conflict with this purpose.

For the trainable weighting parameters, we normalize the weight vector w to unit one-norm, ||w||₁ = 1, in the inference phase, but leave it unconstrained in the training phase for the sake of numerical stability.
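Concretely, the inference-time rescaling can be sketched as follows (a hedged illustration; the weight values and the truncation length are made up for the example):

```python
import numpy as np

# Hypothetical trainable GPR weights over 4 propagation steps.
# Training phase: w is left unconstrained for numerical stability.
w = np.array([0.5, 1.5, 3.0, 1.0])

# Inference phase: rescale so that ||w||_1 = 1 before combining the
# propagated signals sum_k w[k] * (P^k x).
w_inf = w / np.abs(w).sum()
print(w_inf)  # sums to 1 when all weights are non-negative
```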

A.2 Clustering capacity details

Comparisons

We report the average conductance over the 5 training seed nodes with the final model on each dataset in Table 2. The first 4 columns are the GPR family instances and the trivial MEAN pooling operation. The GSO column represents the GSO framework [53]. The last column, titled GPR, is our LearnedNibble framework.

Results with approximation in detail

We report the results for each dataset in turn and list them in Table 5.

Table 5 Comparisons with approximations

A.3 Generalization ability details

Comparisons

For a clearer view, we report the generalization ability of our LearnedNibble framework and its competitors in two settings. 1) In-Cluster: we run inference on a node randomly selected from the same cluster as the training seed nodes, represented by the c columns in Table 3. 2) In-Graph: we run inference on a node randomly selected from the whole graph, represented by the g columns in Table 3. We report the average conductance over the 50 testing nodes with the final model on each dataset.

Results with approximation

We report the results for each dataset in turn, under both the in-cluster and in-graph settings not shown in Section 4, in Figure 3.

Figure 3

Generalization ability with approximation. (a) DBLP (b) Amazon (c) PubMed (d) PubMed* (e) CiteSeer (f) Cora*

A.4 Parameter sensitivity

Initialization comparisons

We test the sensitivity to initialization by training our LearnedNibble framework from different starting weights. Specifically, we use the PPR weighting vector with teleport constant α = 0.1 to challenge our model, and the IPR weighting vector with 𝜃 = 0.99, ϕ = 0.99^10 for the IPR test. The comparison results on each dataset are listed in Table 6. Training with the different initialization methods achieves similar but slightly different performance: the trivial MEAN and RAW initializations perform a little better, and the theoretically advantaged IPR also performs well in some cases.
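For reference, the standard GPR weight vectors behind the PPR and HKPR initializations can be generated as below. This is a sketch: the truncation length K is an assumption, and the paper's exact initialization vectors may differ.

```python
import numpy as np
from math import exp, factorial

K = 10       # truncation length of the propagation (illustrative)
alpha = 0.1  # PPR teleport constant used in the sensitivity test

# PPR weights: w_k = alpha * (1 - alpha)^k (geometric decay).
w_ppr = np.array([alpha * (1 - alpha) ** k for k in range(K)])

# HKPR weights: w_k = e^{-h} * h^k / k! (Poisson decay).
h = 5.0
w_hk = np.array([exp(-h) * h ** k / factorial(k) for k in range(K)])

print(w_ppr[:3])  # [0.1, 0.09, 0.081]
```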

Table 6 Initialization sensitivity

Regradient and locality regularization

We investigate the regradient technique proposed in this work through ablation experiments. At the same time, we test the popular locality regularization term used in Graph Neural Networks (GNNs), which keeps the information diffusion local by minimizing the 2-norm of the difference between the propagated graph signal and the initial signal (a one-hot vector in our setting), i.e., ||gpr − 1ₛ||₂. The results of both, under the exact setting with 𝜖 = 0, are presented in Table 7. The settings with R = 1 outperform their R = 0 counterparts, and the setting R = 1, L = 0, i.e., with the regradient technique and without the commonly used locality regularization, achieves the best performance in all situations.
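The locality regularization term can be sketched as follows (a hedged illustration; the function name and NumPy formulation are ours, not the paper's):

```python
import numpy as np

def locality_penalty(gpr, seed):
    """L2 distance between the propagated signal and the one-hot
    seed indicator: || gpr - 1_s ||_2."""
    e_s = np.zeros_like(gpr)
    e_s[seed] = 1.0
    return np.linalg.norm(gpr - e_s)

# A signal fully concentrated on the seed has zero penalty;
# mass that has diffused away from the seed is penalized.
p = np.array([1.0, 0.0, 0.0, 0.0])
print(locality_penalty(p, 0))  # 0.0
```

In the ablation, L = 0 corresponds to dropping this term from the training objective.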

Table 7 Ablation experiments


About this article


Cite this article

Yuan, Z. Self-supervised end-to-end graph local clustering. World Wide Web 26, 1157–1179 (2023). https://doi.org/10.1007/s11280-022-01081-8


  • DOI: https://doi.org/10.1007/s11280-022-01081-8
