Abstract
Graph clustering is a central and fundamental problem in numerous graph mining applications, especially in spatial-temporal systems. The goal of graph local clustering is to find a set of nodes (a cluster) that contains a given seed node and has high internal density. A series of works have addressed this problem by carefully designing the quality metric and improving the efficiency-effectiveness trade-off. However, they are unable to provide a satisfying guarantee on clustering quality. In this paper, we investigate the graph local clustering task and propose an end-to-end framework, LearnedNibble, to address the aforementioned limitation. In particular, we propose several techniques, including a practical self-supervised supervision scheme with a differentiable soft-mean-sweep operator, an effective optimization method with the regradient technique, and a scalable inference scheme based on the Approximate Graph Propagation (AGP) paradigm and a search-selective method. To the best of our knowledge, LearnedNibble is the first attempt to take responsibility for cluster quality and to consider both effectiveness and efficiency in an end-to-end, self-supervised paradigm. Extensive experiments on real-world datasets demonstrate the clustering capacity, generalization ability, and approximation compatibility of our LearnedNibble framework.

Data availability
The graph datasets that support the findings of this study are available in SNAP project, https://snap.stanford.edu/data/index.html.
References
Girvan, M., Newman, M.E.: Community structure in social and biological networks. Proc. Nat. Acad. Sci. 99(12), 7821–7826 (2002)
Wasserman, S., Faust, K., et al.: Social network analysis: Methods and applications (1994)
Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.-U.: Complex networks: Structure and dynamics. Phys. Rep. 424(4–5), 175–308 (2006). https://doi.org/10.1016/j.physrep.2005.10.009
Lu, Z., Wahlström, J., Nehorai, A.: Community detection in complex networks via clique conductance. Sci. Rep. 8(1), 1–16 (2018)
Wang, M., Wang, C., Yu, J.X., Zhang, J.: Community detection in social networks: an in-depth benchmarking study with a procedure-oriented framework. Proc. VLDB Endow. 8(10), 998–1009 (2015)
Fortunato, S.: Community detection in graphs. Phys. Rep. 486 (3–5), 75–174 (2010)
Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th International Conference on World Wide Web, pp. 631–640 (2010)
Yi, F., Moon, I.: Image segmentation: A survey of graph-cut methods. In: 2012 International Conference on Systems and Informatics (ICSAI2012), pp. 1936–1941. IEEE (2012)
Vicente, S., Kolmogorov, V., Rother, C.: Graph cut based image segmentation with connectivity priors. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. Int. J. Comput. Vis. 59(2), 167–181 (2004)
Tolliver, D.A., Miller, G.L.: Graph partitioning by spectral rounding: Applications in image segmentation and clustering. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 1, pp. 1053–1060. IEEE (2006)
Liao, C.-S., Lu, K., Baym, M., Singh, R., Berger, B.: IsoRankN: Spectral methods for global alignment of multiple protein networks. Bioinformatics 25(12), 253–258 (2009)
Voevodski, K., Teng, S. -H., Xia, Y.: Finding local communities in protein networks. BMC Bioinform. 10(1), 1–14 (2009)
Zhou, S., Yang, X., Chang, Q.: Spatial clustering analysis of green economy based on knowledge graph. Journal of Intelligent & Fuzzy Systems (Preprint), 1–10 (2021)
Foysal, K.H., Chang, H.J., Bruess, F., Chong, J.W.: Smartfit: Smartphone application for garment fit detection. Electronics 10(1), 97 (2021)
Zhu, D., Shen, G., Chen, J., Zhou, W., Kong, X.: A higher-order motif-based spatiotemporal graph imputation approach for transportation networks. Wirel. Commun. Mob. Comput., 2022 (2022)
Spielman, D.A., Teng, S. -H.: Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, pp. 81–90 (2004)
Andersen, R., Chung, F., Lang, K.: Local graph partitioning using pagerank vectors. In: 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pp. 475–486. IEEE (2006)
Andersen, R., Peres, Y.: Finding sparse cuts locally using evolving sets. In: Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, pp. 235–244 (2009)
Spielman, D.A., Teng, S. -H.: A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM J. Comput. 42(1), 1–26 (2013)
Lovász, L., Simonovits, M.: The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In: Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science, pp. 346–354. IEEE (1990)
Lovász, L., Simonovits, M.: Random walks in a convex body and an improved volume algorithm. Random Struct. Algor. 4(4), 359–412 (1993)
Andersen, R., Chung, F.: Detecting sharp drops in pagerank and a simplified local partitioning algorithm. In: International Conference on Theory and Applications of Models of Computation, pp. 1–12. Springer (2007)
Chung, F.: The heat kernel as the pagerank of a graph. Proc. Natl. Acad. Sci. 104(50), 19735–19740 (2007)
Kloster, K., Gleich, D.F.: Heat kernel based community detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386–1395 (2014)
Li, P., Chien, I., Milenkovic, O.: Optimizing generalized pagerank methods for seed-expansion community detection. Adv. Neural Inf. Process. Syst., 32 (2019)
Wang, H., He, M., Wei, Z., Wang, S., Yuan, Y., Du, X., Wen, J.-R.: Approximate graph propagation. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 1686–1696 (2021)
Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the Web. Stanford InfoLab, Technical report (1999)
Chung, F., Simpson, O.: Solving linear systems with boundary conditions using heat kernel pagerank. In: International Workshop on Algorithms and Models for the Web-Graph, pp. 203–219. Springer (2013)
Yang, R., Xiao, X., Wei, Z., Bhowmick, S.S., Zhao, J., Li, R. -H.: Efficient estimation of heat kernel pagerank for local clustering. In: Proceedings of the 2019 International Conference on Management of Data, pp. 1339–1356 (2019)
Flake, G.W., Lawrence, S., Giles, C.L.: Efficient identification of web communities. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150–160 (2000)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Radicchi, F., Castellano, C., Cecconi, F., Loreto, V., Parisi, D.: Defining and identifying communities in networks. Proc. Nat. Acad. Sci. 101(9), 2658–2663 (2004)
Newman, M.E.: Modularity and community structure in networks. Proc. Nat. Acad. Sci. 103(23), 8577–8582 (2006)
Kobourov, S.G., Pupyrev, S., Simonetto, P.: Visualizing graphs as maps with contiguous regions. In: EuroVis (Short Papers) (2014)
Cheeger, J.: A lower bound for the smallest eigenvalue of the Laplacian. Probl. Anal. 625(195-199), 110 (1970)
Cox, I.J., Rao, S.B., Zhong, Y.: “ratio regions”: a technique for image segmentation. In: Proceedings of 13th International Conference on Pattern Recognition, vol. 2, pp. 557–564. IEEE (1996)
Sharon, E., Galun, M., Sharon, D., Basri, R., Brandt, A.: Hierarchy and adaptivity in segmenting visual scenes. Nature 442(7104), 810–813 (2006)
Yang, J., Leskovec, J.: Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. 42(1), 181–213 (2015)
Benson, A.R., Gleich, D.F., Leskovec, J.: Higher-order organization of complex networks. Science 353(6295), 163–166 (2016)
Tsourakakis, C.E., Pachocki, J., Mitzenmacher, M.: Scalable motif-aware graph clustering. In: Proceedings of the 26th International Conference on World Wide Web, pp. 1451–1460 (2017)
Yin, H., Benson, A.R., Leskovec, J., Gleich, D.F.: Local higher-order graph clustering. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 555–564 (2017)
Ma, W., Cai, L., He, T., Chen, L., Cao, Z., Li, R.: Local expansion and optimization for higher-order graph clustering. IEEE Internet Things J. 6(5), 8702–8713 (2019)
Huang, S., Li, Y., Bao, Z., Li, Z.: Towards efficient motif-based graph partitioning: An adaptive sampling approach. In: 2021 IEEE 37th International Conference on Data Engineering (ICDE), pp. 528–539. IEEE (2021)
Zhou, D., Zhang, S., Yildirim, M.Y., Alcorn, S., Tong, H., Davulcu, H., He, J.: High-order structure exploration on massive graphs: A local graph clustering perspective. ACM Trans. Knowl. Discov. Data (TKDD) 15(2), 1–26 (2021)
Chhabra, A., Faraj, M.F., Schulz, C.: Local motif clustering via (hyper) graph partitioning. arXiv:2205.06176 (2022)
Emmons, S., Kobourov, S., Gallant, M., Börner, K.: Analysis of network clustering algorithms and cluster quality metrics at scale. PLoS ONE 11(7), e0159161 (2016)
Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)
Meilă, M.: Comparing clusterings—an information based distance. J. Multivar. Anal. 98(5), 873–895 (2007)
Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010)
Avron, H., Horesh, L.: Community detection using time-dependent personalized pagerank. In: International Conference on Machine Learning, pp. 1795–1803. PMLR (2015)
Kloumann, I.M., Ugander, J., Kleinberg, J.: Block models and personalized pagerank. Proc. Natl. Acad. Sci. 114(1), 33–38 (2017)
Li, Y., Liu, J., Lin, G., Hou, Y., Mou, M., Zhang, J.: Gumbel-softmax-based optimization: a simple general framework for optimization problems on graphs. Comput. Soc. Netw. 8(1), 1–16 (2021)
Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: First steps. Soc. Netw. 5(2), 109–137 (1983)
Weiss, P.: L’hypothèse du champ moléculaire et la propriété ferromagnétique. J. Phys. Theor. Appl. 6(1), 661–690 (1907)
Klicpera, J., Weißenberger, S., Günnemann, S.: Diffusion improves graph learning. Advances in Neural Information Processing Systems, 32 (2019)
Berberidis, D., Nikolakopoulos, A.N., Giannakis, G.B.: Adaptive diffusions for scalable learning over graphs. IEEE Trans. Signal Process. 67(5), 1307–1321 (2018)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
Leskovec, J., Sosič, R.: SNAP: A general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. (TIST) 8(1), 1–20 (2016)
Getoor, L.: Link-based classification. In: Advanced Methods for Knowledge Discovery from Complex Data, pp. 189–207. Springer (2005)
Namata, G., London, B., Getoor, L., Huang, B., EDU, U.: Query-driven active surveying for collective classification. In: 10th International Workshop on Mining and Learning with Graphs, vol. 8, p. 1 (2012)
Acknowledgements
The author would like to thank Wang Hanzhi and Zhang Ruoqi for their selfless and solid technical support. This work is partially supported by the Fundamental Research Funds for the Central Universities (No.2020JS005).
Funding
This work is partially supported by the Fundamental Research Funds for the Central Universities (No.2020JS005).
Author information
Authors and Affiliations
Contributions
Yuan Zhe devised the methods and framework, wrote the whole manuscript text and prepared all materials.
Corresponding author
Ethics declarations
Human and Animal Ethics
Not applicable.
Ethics approval and consent to participate
Not applicable.
Consent for Publication
Not applicable.
Competing interests
The author has no relevant financial or non-financial interests to disclose.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: additional experiments
Data sources
We obtain the DBLP and Amazon datasets from the Stanford Network Analysis Project (SNAP) [59], and the rest from their original works [60, 61]. Table 4 presents the basic information of the datasets used in our experiments, and Figure 2 gives an overview of the conductances of the ground-truth clusters. The conductances of the labeled clusters are rather large, which causes the information-based metrics to conflict with the structure-based metrics, as we note in the following part.
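For concreteness, the conductance we discuss throughout can be sketched as follows: φ(S) = cut(S) / min(vol(S), vol(V∖S)). This is a minimal illustration on a toy graph; the adjacency-dict representation is ours for exposition, not the paper's actual data pipeline.

```python
def conductance(adj, cluster):
    """Conductance of a node set: cut edges divided by the smaller
    of the set's volume and its complement's volume.
    adj: {node: set of neighbors}; cluster: iterable of nodes."""
    cluster = set(cluster)
    cut = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    vol_s = sum(len(adj[u]) for u in cluster)
    vol_total = sum(len(adj[u]) for u in adj)
    denom = min(vol_s, vol_total - vol_s)
    return cut / denom if denom > 0 else 1.0

# Toy graph: two triangles joined by a single bridge edge (2-3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(conductance(adj, {0, 1, 2}))  # 1 cut edge / volume 7 ≈ 0.1429
```

A well-separated cluster (one triangle) cuts only the bridge edge, giving the low conductance a good local cluster should have.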
Competitor considerations
Since the effectiveness challenge has received little study and few works target the conductance metric as we do, there is no specific prior algorithm that serves as a direct competitor to LearnedNibble. Moreover, the work presented here does not aim to beat any baseline; rather, it reveals the capacity of the GPR measure family and explores how to realize it while remaining compatible with the mainstream approximate algorithms.
Comparisons
For GPR instances, we evaluate them by grid-searching a set of parameters with 2,000 trials each, which is also the training budget for LearnedNibble, and take the best performance as their clustering capacity. Specifically, we search α over [0, 1] with step 0.0005 for PPR, h over [1, 20] with step 0.01 for HKPR, and 𝜃 over [0, 1] with step 0.005 while varying the power of 𝜃, which determines ϕ, over {1, 5, 20, 50, 100} for IPR. For MEAN, we directly compute its exact conductance by the standard sweep operation. For GSO, we set a training budget of 200,000 since it has far more parameters to train.
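The PPR grid search above can be sketched as follows: for each α, compute the personalized PageRank vector from the seed and apply the standard sweep operation to obtain the best-prefix conductance. This is a toy-scale illustration using dense power iteration; the paper's actual evaluation uses the approximate propagation machinery and a far finer grid (step 0.0005).

```python
import numpy as np

def ppr_scores(A, seed, alpha, iters=500):
    """Personalized PageRank by power iteration: p = alpha*e_s + (1-alpha)*P^T p."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    P = A / deg[:, None]                 # row-stochastic transition matrix
    e = np.zeros(n); e[seed] = 1.0
    p = e.copy()
    for _ in range(iters):
        p = alpha * e + (1 - alpha) * P.T @ p
    return p

def sweep(A, scores):
    """Standard sweep: order nodes by degree-normalized score and return
    the minimum conductance over all proper prefixes."""
    deg = A.sum(axis=1)
    order = np.argsort(-scores / deg)
    vol_total = deg.sum()
    in_set = np.zeros(A.shape[0], dtype=bool)
    best, cut, vol_s = 1.0, 0.0, 0.0
    for u in order[:-1]:                 # skip the full set (empty complement)
        in_set[u] = True
        vol_s += deg[u]
        # adding u: gains deg(u)-k boundary edges, removes the k edges into the set
        cut += deg[u] - 2 * A[u, in_set].sum()
        best = min(best, cut / min(vol_s, vol_total - vol_s))
    return best

# Toy graph: two triangles joined by one bridge edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]:
    A[u, v] = A[v, u] = 1.0

best_alpha, best_phi = None, 1.0
for alpha in np.arange(0.05, 1.0, 0.05):   # coarse grid for illustration
    phi = sweep(A, ppr_scores(A, seed=0, alpha=alpha))
    if phi < best_phi:
        best_alpha, best_phi = alpha, phi
```

On this toy graph the sweep recovers the seed's triangle with conductance 1/7 for any reasonable α, which is exactly the "take the best performance over the grid" protocol described above.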
1.1 A.1 Training details
We give LearnedNibble full access to the graph adjacency matrix in the training phase but keep the algorithm local in the inference phase, like other computation-based graph local clustering algorithms. The reason the algorithm is not thoroughly local is twofold. 1) First, we should use the whole graph during training, since the topology is an integrated whole and cannot be sampled like data points in Euclidean space. 2) Second, we expect the framework to generalize well to the whole graph, which is the crucial property we rely on to develop the scalability and practicality of LearnedNibble; keeping the training phase local would conflict with this purpose.
For the trainable weighting parameters, we normalize the weight vector w to unit 1-norm, ||w||1 = 1, in the inference phase but leave it unconstrained in the training phase for the sake of numerical stability.
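This inference-time normalization is a simple rescaling, sketched below (a minimal illustration; the function name and example values are ours):

```python
import numpy as np

def normalize_weights(w):
    """Rescale the trained weight vector to unit 1-norm for inference.
    During training, w is left unconstrained for numerical stability."""
    return w / np.abs(w).sum()

w = np.array([0.5, -1.0, 2.5])   # unconstrained weights after training
w_hat = normalize_weights(w)     # now sums to 1 in absolute value
```

Normalizing only at inference keeps the training loss landscape free of the projection step while still making weight vectors comparable across runs.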
1.2 A.2 Clustering capacity details
Comparisons
Table 2 reports the average conductance over the 5 training seed nodes with the final model on each dataset. The first 4 columns are the GPR family instances and the trivial MEAN pooling operation. The GSO column represents the GSO framework [53]. The last column, titled GPR, is our LearnedNibble framework.
Results with approximation in detail
We report the results on the different datasets in turn and list them in Table 5.
1.3 A.3 Generalization ability details
Comparisons
To see more clearly, we report the generalization ability of our LearnedNibble framework against the competitors in two aspects. 1) In-Cluster: we run inference on nodes randomly selected from the same cluster as the training seed nodes, represented by the c columns in Table 3. 2) In-Graph: we run inference on nodes randomly selected from the whole graph, represented by the g columns in Table 3. We report the average conductance over the 50 testing nodes with the final model on each dataset.
Results with approximation
We report the results on the different datasets in turn, for both the in-cluster and in-graph settings not shown in Section 4, in Figure 3.
1.4 A.4 Parameter sensitivity
Initialization comparisons
We test the sensitivity to initialization by training our LearnedNibble framework from different starting weights. Specifically, we use the PPR weighting vector with teleport constant α = 0.1 to challenge our model, and the IPR weighting vector with 𝜃 = 0.99, ϕ = 0.99^10 for the IPR test. The comparison results on the different datasets are listed in Table 6. Training from different initializations achieves similar, though slightly different, performance. The trivial MEAN and RAW initializations perform a little better, and IPR, with its theoretical advantage, also performs well in some cases.
Regradient and locality regularization
We investigate the regradient technique proposed in this work through ablation experiments. At the same time, we test the performance of the locality regularization term popular in Graph Neural Networks (GNNs), which keeps the information diffusion local by minimizing the 2-norm of the difference between the graph signal after propagation and the initial signal, a one-hot vector in our setting, i.e., \(\|\mathrm{gpr} - \vec{1}_{s}\|_{2}\). Table 7 presents the results of both under the exact setting with 𝜖 = 0. The settings with R = 1 outperform their counterparts with R = 0, and the setting R = 1, L = 0, i.e., with the regradient technique and without the commonly used locality regularization, achieves the best performance in all situations.
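The locality regularization term tested above can be sketched as follows (an illustrative computation of the penalty alone; the function name, example vectors, and the weighting coefficient L are ours):

```python
import numpy as np

def locality_penalty(p, seed):
    """||p - e_s||_2: distance between the propagated graph signal p and
    the one-hot seed indicator e_s, i.e., the locality regularization term."""
    e = np.zeros_like(p)
    e[seed] = 1.0
    return float(np.linalg.norm(p - e))

# A fully localized signal incurs zero penalty; a diffused one does not.
p_local = np.array([1.0, 0.0, 0.0])
p_diffuse = np.array([0.4, 0.3, 0.3])

# In the ablation, L = 0 drops this term from the training loss entirely:
# loss = clustering_objective + L * locality_penalty(p, seed)
```

With L = 0 the penalty vanishes from the loss, which is the best-performing configuration reported in Table 7.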
Cite this article
Yuan, Z. Self-supervised end-to-end graph local clustering. World Wide Web 26, 1157–1179 (2023). https://doi.org/10.1007/s11280-022-01081-8